Operating System - HP-UX

Re: disk storage solution

 
SOLVED
Martha Mueller
Super Advisor

disk storage solution

On my N4000 with three 440 MHz CPUs and 4 GB of RAM, I am using Jamaica subsystems (product A3312AZ) attached to Fast/Wide/Differential controller cards (A4800A); the data transfer speed is 20 MB/sec. These disks are mirrored and hold database data.

I have two tempdb areas that are striped, not mirrored, and their data lives on a filesystem (vxfs) rather than on raw partitions like the rest of the data. This was Sybase's suggestion, so that the OS could help with some of the load, and it did improve throughput. The tempdb areas are each on separate controllers, with no other data, striped across two and four disks, respectively.

The disks are not busy, as shown by PerfView's BYDSK_UTIL metric, but the request queue, BYDSK_REQUEST_QUEUE, is constantly around 20. Alan Riggs had mentioned that the vxfs journal log has to be written at one location on the disk, so mechanical head movement may account for the long request queue. CPU usage is about 50%.

Does anyone else have an idea, and does anyone have suggestions to alleviate the bottleneck?

Thanks.

Martha
John Palmer
Honored Contributor

Re: disk storage solution

What vxfs mount options are you using? The tmplog and nolog options are available (see man mount_vxfs).

Regards,

John
Tim Malnati
Honored Contributor

Re: disk storage solution

What you have described is about as far as you can go with a Jamaica enclosure. More spindles in the stripe set could help, but I doubt you would gain much, particularly given the cost involved. It sounds like time to start considering an array where some hardware cache can help things. Avoid the Model 12, though (it's slower, if anything).
Martha Mueller
Super Advisor

Re: disk storage solution

John,

/dev/volumegroup/logicalvolume /tmpdbdata vxfs rw,suid,largefiles,convosync=delay,mincache=tmpcache 0 2

I hope this displays properly, but it should be all on one line.

Thanks.

Martha
Martha Mueller
Super Advisor

Re: disk storage solution

Tim,

Would you have any recommendations? I would probably like to stick with SCSI, due to the cost of Fibre Channel.

Thanks.

Martha
Tim Malnati
Honored Contributor

Re: disk storage solution

I'm an EMC guy, but I doubt you want to spend that kind of money. They are just not very cost-effective until you have significant storage needs. The same is true of the XP256, although the initial frame cost is less. Avoid the Model 12H; it's slower than the Jamaicas. I've always liked the Model 20 array in SCSI, but the information I have is more than a year old. Maybe HP has upgraded this array to improve the connection throughput, or maybe HP has come out with something else to handle 80 MB/s (Ultra) drives more effectively. Your sales channel is probably the best bet.
Stefan Farrelly
Honored Contributor

Re: disk storage solution


What speed are the disks in your Jamaica? They can vary by a huge amount. Use ioscan -fknC disk to find out. The newer models have much larger caches and are considerably faster. Here is a table of the different Jamaica disks:

Disk Product ID   Size   Speed(1)
=================================
ST15150W          4 GB   ~6.5 MB/s
ST34371W          4 GB   ~8.5 MB/s
ST34572WC         4 GB   ~10 MB/s
ST34573WC         4 GB   ~14 MB/s

ST19171           9 GB   ~10 MB/s
ST39173WC         9 GB   ~14.5 MB/s
ST39175LC         9 GB   ~18 MB/s

(1) using time dd if=/dev/rdsk/xxx

As you can see some of the newer models are massively faster than the older ones. We just replaced some of ours here with the 18 MB/s 9Gb models and the performance on our striped lvols increased wonderfully!
Im from Palmerston North, New Zealand, but somehow ended up in London...
Martha Mueller
Super Advisor

Re: disk storage solution

Stefan,

That is very important information. I have a mixture of disks in one of the tempdb areas: the slowest is ST34572 at 10 MB/s. Since this is a striped logical volume, can I assume that the entire logical volume can only write at this slowest disk's speed?
Stefan Farrelly
Honored Contributor
Solution

Re: disk storage solution


Hi Martha,

Yes, indeed: any striped lvol will be constrained by the slowest disk in the stripe set. Try using pvmove online to move any high-usage lvols to the fastest disks.
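A back-of-the-envelope way to see the effect, sketched in Python (the per-disk speeds are the ones from the disk table earlier in the thread; the N x min() rule is the simplifying assumption here, since LVM stripes round-robin and each full stripe completes at the pace of the slowest member):

```python
# Simple model of a striped lvol (an assumption, not a measurement):
# chunks go round-robin to every disk, so sequential throughput is
# the number of disks times the speed of the slowest one.

def stripe_throughput(disk_speeds_mb_s):
    """Estimated sequential MB/s of a striped lvol."""
    return len(disk_speeds_mb_s) * min(disk_speeds_mb_s)

# Speeds (MB/s) from the disk table earlier in the thread:
mixed = [10, 14.5, 18, 18]   # one slow ST34572WC in the set
fast = [18, 18, 18, 18]      # all ST39175LC

print(stripe_throughput(mixed))  # 40
print(stripe_throughput(fast))   # 72
```

By this estimate, swapping the one slow disk out of a four-way stripe would nearly double the sequential throughput of the whole lvol.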

If you have a good HP VAR, they may swap your disks for faster ones at a not-too-expensive price, because they can reuse yours in part exchange. It's certainly cheaper than buying and configuring a new disk subsystem. Also, whenever we lose a Jamaica disk, I always ask for the fastest model as a replacement. It has worked so far, and it's free!

Cheers,

Stefan
Im from Palmerston North, New Zealand, but somehow ended up in London...
Martha Mueller
Super Advisor

Re: disk storage solution

Stefan,

This is very enlightening. Would this explain the phenomenon of the long request queue while the disk is not busy? Or do I need to look further for that answer?

Thanks.

Martha
Alan Riggs
Honored Contributor

Re: disk storage solution

Hi Martha.

The performance drop from the slowest disk in the volume will affect spin time and data transfer, so it might be a contributing factor. I would be surprised, though, if it accounted for the full problem. I have a couple of questions:

1) do all disks in the 4 disk stripe show similar UTIL/QUEUE pattern?
2) does sar -d report a similar utilization pattern? (sar queries different structures than the midaemon does.)
3) What are the BYDSK_AVG_SERVICE_TIMEs for the disks? Is there much divergence between the 4 disks?
4) What is the BYDSK_PHYS_IO_RATE for each disk?

Unfortunately, there are seldom quick and easy answers to these types of performance issues. But the above questions might help pin something down.

BTW: please say hello to Sandy, Bob, et al. for me. I hope you are all doing well.
Martha Mueller
Super Advisor

Re: disk storage solution

Hi, Alan, nice to hear from you.

1. They are identical.

2. sar -d shows that the disks' activity patterns are similar to each other, but not necessarily similar to those shown by PerfView. The average queue length from sar on every disk, not just the ones in question, is 0.50. The percent busy shown by sar for all four disks was lower than that shown by PerfView, but I don't have a good sar sample...I only ran it for a few minutes.

3. The BYDSK_AVG_SERVICE_TIME values shown by GlancePlus are about 2.2 msec, but, again, I don't have a long collection time. This metric isn't available from PerfView. The four disks are within 0.5 msec of each other.

4. BYDSK_PHYS_IO_RATE is averaging around 4 requests per second, with gusts up to 25.
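For what it's worth, those figures don't square with a queue of 20 under Little's Law (average queue length = arrival rate x time in system). A quick sketch in Python using the numbers quoted above (treating the average service time as the full time in system, which is the assumption here):

```python
# Little's Law sanity check: L = arrival rate x time in system.
# Figures are the ones quoted in this thread, not fresh measurements:
#   4 physical I/Os per second (BYDSK_PHYS_IO_RATE), bursting to 25
#   2.2 ms average service time (BYDSK_AVG_SERVICE_TIME)

io_rate = 4.0            # requests per second
burst_rate = 25.0        # peak requests per second
service_time = 2.2e-3    # seconds per request

steady = io_rate * service_time
burst = burst_rate * service_time

print(f"expected queue (steady): {steady:.4f}")  # 0.0088
print(f"expected queue (burst):  {burst:.4f}")   # 0.0550
```

Even at the burst rate the implied queue is well under 1, and the implied utilization (about 0.9%) matches the "disk not busy" reading, so a sustained queue of 20 would have to be forming somewhere upstream of the disk's own service, which would fit Alan's earlier point about serialized vxfs journal-log writes.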

Alan Riggs
Honored Contributor

Re: disk storage solution

hmmm . . . what about cache hit ratio for reads and writes?
Martha Mueller
Super Advisor

Re: disk storage solution

sar -b shows %rcache around 99 and %wcache around 90.
Alan Riggs
Honored Contributor

Re: disk storage solution

See, I told you this wouldn't be simple. I am a bit swamped ATM, but I will do some poking through a reference or two. Probably won't have anything meaningful before Monday -- maybe someone else can come up with something quicker. There are some smart folks hanging around here some days.
Martha Mueller
Super Advisor

Re: disk storage solution

I agree with both points...Monday and the VERY smart people out there that are generously sharing knowledge. I have been spending most of the week just going through the answers on this forum and jotting down notes.
Dragan Krnic
Frequent Advisor

Re: disk storage solution

I would like to comment in general on what I have observed about disks under HP-UX:

RAIDs are nonsense. They are supposed to reduce the number of disks, as if disks were expensive. Most often the cost of the RAID controller is even higher than the value of the disks it was supposed to save. In a recent offer for 140 GB net, a Model 12 with five 36 GB disks at RAID level 5 spares three disks versus straight mirroring (which can be accomplished with Mirror/UX). The saving of three disks ($7,150) was more than offset by the cost of the controller ($13,400). And for this higher price one gets significantly lower performance.

RAID controllers perforce introduce moderate to severe latencies and are never nearly as fast as Mirror/UX.

My advice is: keep away from RAIDs in general. Just a bunch of the fastest mirrored disks available, professionally installed in a decently redundant enclosure (multiple power supplies, multiple fans, multiple SCSI cables), saves a lot of dough and increases performance. For legacy F/W differential systems my supplier uses an SE/Diff adapter (a passive device) which converts off-the-shelf UW and LVD drives to F/W differential (the HP way). My most recent enclosure was nine 73 GB disks, cyclically stripe-mirrored for a net capacity of 315 GB, at a total price of $16,000. The new disks just fly.

Unfortunately, many system bottlenecks are psychological. IT managers tend to build up their status based on the dollars invested in the hardware they manage, with scant concern for performance or sound cost/benefit analysis.
Stefan Farrelly
Honored Contributor

Re: disk storage solution


Hi Martha,

A lot has happened since I went home Friday. Have you made any progress? My only comment is that I think the different-speed disks will make a big difference; I have seen this happen before. The next step is to move your critical lvols to a stripe set made entirely of same-speed, fastest disks, then see how that improves things.

I'm afraid I have to disagree with Dragan. Here at HP we've got our Nike RAID disks performing much faster than the Jamaicas. The hundreds of MB of cache make a big difference, especially to write performance. For the extra money you get dual pathing, protecting against controller failure, and RAID, so more space for your money. We do have slightly more failures, but this is temperature related; keep them cool and their failure rate is the same as any other hardware.
Im from Palmerston North, New Zealand, but somehow ended up in London...
Anthony deRito
Respected Contributor

Re: disk storage solution

Hello Martha, just to cover your bases, have you tried to approach this problem from a different angle? Take a look at the process actually doing all the work. Use Glance and check out the wait states the process is consuming. Check to see whether the process is able to work harder than it is. Look at things like priority, IPC time delays, and local network traffic. Is your database using the local UNIX domain protocol or the high-overhead TCP/IP to do its local database access? (This is a common problem with local database access.)

It seems your buffer cache hit rates are very good but don't let this fool you. I've seen stranger things.

You say CPU usage is about 50%. If the %wio value from sar -u is not as high as you would expect from an I/O bottleneck, then SOMETHING else has to be consuming the resources. (I know, that was a general statement.) This approach is also valid if the time spent in system mode is higher than expected.

Tony
Martha Mueller
Super Advisor

Re: disk storage solution

Thanks for the input. I am looking for just that kind of real-life experience.

Stefan, nothing has changed here. I will need to schedule downtime to make changes...this is a production server. But it sounds like it should produce some good results.
Dragan Krnic
Frequent Advisor

Re: disk storage solution

Hi Stefan, I'm glad you think I'm wrong, so perhaps we can both do something to set the record straight. I've done some tests and timed them, when nobody was around to tweak the results.

There are two sets of tests:

- linear reads of 128 MB from 1, 2, ... 9 disks at the same time (col. 1: time in seconds to finish the task; col. 2: throughput in MB/s)

- random seeks with single 8 KB reads by 1, 2, ... 10 processes (col. 3: average ms per 8 KB read; col. 4: throughput in MB/s):

s/128MB    MB/s   ms/8KB   MB/s
===============================
  3.77    34.77     9.48   0.80
  3.77    69.53    10.02   1.60
  5.77    76.01    10.18   2.36
  6.89    76.07    10.44   3.07
  8.60    76.20    11.03   3.63
 10.32    76.20    11.64   4.12
 12.02    76.31    12.16   4.60
 13.73    76.35    12.75   5.02
 14.10    75.55    13.52   5.33
                   14.28   5.60

The serial reads of 128 MB from 1, 2, ... 9 drives show that three disks are enough to saturate the single LVD bus. With 9 disks the overhead is high enough to stifle the throughput rate.

Random seeks with reads of 8 KB blocks are done by 1, 2, ... 10 background processes started simultaneously. The randomization is seeded individually for each process. The seeks extend over all 314.9 GB of a mirrored raw volume.

I enclose the little C program that I used for the tests. It measures time with the help of HP's get16cr.s assembly routine, which reads the time in ticks for greater precision. If you don't have it, you can either read the time with ftime() or I'll send you the assembly code by email.
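(The attached C source isn't reproduced in this archive. As a rough illustration only, and not Dragan's actual program, the random-seek test might be sketched in Python, with os.pread and time.perf_counter standing in for the raw reads and the get16cr.s timer; the path and read count below are placeholders:)

```python
# Rough analogue of the random-seek test above: N single 8 KB reads
# at random offsets across a large file or raw device, timed end to end.
import os
import random
import time

BLOCK = 8 * 1024  # 8 KB reads, as in the test above

def random_read_test(path, reads=1000):
    """Return the average milliseconds per random 8 KB read (cf. column 3)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        start = time.perf_counter()
        for _ in range(reads):
            os.pread(fd, BLOCK, random.randrange(0, max(size - BLOCK, 1)))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed * 1000.0 / reads
```

To reproduce the 1, 2, ... 10-process series, start several of these in parallel processes against the same volume, each with its own random seed.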

Please do the tests and time the results, so that everyone can see whether a Nike RAID is at least as good as, if not better than, a JBOD.
Dragan Krnic
Frequent Advisor

Re: disk storage solution

The forum editor coalesced all the spacing I used to keep the columns aligned. Never mind; here is the time series for the serial reads:

3.77, 3.77, 5.77, 6.89, 8.60, 10.32, 12.02, 13.73, 14.10

and the series for random seeks-cum-8KB-reads:

9.48, 10.02, 10.18, 10.44, 11.03, 11.64, 12.16, 12.75, 13.52, 14.28

The other two columns are derived values. For the serial reads, multiply 128 MB by the number of disks and divide by the duration in seconds to get the throughput in MB/s. For the random seeks-and-reads, multiply the reciprocal of the time in ms by 8 KB and by the number of simultaneous processes to arrive at the throughput.

Attached is the assembly routine for timing.

I would be much obliged, Stefan, if you can show me that a modern RAID from HP is at least as good as a JBOD.
Martha Mueller
Super Advisor

Re: disk storage solution

Stefan, it would be nice if you were able to give us a head-to-head comparison of these two storage solutions. Or perhaps there is already some document comparing storage solutions at this level. The only things I have seen are the sales pitches, and they are suspect at best. For example, the product briefs for the HVD10 and the SC10 both claim to be "HP's lowest cost-per-megabyte non-RAID storage solution." The brief for the HVD10 goes on to say that you can replace the HVD Bus Control Cards with SC10 Ultra2 Bus Control Cards, but it is difficult to wade through all this stuff to get to the real performance information.
Martha Mueller
Super Advisor

Re: disk storage solution

Tony, I don't understand your question about the database using the local UNIX domain protocol or TCP/IP to do its local database access. I am not a DBA, so perhaps this is a question I need to pass along. How can I tell which protocol the Sybase database is using? This tempdb area is using a filesystem rather than a raw partition; is that the answer you are looking for? Glance can only show the processes for the entire dataserver, not the individual activities within the database. The wait state for the three unix dataserver processes is usually sleep. I don't know how to look for IPC time delays - run ipcs -at and check OTIME and CTIME for what? Local network traffic doesn't seem to be a problem.

The sar -u %wio number is generally less than 5, and I agree, there is a mystery here. PerfView shows that system cpu usage is averaging 17% whereas user cpu usage is 11%. These numbers shifted when we upgraded sybase from 11.0.3 to 11.9.2. Sybase has gone to a continuous polling method instead of interrupt driven. And Sybase has told us that cpu usage of up to 80% would be considered acceptable.

The only disks showing this long queue length with a low utilization level are the two tempdb areas, and those areas are very heavily used by Sybase. I am stumped as to where the delay is occurring. But Stefan has pointed out some interesting possibilities.