
How can I improve my I/O performance ?

 
SOLVED
Luis Toro
Regular Advisor

How can I improve my I/O performance ?

We are benchmarking a UNIDATA application on an RP8420, using various hardware configs of CPU/RAM (16/16, 24/24, and 32/32). What we're seeing is an I/O bottleneck. CPU utilization is low and memory utilization peaks at 35% (our max buffer cache percentage is set to 30%, but in each test the actual amount of memory for BC increases, since we've gone from 16 to 24 to 32 GB). Disk utilization is at 100%.

Storage sits on an EMC DMX3000. We're using 34GB striped metavolumes, with PowerPath across 4 HBAs. The UNIDATA application/database resides in a 145GB filesystem, which is in a volume group with 5 of the metas. What we see is that one of the LUNs is showing (from sar -d) 98% busy, avque as high as 140, r+w/s of 217, blks/s of 3000-4000, avwait as high as 680, and avserv as high as 74.

We are thinking about doing LVM striping across the 5 metas (EMC does not recommend it), but my real question is: have we done all we can to improve I/O? Is it a bottleneck on the EMC side (the EMC admin states that the I/Os are being serviced in a timely manner: 5-10 ms)? And finally, given that this is a UNIDATA app (archaic architecture), is this a case of "it is what it is"?
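For reference, the numbers above came from sar's per-device report, along these lines (the interval and count here are just illustrative):

sar -d 5 60     # sample disk activity every 5 seconds, 60 times
                # columns: device  %busy  avque  r+w/s  blks/s  avwait  avserv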
Thank you
8 REPLIES
Luis Toro
Regular Advisor

Re: How can I improve my I/O performance ?

Almost forgot:
The sar output shows statistics for each device path (there are 4 paths to the EMC metas). The metrics I reported are based on one path. Does this paint an accurate picture of what's going on for that device? Or do I have to take an average, or aggregate, of the 4 paths that correspond to the single metavolume?
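In case it helps, this is the sort of quick awk I could run over the sar output to aggregate the paths of one meta (the device names below are only examples; I'd substitute the four paths that map to a single metavolume):

sar -d 5 60 | awk '
    /c19t3d2|c21t10d2|c23t5d6|c25t12d6/ {   # example paths -- substitute your own
        rw  += $(NF-3)                      # r+w/s column, counted from the right
        blk += $(NF-2)                      # blks/s column
        n++
    }
    END { if (n) printf "avg r+w/s per path: %.1f  approx 4-path aggregate: %.1f  avg blks/s per path: %.1f\n",
                        rw/n, 4*rw/n, blk/n }'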

Keith Bryson
Honored Contributor

Re: How can I improve my I/O performance ?

Hi Luis

The first thing I would definitely do is reduce your buffer cache. In most instances I have seen a performance hit with a BC higher than 800MB!! Yours is running into GBs now. 600-800MB is fine; higher than that and it really begins to (at least) impact the OS and the OS command set (cp, find, etc.). Set dbc_min_pct and dbc_max_pct to 1 (320MB) and 2 (640MB) respectively using SAM kernel tuning (or whatever...).
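If you'd rather do it from the command line, something like this (kmtune on 11i; depending on your release the changes may need a kernel build and reboot to take effect):

kmtune | grep dbc_           # current buffer cache settings
kmtune -s dbc_min_pct=1      # floor - roughly 320MB on a 32GB box
kmtune -s dbc_max_pct=2      # ceiling - roughly 640MB on a 32GB box
mk_kernel && kmupdate        # then reboot, if the parameters are not dynamic on your release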

Also check that you are alternating paths for each LUN, rather than pushing data down the same controller.

What is on the busy LUN? Check /etc/fstab for the mount options for this filesystem too.
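A couple of quick checks along those lines (the grep pattern is just a placeholder for your actual mount point):

powermt display dev=all    # every path should be alive, with I/O queued across all 4 HBAs
mount -v                   # shows the options each filesystem is actually mounted with
grep unidata /etc/fstab    # substitute a pattern that matches the real mount point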

Looks like you may need to post a little more detail on the forum.

Best regards - Keith
Arse-cover at all costs
Vincent Fleming
Honored Contributor

Re: How can I improve my I/O performance ?

Since your avwait and avque are very high, I would suggest changing your scsi_max_qdepth kernel tunable (either via SAM or kmtune(1M)) to a smaller value - say, 2.

Large values for queue depth cause strange values in sar... your service/wait times will be accurate when you lower this value. This is because the timer is set when the I/O is placed on the queue. Shorter queue, more accurate service times.
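For example, system-wide via kmtune, or per device with scsictl if you only want to experiment on the busy LUN (the device file below is only an example):

kmtune -q scsi_max_qdepth                     # current system-wide default
kmtune -s scsi_max_qdepth=2                   # lower it system-wide
scsictl -m queue_depth=2 /dev/rdsk/c19t3d2    # or per device -- example device file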

It sounds to me like you need to do something about where your data is residing. You don't say, but can you separate your indices, tablespaces, and redo logs in UNIDATA? I've often seen performance like this when redo logs and tablespaces are on the same physical disks.

If you can, do it. Place them all on different physical disks (or sets of disks, with the metavolumes) - you may be having a head contention issue (i.e., thrashing the heads on the disks). Yes, it's a cached disk array, but you can still have head contention issues.

Good luck,

Vince
No matter where you go, there you are.
Luis Toro
Regular Advisor

Re: How can I improve my I/O performance ?

Thanks for your input. Here's some more info:
PowerPath is controlling the paths to the LUNs, so the I/Os should be balanced across 4 fiber HBAs. The busy LUN is one of 5 LUNs in a volume group, which contains only one (145GB) filesystem. It is a JFS with the following options: rw,suid,delaylog,largefiles,datainlog 0 2
My scsi_max_qdepth is 32.
Unfortunately, UNIDATA is not your typical database (with redo logs, indices, etc.); it is a collection of flat files in one directory. That's why we're benchmarking on metavolumes, to [hopefully] spread the I/Os.
I attached the sar -d output. Note that each LUN shows up 4 times.
I will re-run the test applying your feedback to see if it helps. This process currently runs for 2 days, so I may not post my findings/points until next week.
Thanks again
Vincent Fleming
Honored Contributor
Solution

Re: How can I improve my I/O performance ?

Hmm... according to your sar, the devices
c19t3d2
c21t10d2
c23t5d6
c25t12d6

peak at around 1,000 IOPS, aggregate. If this is the same volume, that's not so hot. If that's 2 different LUNs, well... 500 IOPS per LUN stinks.

You should try the LVM striping - by spreading the load over more LUNs (ie: Metas), you could get more throughput, if you don't thrash the heads. The idea is to spread your I/O over as many disks as you can manage. With the EMC, this will also translate into more cache being allocated (they do cache per LUN/Meta).

Use a 1MB or larger stripe size. This seems to keep the LVM overhead to a minimum and allow large volume sizes.
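As a sketch (VG/LV names and the size are placeholders, and note that -I is in KB - check that your release accepts 1024 there):

lvcreate -i 5 -I 1024 -L 148480 -n lvol_unidata /dev/vg_unidata   # stripe across the 5 metas, ~145GB
newfs -F vxfs -o largefiles /dev/vg_unidata/rlvol_unidata         # then build the filesystem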

If you have more physical disks in the array that you can throw at it, I would do so.

The general idea is to spread your load out on the back-end of the EMC. Be sure that you stripe over disks/volumes that are on different disk directors (use all you have) to prevent overloading a director and causing a bottleneck there.

RAID-1 is faster than RAID-5 or RAID-S. In fact, don't even THINK about using RAID-S.

Let us know how you make out.

Cheers,

Vince
No matter where you go, there you are.
Luis Toro
Regular Advisor

Re: How can I improve my I/O performance ?

I apologize for the delay, but we've been (and still are) benchmarking various scenarios. Here's an update:
Our storage person took a look at the performance on the EMC DMX3000 and commented that the I/Os were 60% writes and 40% random reads (from the same filesystem). The excessive random reads were bottlenecking the I/O process. As a result we went from four 5-member RAID-5 metavolumes to one 20-member RAID-5 meta. That improved performance substantially. The other recommendation was to use mirrored devices instead of RAID-5. That cut the process down from 2 days to 1.

One thing we also tried was to cut down on the buffer cache percentage. It was always set to 30% max, and we were testing various memory configs (16/32/64 GB). I cut it down to 2% (i.e., about 1GB max) per Keith's advice, and performance was terrible; after 2 days, the process was only halfway done. So now we're testing various mirrored configs (with/without striping) and buffer cache settings.
Thanks
Keith Bryson
Honored Contributor

Re: How can I improve my I/O performance ?

Luis

As you must have a high buffer cache for this UNIDATA app, some OS commands WILL run slower (I assume when you shrunk the BC your process was 'waiting cache' for a large percentage of its time). You may also want to check whether you can get further performance improvements by adjusting the mount options and removing the reliance on BC completely:

convosync=direct
mincache=direct
nodatainlog
delaylog

(see 'man mount_vxfs')
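For example, the /etc/fstab entry might end up looking something like this (device and mount point are placeholders; I've kept largefiles from your current options). Note that mincache=direct and convosync=direct generally require the OnlineJFS product:

/dev/vg_unidata/lvol_unidata /unidata vxfs rw,suid,largefiles,nodatainlog,delaylog,mincache=direct,convosync=direct 0 2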

This can have a mostly positive effect with Oracle; I don't know if UNIDATA is comparable. Just an idea if you want to add a few further permutations to your tests.

All the best - Keith
Arse-cover at all costs
Stew McLeod
Occasional Advisor

Re: How can I improve my I/O performance ?

I am starting to suspect that when using PowerPath, if you want to change the SCSI queue depth for devices, you have to change both the OS value and a value for PowerPath:

powermt set write_throttle_queue=queue_depth# [class=symm] [dev=path|device|all]


From powermt(1), April 2004:

Note: This command is not supported on Linux.

powermt set write_throttle_queue sets the write throttling queue depths for a storage system port connected to a specified device. The queue depth setting limits the number of writes to all devices enabled for write throttling which can be outstanding (from PowerPath's perspective) on the storage system port. The queues are allocated within PowerPath, one per storage system port.
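So on the HP-UX side, my understanding is it would look something like this (the queue value is only an example - check your PowerPath version's powermt(1) for what it supports):

powermt set write_throttle=on class=symm dev=all         # enable write throttling for the Symm devices
powermt set write_throttle_queue=32 class=symm dev=all   # example queue depth
powermt display dev=all                                  # verify the paths and settings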