05-30-2006 03:22 AM
Disk I/O Performance Problem
I have an interesting problem today (well, I've probably had it for a while, but only noticed it recently). By the way, sorry this is so long; it's complicated to explain. We have a file that is about 6.5GB in size and a number of servers attached to an EMC DMX 1000 array through Brocade switches. Transferring the file from one server to another takes approximately 20 minutes. We ran this test several times on several different servers (Itanium and PA-RISC, HP-UX 11.11 and 11.23). We have already verified the network settings; everything is set to and running at 1Gb.
Here is where things start to get interesting. In all of our original tests the source server hosted the file on the SAN and the destination was another host with the target directory also on the SAN. Someone in our group suggested using the boot disk (bad practice, but these are all just tests). Going from a source filesystem on the SAN to a destination filesystem on the local boot disk of the target server cut our time in half. When we ftp'd from a server with the source file on its local boot disk to another server with the destination filesystem on its local boot disk, the transfer completed in 3 minutes. So we are ruling out the network as the bottleneck and investigating this as a disk I/O bottleneck.
At various points in this process we did fire up good ol' Glance and found that the disks we were writing to (we never looked at the disks we were reading from) were busy 98% of the time with 9 or more writes queued, and we saw high-water marks of over 80 writes queued (by visual observation over the period, I would estimate we averaged about 40 writes queued). This was with largefiles and all other filesystem mount options at their defaults. We did switch to largefiles, nodatainlog, mincache=direct, convosync=direct, and that did stop the write queuing, but it didn't improve performance.
One last thing: even copying the file from one directory on the SAN to another directory (on the same host) that is also on the SAN takes almost 8 minutes. From all other standpoints our performance is acceptable, but something seems to be really wrong in our I/O subsystem, and we'd like to get that fixed before our workload changes and this kind of performance is no longer acceptable. Thoughts, for points?
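For reference, those transfer times translate into effective throughput as follows (a quick awk conversion; 6.5 GB is taken as 6656 MB, and the ~10 minute SAN-to-local figure is inferred from "cut our time in half"):

```shell
# Effective throughput implied by the transfer times above (6.5 GB ~= 6656 MB).
# The ~10 min SAN -> local figure is inferred from "cut our time in half".
awk 'BEGIN {
  mb = 6.5 * 1024
  printf "SAN   -> SAN:   %5.1f MB/s (20 min)\n",  mb / (20 * 60)
  printf "SAN   -> local: %5.1f MB/s (~10 min)\n", mb / (10 * 60)
  printf "local -> local: %5.1f MB/s (3 min)\n",   mb / (3 * 60)
}'
# -> roughly 5.5, 11.1 and 37.0 MB/s respectively
```

Even the best case here is well under what a single 1Gb link can carry, which is consistent with the bottleneck being on the disk side.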
05-30-2006 03:52 AM
Re: Disk I/O Performance Problem
You should note that the minimum possible transfer time for a 6.5GByte file over a 1Gbit/sec pipe is ~65 seconds; in the real world you will get nowhere near that. If both the reads and writes go over the same I/O channel, then the minimum time at least doubles. At first blush (because of your local-disk data), it appears that you don't have enough I/O paths to your disk array.
I should also add that if you happen to be running small UNIX buffer caches (even after reverting to the conventional mount options), that would have almost the same effect as bypassing the buffer cache -- so check your buffer cache sizes. I find that on 11.11 and up, buffer caches in the 800-1600MB range do nicely.
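The ~65 second floor follows from assuming roughly 10 wire bits per payload byte (8b/10b encoding plus protocol overhead), which is a common rule of thumb for gigabit links:

```shell
# Floor on the transfer time for 6.5 GB over a 1 Gbit/s link, assuming
# ~10 wire bits per payload byte (8b/10b encoding plus protocol overhead).
awk 'BEGIN {
  bytes    = 6.5e9
  wirebits = bytes * 10
  printf "minimum: ~%.0f seconds\n", wirebits / 1e9
}'
# -> ~65 seconds
```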
05-30-2006 04:26 AM
Re: Disk I/O Performance Problem
As a first step, time a raw sequential read from the array, e.g.: timex dd if=/dev/rdsk/c5t3d12 of=/dev/null bs=1024k count=1000
In the above case, you can even use a disk that is part of a VG with a mounted filesystem since this is a read test only.
If you have an unused LUN, you could also repeat the dd test but this time using /dev/zero as the input file and your unused LUN as the output file.
I would also repeat the disk read test but this time using a 1GB (or larger) fully cooked file for input. This will give you some indication of the filesystem overhead.
The other thing to look at is the RAID level you are running; RAID5 typically takes a 3x-7x performance hit as compared to RAID 1/0 for writes; this in itself would easily explain why local disks run faster.
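A portable sketch of the same dd timing idea, using a scratch file so it can be tried without a spare LUN (for the real measurement, substitute a raw device path such as /dev/rdsk/c5t3d12 and use timex as above; note `date +%s` may not exist on older HP-UX date implementations, which is why timex is the better tool there):

```shell
# Sketch of the dd read-timing test against a scratch file; substitute a
# raw device path (e.g. /dev/rdsk/c5t3d12) and timex for the real test.
SCRATCH=/tmp/ddtest.$$
dd if=/dev/zero of="$SCRATCH" bs=1024k count=64 2>/dev/null  # 64 MB test file

t0=$(date +%s)
dd if="$SCRATCH" of=/dev/null bs=1024k 2>/dev/null           # sequential read
t1=$(date +%s)

elapsed=$((t1 - t0))
awk -v s="$elapsed" 'BEGIN {
  if (s > 0) printf "read: %d s, %.0f MB/s\n", s, 64 / s
  else       print  "read finished in under a second"
}'
rm -f "$SCRATCH"
```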
05-30-2006 04:27 AM
Re: Disk I/O Performance Problem
Our I/O pipes to the array are 2Gbit/sec, so it doesn't seem like an ftp over a 1Gbit/sec network pipe would saturate them. We aren't running any load-balancing software (PowerPath, SecurePath, etc.), so more I/O paths probably won't help. One of the servers we tested on isn't being used at all, so the only I/O to its SAN disks was from our ftp and cp tests.
We keep our buffer cache size between 700 and 900MB. We also monitored it and never saw it go over 66% utilization.
Any more thoughts?
05-30-2006 05:24 AM
Re: Disk I/O Performance Problem
RAID 5? Or worse, EMC's Parity RAID?
We do RAID 10 striped across all the disks in the frame.
The reason local disks are faster is probably that they are mirrored.
Also, how much cache is in your frame?
Rgds...Geoff
05-30-2006 06:08 AM
Re: Disk I/O Performance Problem
RAID S 3+1
STK D280 - not sure, we believe this is just a fixed amount for this type of frame and not upgradable - 1GB per controller?
RAID 5
Yes, the local disks are mirrored - but really, could that account for a 6x difference in time? Maybe we should switch back to all local disk if we are taking that much of a performance hit.
One of the systems we were testing on has a connection to both the EMC array and an STK D280 array. Since I was going to do some timing tests as suggested by Clay, I figured I would test the EMC, the STK, and my local disks - here are the read/write results (note: I didn't have any open LUNs, so I just did the write tests to a file):
du -k big_dump.dmp
6691362 big_dump.dmp
Raw EMC Read
timex dd if=/dev/rdsk/c4t7d6 of=/dev/null bs=1024k count=1000
1000+0 records in
1000+0 records out
real 35.40
user 0.00
sys 0.03
Raw STK Read
timex dd if=/dev/rdsk/c22t2d0 of=/dev/null bs=1024k count=1000
1000+0 records in
1000+0 records out
real 6.40
user 0.00
sys 0.03
LVM EMC Read
timex dd if=/junk/big_dump.dmp of=/dev/null bs=1024k count=1000
1000+0 records in
1000+0 records out
real 46.09
user 0.00
sys 0.21
LVM STK Read
timex dd if=/u089/big_dump.dmp of=/dev/null bs=1024k count=1000
1000+0 records in
1000+0 records out
real 6.46
user 0.00
sys 0.27
LVM Local SCSI Read
timex dd if=/test/big_dump.dmp of=/dev/null bs=1024k count=1000
1000+0 records in
1000+0 records out
real 30.29
user 0.00
sys 0.22
LVM EMC Write
timex dd if=/dev/zero of=/junk/tstfile.dd bs=1024k count=1000
1000+0 records in
1000+0 records out
real 2:33.67
user 0.00
sys 0.68
LVM STK Write
timex dd if=/dev/zero of=/u089/tstfile.dd bs=1024k count=1000
1000+0 records in
1000+0 records out
real 11.71
user 0.00
sys 0.69
LVM SCSI Write
timex dd if=/dev/zero of=/usr/tstfile.dd bs=1024k count=1000
1000+0 records in
1000+0 records out
real 36.65
user 0.00
sys 0.66
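Converting those timex numbers into throughput (each test moves 1000 MB; 2:33.67 is 153.67 seconds) makes the gap stark:

```shell
# MB/s for each 1000 MB dd test above (2:33.67 = 153.67 s).
awk 'BEGIN {
  n = split("35.40 6.40 46.09 6.46 30.29 153.67 11.71 36.65", t)
  split("RawEMC-read RawSTK-read EMC-read STK-read SCSI-read EMC-write STK-write SCSI-write", name)
  for (i = 1; i <= n; i++) printf "%-12s %6.1f MB/s\n", name[i], 1000 / t[i]
}'
# EMC LVM write works out to ~6.5 MB/s vs ~85 MB/s on the STK
```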
Thanks for the help!
05-30-2006 06:15 AM
Re: Disk I/O Performance Problem
1. Your DMX is saturated every time you run your tests; the cache on the DMX is maxed out.
2. Your kernel parameter for SCSI max queue depth may not be set appropriately. Find out via: "kmtune | grep scsi_max_qdepth". This possibly explains the queuing you are seeing; try increasing it to 32 or even 64.
3. Check your SAN health. Check your switches and HBAs.
Can you post "sar -d 5 10" output while you are doing your tests or when I/O is intense?
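A sketch of checking and raising the queue depth; command syntax is from memory for HP-UX 11.11-era systems, so verify it on your release before running:

```shell
# Query the current system-wide SCSI queue depth (HP-UX 11.11-era syntax).
kmtune -q scsi_max_qdepth

# scsi_max_qdepth is a dynamic tunable on 11i, so a new value
# takes effect without a kernel rebuild.
kmtune -s scsi_max_qdepth=32

# Queue depth can also be inspected and set per device with scsictl.
scsictl -a /dev/rdsk/c4t7d6
scsictl -m queue_depth=32 /dev/rdsk/c4t7d6
```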
05-30-2006 06:21 AM
Re: Disk I/O Performance Problem
Another thing you can do, as you don't have PowerPath, is manually load balance your servers.
For example, say you have 2 HBAs (2 paths):
vgdisplay -v /dev/vg20 | grep "PV Name" | grep -v Alternate | awk '{print $3}' > /tmp/vg20.primary
vi /tmp/vg20.primary and remove, say, all the even-numbered lines (dd line 2, 4, 6, etc.)
for i in `cat /tmp/vg20.primary`
do
vgreduce /dev/vg20 $i
done
for i in `cat /tmp/vg20.primary`
do
vgextend /dev/vg20 $i
done
Rgds...Geoff
05-30-2006 07:05 AM
Re: Disk I/O Performance Problem
scsi_max_qdepth = 8. This has never been a problem before; most of our disk traffic is small I/O. What are others using? Nelson, do you have a suggestion?
I don't think it is hardware, as we experience the same thing across several systems, SAN ports, and switches.
Output of the requested sar -d 5 10 (run twice, because things slow down more as time goes by) is attached.
Geoff, thanks for the suggestion about manually load balancing. We do that already, but I suspect we don't gain anything from it in this case: each LUN is 14GB and my file is 6.5GB, so I would only be writing down a single path during the whole transfer.
05-30-2006 07:06 AM
Re: Disk I/O Performance Problem
Your host-side settings (e.g. scsi_max_qdepth) are the same for both arrays, and yet the performance differences are very large, so it seems you need to find a way to make the STK much slower or the EMC much faster. Obviously one of these approaches is preferred over the other.
It would also be interesting to see the response from EMC support when you present them with these (presumably fair) tests.
06-01-2006 02:01 AM
Re: Disk I/O Performance Problem
I'd be interested in hearing some real world opinions on other arrays (maybe like the EVAs) - maybe I'll start another discussion thread...