Re: VA7410 variable performance

Steve Lewis · 07-06-2004

We have a bizarre performance problem with our VA7410 attached to an rp8400 through 2x2Gb direct attached switches.

It has 2Gb of cache RAM per controller and 72 disks (30x36,30x72,12x144), configured RAID 0/1.

Periodically it shows bad sar stats through some of the 21 LUNs e.g.

device    %busy   avque  r+w/s  blks/s  avwait  avserv
c8t0d7    99.67   19.94    287    1407   79.21    3.47
c10t1d6  100.00    4.81    254    2215   25.65    3.86

I have verified with HP support that the i/o is going down the primary path to the VA (odd d numbers to c8, evens to c10).
As you can see there is a high queue, low number of ops, low throughput and high wait stats.
The VA queue params are set to 33 and 36, the SCSI queue_depth is still 8.
HP got me to run an armdiag on the array and they say that there is nothing wrong with it.
Sometimes these LUNs give high throughput (>60000 blocks) at low busy and waits. Other times it's awful and programs take 10 times as long to run.
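
One way to read the two sar lines above is through the utilization law (busy fraction is roughly ops/s times service time) together with the average transfer size. The sketch below is only a back-of-envelope check: the 512-byte block size is an assumption about how sar counts blks/s on this system, and the function names are made up for illustration.

```python
# Back-of-envelope check of the two sar lines quoted above.
# Assumption: sar counts blks/s in 512-byte blocks.
BLOCK_BYTES = 512

def utilization(ops_per_s, avserv_ms):
    """Fraction of each second the device is busy: ops/s * service time."""
    return ops_per_s * avserv_ms / 1000.0

def kb_per_op(blks_per_s, ops_per_s):
    """Average transfer size per I/O in KB."""
    return blks_per_s * BLOCK_BYTES / ops_per_s / 1024.0

samples = {
    # device: (r+w/s, blks/s, avwait ms, avserv ms)
    "c8t0d7": (287, 1407, 79.21, 3.47),
    "c10t1d6": (254, 2215, 25.65, 3.86),
}

for dev, (ops, blks, avwait, avserv) in samples.items():
    print(f"{dev}: busy~{utilization(ops, avserv):.1%} "
          f"size~{kb_per_op(blks, ops):.1f}KB/op "
          f"response~{avwait + avserv:.2f}ms")
```

On these figures ops/s times avserv comes out at about 0.996 and 0.980, which is consistent with the reported %busy: each LUN is spending nearly the whole second servicing requests at roughly 3.5 to 4 ms apiece. That does not say why the service time is what it is, only that the near-100% busy figure follows from the other columns. Under the 512-byte assumption the transfers are also small, around 2.5 KB and 4.4 KB per operation.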

1. Any idea why the busy percentage is near 100?
2. Has anyone else seen this sort of performance problem?
3. Does anyone think that setting the SCSI max_queue_depth to something bigger would help, or is it a symptom of problems elsewhere?

Steve Lewis · 07-06-2004

Sorry, I meant no switches: direct-attached fibre from the VA array to the server.

Steve Lewis · 07-06-2004

A bit more background info.

Each RG has about 2Tb of data in it, with 200Gb free, hot spare=largest disk.
No business copy.
The i/o is coming from an informix database through KAIO, tuned up to 3000 max concurrent files / ops using IFMX_HPKAIO_NUM_REQ=3000

The performance problems do happen when a particularly heavy batch process runs in 28 parallel streams, but 2 weeks ago we had no problem with the same process, and hadn't for the previous 3 months.

What has changed last week is that I dropped and loaded a bunch of databases and bound another LUN.

The performance problems can be alleviated by bouncing the VA and server.

The server has 12x875MHz CPUs and 12Gb of RAM.

Ted Buis · 07-08-2004

You can use scsictl to increase the queue depth to your advantage. If you only have the one server for the array, then you can increase the depth without too much worry. The max for HP-UX is 256, but the max for the VA is 240, if I remember correctly. I would consider 32. Have you checked system logs for I/O errors? Also, as a file system gets full on any storage system, it slows down. Do a bdf and see how full your file systems are. Lastly, what is your maximum dynamic buffer cache setting in the kernel? If you have large RAM and it is set at the default 50%, then you likely need to reduce it so that it is taking less than 1 GB. Also, check to make sure you have sufficient RAM and that you aren't doing page-outs.
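
For anyone applying the queue depth suggestion across many LUNs, a dry-run sketch like the one below may help. It only builds and prints scsictl command strings; it does not run them. The /dev/rdsk path, the two example LUN names (taken from the sar output earlier in the thread) and the helper name are assumptions for illustration, and the 240 upper bound is simply the figure recalled in the post above.

```python
# Dry run only: prints the commands, does not execute them.
# Assumptions: character device files live under /dev/rdsk, and the LUN
# names passed in are the ones you actually want to change.

def queue_depth_commands(luns, depth):
    """Return one scsictl command string per LUN (hypothetical helper)."""
    if not 1 <= depth <= 240:  # 240 is the VA limit as recalled in the thread
        raise ValueError("depth outside the range discussed in the thread")
    return [f"scsictl -m queue_depth={depth} /dev/rdsk/{lun}" for lun in luns]

for cmd in queue_depth_commands(["c8t0d7", "c10t1d6"], 32):
    print(cmd)
```

Check the printed commands against the scsictl man page on your release before running any of them by hand.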

Steve Lewis · 07-08-2004

Thanks for replying Ted. I have tried increasing the scsi queue depth to 28 on each LUN. My dodgy math for this was to multiply the physical disks by the default depth of 8, then divide by the number of LUNs. It made no difference to the performance, which indicates to me that the queue originates from the storage or the HP-UX KAIO, not the array cache / controllers.
No i/o errors reported in the syslog.
The filesystems have up to 100Gb free, although it can vary by 60Gb per day, because of all the data loading/unloading that goes on.
dbc_max_pct/min_pct is 10/5 on 12Gb of RAM. 5/5 would be better, but I think it's pretty marginal as it is, since nearly all the i/o is raw from database to logical volume, bypassing filesystems and buffers.
We have 600Mb of RAM free, as indicated by vmstat showing 150000 free pages, pageouts never happen.
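
The arithmetic in this reply can be checked with a few lines. This is a sketch, not a sizing tool: the 4 KB page size is an assumption about what vmstat counts, and the function names are invented for the example.

```python
import math

PAGE_BYTES = 4096  # assumption: vmstat free pages are 4 KB each

def per_lun_queue_depth(disks, default_depth, luns):
    """Rule of thumb from the post: disks * default depth / LUNs, rounded up."""
    return math.ceil(disks * default_depth / luns)

def buffer_cache_gb(ram_gb, dbc_pct):
    """Buffer cache size in GB at a given dbc_max_pct or dbc_min_pct."""
    return ram_gb * dbc_pct / 100.0

def free_mb(free_pages):
    """Free memory in MB from a vmstat free-page count."""
    return free_pages * PAGE_BYTES / (1024 * 1024)

print(per_lun_queue_depth(72, 8, 21))                    # 72 disks, depth 8, 21 LUNs
print(buffer_cache_gb(12, 10), buffer_cache_gb(12, 5))   # 10% and 5% of 12 GB
print(round(free_mb(150000)))                            # 150000 free pages
```

The first figure reproduces the depth of 28 that was tried. With dbc_max_pct at 10 the buffer cache can grow to 1.2 GB, slightly above the 1 GB suggested earlier, and 0.6 GB at 5. The free-page count works out to roughly 586 MB, in line with the 600Mb quoted.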

The other thing that happened is we added an extra tray of disks to the array.

Does anyone know how the VA balances existing data across a new tray of disks, when it gets added?

Bernhard Mueller · 07-08-2004

Steve,

I found the following quite interesting:

http://search.hp.com/redirect.html?url=http%3A//forums1.itrc.hp.com/service/forums/questionanswer.do%3FthreadId%3D218262&qt=VA7410&hit=4

Regards,
Bernhard

Steve Lewis · 07-08-2004

Thanks Bernhard, that was very interesting.
Yes my boss bought 12x146s because it gave the most storage for the money, which was so tight he couldn't afford the other 3 disks to fill up the DS2405.
So we have the 146Gb disks with 4 times the i/o of the 36Gb disks. Surely other people must be adding more trays to their arrays as well, with larger disks.
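
The "4 times the i/o" figure follows if blocks are spread across disks in proportion to capacity and then accessed uniformly. Both are assumptions made here for illustration, not confirmed behaviour of the VA. The disk counts come from the first post, with the large disks taken as 146 GB as in this reply (the first post lists them as 144).

```python
# Assumption: data, and therefore I/O, lands on each disk in proportion
# to its capacity. Illustrative model only, not documented VA behaviour.
disk_groups = {36: 30, 72: 30, 146: 12}  # capacity in GB -> number of disks

total_gb = sum(size * count for size, count in disk_groups.items())

def io_share_per_disk(size_gb):
    """Fraction of all I/O hitting one disk of the given size."""
    return size_gb / total_gb

for size, count in disk_groups.items():
    ratio = io_share_per_disk(size) / io_share_per_disk(36)
    print(f"{size}GB disk: {io_share_per_disk(size):.2%} of I/O each, "
          f"{ratio:.1f}x a 36GB disk, "
          f"{count * io_share_per_disk(size):.0%} for the group")
```

Under this model the twelve 146 GB disks carry about 35% of the I/O while making up only 12 of the 72 spindles.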

This alone does not explain the variation in performance, because sometimes it is fine and reboots make it go quicker for a while.

Ted Buis · 07-09-2004

The link is interesting, but it doesn't make much sense to me. How is RAID5DP going to be faster than RAID 1/0? The answer may be that AutoRAID tries to keep most frequently accessed data in RAID 1/0, and if there is no locality of reference, this is a wasted effort. Also, once the array passes a point of being too busy, it suspends these optimization efforts. Why would your performance vary so greatly? I think the key is more likely in different usage. You say that it can vary by 60GB per day, which means that you are writing up to 60GB per day, or quite possibly much more. If the VA buffer cache fills, you will get a big slowdown in performance. Normally, if the cache isn't full, the VA will accept the data for a write into the cache and immediately report back to the host that the I/O is complete, which is much faster than the back-end performance of actually doing the write. Once the cache is full, performance is going to be more like the back-end rate, with latency looking like there is no cache at all, since I/O will have to complete on the back end before there is more room in the cache. That would be much, much slower. How many LUNs do you have and of what size? Many small LUNs can "solve" the SCSI queue depth issue on the host, but not help the virtual array to optimize the array resources. Do you have Glance or Measureware so you can see your I/Os per second? Are most of your accesses large sequential transfers or small random blocks typical of OLTP?
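
The cache behaviour described above can be sketched as a simple fill-and-drain model. The rates below are hypothetical and chosen only to show the two regimes; the 2048 MB figure is the per-controller cache size from the first post, and treating all of it as available for writes is an optimistic assumption.

```python
def seconds_until_cache_full(cache_mb, host_write_mb_s, backend_mb_s):
    """Time until a write-back cache fills, or None if the back end keeps up."""
    surplus = host_write_mb_s - backend_mb_s
    if surplus <= 0:
        return None  # cache never fills; writes complete at cache speed
    return cache_mb / surplus

# Hypothetical rates for illustration only.
print(seconds_until_cache_full(2048, 80, 60))  # burst outruns the back end
print(seconds_until_cache_full(2048, 40, 60))  # back end keeps up
```

In the first case the cache absorbs the burst for a limited time and then writes proceed at the back-end rate; in the second it never fills. A batch run that stays just under the back-end rate on one day and just over it on another would see very different write latency, which is one way the same workload could behave so differently.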

Steve Lewis · 07-09-2004

I have just been told that the poor performance may have been due to us copying data from existing LUNs into the newly-created ones.
We had to free up space in one database instance to make room for a big process over the weekend.
Deleting databases is not the same as deleting a LUN, as it is just a logical delete and the VA thinks the space is still allocated.
We *may* have had a consequential issue with read-modify-writes at this time. They cause 6 times as many writes to disk as a fresh write into newly allocated space.

Well, that's the current theory anyway.

We have 21 LUNs in 2Tb of space, approx 50/50 in RG1 and RG2.

I understand about the cache/disk writes, but we have run the same process several times before with no performance degradation, so that isn't the reason.
