Disk Arrays

When disk faild in VA7410, I/O throughput was terribly reduce

SOLVED
BG Jeong
Advisor


We don't have any trouble on a normal day, but when a disk failed in the VA7410, the I/O throughput was terribly reduced.
I have attached more technical details.

I also ran sar -d 2 30 on a normal day.
In the sar output, two device files show %busy close to 100%.
Has anyone experienced similar problems before?
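
For anyone who wants to pick the saturated devices out of a long sar -d capture, here is a minimal Python sketch. It assumes the usual column order (device, then %busy) and that only the first row of each interval carries a timestamp; adjust it if your output differs. The sample rows and their numbers are invented for illustration and are not taken from the attachment.

```python
import re

# Hypothetical sample in the layout `sar -d` is assumed to print.
# The numbers are invented for illustration only.
SAMPLE = """\
12:00:00   device   %busy   avque   r+w/s  blks/s  avwait  avserv
12:00:02   c9t0d1   98.50    0.50     410    6560    5.10   11.20
           c10t0d2  99.00    0.50     395    6320    4.90   12.00
           c9t0d3   12.00    0.50      40     640    0.00    3.10
"""

TIME = re.compile(r"^\d{2}:\d{2}:\d{2}$")


def busy_devices(text, threshold=90.0):
    """Return {device: highest %busy seen} for devices at or above threshold."""
    peaks = {}
    for line in text.splitlines():
        fields = line.split()
        # Drop a leading timestamp or "Average" label if present.
        if fields and (TIME.match(fields[0]) or fields[0] == "Average"):
            fields = fields[1:]
        if len(fields) < 2 or fields[0] == "device":
            continue  # blank line or column header
        try:
            busy = float(fields[1])
        except ValueError:
            continue  # banner or other non-data row
        peaks[fields[0]] = max(busy, peaks.get(fields[0], 0.0))
    return {dev: b for dev, b in peaks.items() if b >= threshold}


if __name__ == "__main__":
    for dev, busy in sorted(busy_devices(SAMPLE).items()):
        print(f"{dev}: peak %busy {busy:.1f}")
```

The 90% threshold is an arbitrary choice for the sketch; a device that is busy close to 100% on a normal day has little headroom left for rebuild traffic.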

Thanks in advance.
Tru64 from Korea
14 REPLIES
Kathleen_3
Occasional Visitor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

Hi,

I would like to add my experience with HP VA disk arrays when they are rebuilding.

I recently had a drive fail and had to rebuild the array after pulling the drive. The rebuild took far longer than expected, and the array performed so badly during the rebuild that our production Oracle database processing and filesystem dumps got messed up.

Our array used to take over 24 hours to rebuild after losing a drive. After talking to HP (and not getting any help from them) I made some changes to my setup. I bought another array and split our storage into production and development. I set up the production array in RAID 1+0 mode. This uses a huge amount of disk, but the docs say it gives the best performance. I also checked and tweaked some of the other array configuration parameters.

The array rebuild now takes 8 to 12 hours depending on I/O activity and rebuild priority. This is good news but still does not address the issue of PITIFUL performance during the rebuild process. During the rebuild process the array is essentially unusable.

So, one option is to set the array to manual rebuild. This means that I get to choose the time the rebuild runs. This is risky because if another drive fails before I rebuild the array I could end up having a really bad day.

I am now looking for an alternative to HP disk arrays. It's unacceptable for a technology this expensive to have serious performance problems.

I would appreciate hearing about other folks' experience with HP VA disk arrays. Specifically, I would be interested in hearing about solutions to the problem.

Did you find a configuration that works?

Did you replace the HP disk arrays completely, as I am now planning to do?

If you replaced your HP disk arrays, what manufacturer did you use? EMC? NetApp? StorageTek?

Any replies would be appreciated.

thanks,

kev
BG Jeong
Advisor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

Thanks, Kathleen.
I use two VA arrays: a VA7110 and a VA7410.

When they are rebuilding,
the VA7110 takes over 24 hours to rebuild.
The VA7410 takes over 12 ....
During the rebuild I cannot even connect to the server by telnet.

Our VA arrays are actually managed by HP.
They have not resolved this pitiful performance in 3 years. ^^;

I am tired of this.
We are now planning to replace them with EMC.

I have used many RAID systems: HSG80, HP EVA5000, the Smart Array in ProLiant servers.....
Nothing went wrong except with the VAs.

I really don't understand why HP sold this garbage.

But I don't want to buy a new array.

Please help after referring to our system config details.

Tru64 from Korea
Sameer_Nirmal
Honored Contributor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

The subject says that the failure of the drive caused the performance problem, and I would blame the failed drive itself. The failed drive may have jammed the FC loop, perhaps through a LIP storm or similar, which would have affected performance. There are known issues like this, and I have been through them: they disrupt the FC loop and FC communication and can make the whole storage/SAN unresponsive.

Looking at the configuration details provided, it seems that CommandView SDM version 1.09.00 is being used. I would suggest upgrading it to version 1.09.02. Starting with version 1.09.01 and firmware A140, there are new backend diagnostics and backend performance statistics, which are quite useful for seeing what is going on at the back end of the array.


Looking at the logprn result:

The FRONTEND_FC_ABTS_EVENT_EH event indicates the host aborting an I/O. Maybe the VA didn't respond to the I/O request within an acceptable response time?

For the FRONTEND_SERVICES_EVENT_EH event, the extended information needs to be checked. This can be done using armdiag -W.

So I would suggest running the diagnostics and getting as much detail as possible out of the running VA to find out what is going on.

The HP Virtual Array was not a successful product in the market. Technically the product looked strong but didn't perform as it should. I have seen performance problems, AutoRAID issues, and many hardware failures (backend loop issues, FC drive failures, LIP storms, etc.) on this product. The VA isn't as solid a product as it should have been. Maybe that's why HP discontinued it 2 years ago (though supporting it until 2010) and is going with MSA/EVA arrays in a big way.

I have worked on HSC, HSG, HSZ (DEC fan) and EMA, which are high-performance, reliable and robust products. Nowadays MSA/EVA arrays are technically good, and HP has a good share of the storage market with these products.
Torsten.
Acclaimed Contributor
Solution

Re: When disk faild in VA7410, I/O throughput was terribly reduce

If a VA is set up with a good configuration and running the latest firmware this array is not this bad.

I agree that a failing disk may cause an internal loop to hang.

Apropos configuration.

It looks like your setup is not optimal even with all disks working.

Have a look:

You have LUNs in Redundancy Group 1 and 2, but you access a LUN in RG 1 sometimes via the primary and sometimes via the secondary controller path. Accessing it always via the primary path will increase performance.


Example:

--- Physical volumes ---
PV Name /dev/dsk/c9t0d1
PV Name /dev/dsk/c10t0d1 Alternate Link
PV Status available
...
PV Name /dev/dsk/c10t0d2
PV Name /dev/dsk/c9t0d2 Alternate Link


This is LUN 1 and 2, but

LUN 1:
Redundancy Group:_____________________1

LUN 2:
Redundancy Group:_____________________1


You did not attach an ioscan, but I think you access LUN 2 using the secondary path which will decrease the performance.
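
The check described above can be sketched in Python. This is only an illustration: it assumes that the dN part of a device file matches the LUN number and that the cN part identifies the path to one controller. The LUN to RG mapping is the one quoted in this post. The script cannot tell which cN leads to the primary controller of an RG; that still needs the ioscan output.

```python
import re

# Excerpt quoted in this post; "..." marks lines that were left out.
VGDISPLAY = """\
--- Physical volumes ---
PV Name /dev/dsk/c9t0d1
PV Name /dev/dsk/c10t0d1 Alternate Link
PV Status available
...
PV Name /dev/dsk/c10t0d2
PV Name /dev/dsk/c9t0d2 Alternate Link
"""

# LUN -> redundancy group, as listed in this post.
LUN_RG = {1: 1, 2: 1}

DEV = re.compile(r"/dev/dsk/c(\d+)t(\d+)d(\d+)")


def primary_paths(text):
    """Map LUN number -> cN instance of its primary (non-alternate) link."""
    primary = {}
    for line in text.splitlines():
        if not line.startswith("PV Name") or "Alternate Link" in line:
            continue
        m = DEV.search(line)
        if m:
            primary[int(m.group(3))] = int(m.group(1))
    return primary


def mixed_groups(primary, lun_rg):
    """Return {RG: set of cN instances} for RGs reached via more than one primary path."""
    by_rg = {}
    for lun, ctrl in primary.items():
        if lun in lun_rg:
            by_rg.setdefault(lun_rg[lun], set()).add(ctrl)
    return {rg: ctrls for rg, ctrls in by_rg.items() if len(ctrls) > 1}


if __name__ == "__main__":
    for rg, ctrls in mixed_groups(primary_paths(VGDISPLAY), LUN_RG).items():
        print(f"RG {rg}: primary links spread over c{sorted(ctrls)}")
```

With the excerpt above it reports RG 1 as being reached over both c9 and c10, which is the mismatch described in this post.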

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

If you feel this was helpful please click the KUDOS! thumb below!   
Torsten.
Acclaimed Contributor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

See attachment.

Hope this helps!
Regards
Torsten.

Torsten.
Acclaimed Contributor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

See also

http://bizsupport.austin.hp.com/bc/docs/support/SupportManual/c00311965/c00311965.pdf

page 46 Product Overview

Hope this helps!
Regards
Torsten.

Oliver Wriedt
Valued Contributor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

Hi

We solved this by disabling "Auto Rebuild" on all our VAs.
When a disk fails, we wait for a period of low activity to replace it, which then starts a balancing process that runs for some hours.
After that everything is back to normal.

Rgds
oliver
Arend Lensen
Trusted Contributor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

Disabling auto rebuild can be a good idea for an array running in RAID 1 mode; for AutoRAID mode I would not recommend it.
For HP-UX, make (very) sure that a recent SCSI IO cumulative patch is installed, as older patches have a bug that can leave the queue depth (QD) set to 1 for a LUN (or several LUNs). From the host's perspective it looks like really bad performance, when in fact the array is almost idle during the rebuild. Please read the text in patch PHKL_29047/29049 and look for: "SR: 8606155022 CR: JAGad24339".
This issue occurs mostly during a rebuild because the array sends out "queue full" responses, which cause the driver to reduce the queue depth from the default of 8 to 4, 2 and even 1 if the array keeps sending queue fulls. The problem with the buggy driver is that it does not go back to 8 and stays at 1 until you reboot the host. With 10 hosts it's very likely that all of them will show the same bad performance, and hence the finger is pointed at the array.
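
To make the effect concrete, here is a toy Python model of the throttling described above. It is not the driver's actual code: the halving on each "queue full" follows the 8, 4, 2, 1 sequence from this post, while the recovery rule (back to the default after the first good completion) is purely an assumption used to contrast with the buggy behaviour.

```python
DEFAULT_QD = 8  # default queue depth mentioned in this post


def simulate(events, recovers):
    """Replay 'full' / 'ok' events and return the queue depth after each one.

    recovers=False imitates the bug (the depth never climbs back).
    recovers=True is an assumed fix that restores the default on a good completion.
    """
    qd = DEFAULT_QD
    history = []
    for ev in events:
        if ev == "full":
            qd = max(1, qd // 2)  # 8 -> 4 -> 2 -> 1, never below 1
        elif ev == "ok" and recovers:
            qd = DEFAULT_QD
        history.append(qd)
    return history


if __name__ == "__main__":
    events = ["full", "full", "full", "ok", "ok"]
    print(simulate(events, recovers=False))  # [4, 2, 1, 1, 1]
    print(simulate(events, recovers=True))   # [4, 2, 1, 8, 8]
```

In the first run the host is left issuing one I/O at a time per LUN long after the array has stopped sending queue fulls, which is why the array can look slow from the host while being almost idle.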
During normal operation, try to find the best setting for the queue full threshold (QFT); there are several documents available on how to tune it. It is configured per host port. Prefetch should be enabled for better read performance.
Try to avoid mixing disks of different sizes in the same redundancy group, as that may severely impact performance.
For arrays running in AutoRAID mode, check the documents regarding "magic capacity", as a wrong configuration will perform badly.
If there are any questions, please ask. I'm convinced that the array is a good product, but tuning is important! For questions regarding a VA array, please supply the output of the armdiag -I -if command.

Best regards,
Arend
Kevin Lister
Frequent Advisor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

Hi Gang,

I am showing up as Kathleen, but my name is Kevin. Looks like there is a problem with the login account.

Anywho, I wanted to reply again after reading some of the other replies.

Specifically, I wanted to assure Arend Lensen that the performance problem with VAs during rebuild mode is not a configuration problem and cannot be fixed by tuning the array or anything else for that matter.

I have had HP support take all the information they can from my array, systems, etc., analyze it, and then make suggestions. I implemented the suggestions they made, which amounted to buying more VA hardware and placing the arrays in RAID1+0 mode.

I did reduce the amount of time that a rebuild takes. But the array is still almost completely unusable while the rebuild is in progress. We had another episode of our production processes getting delayed so badly that there were failures.

The bottom line is that my arrays are well maintained and properly tuned. I even try to make sure that each individual drive I place in the array has the latest stable firmware release for that model of drive! None of this matters due to the overall design of the array. There is no method of configuration or tuning that is going to resolve the ridiculous degradation in performance during the rebuild process.

Here are some possible scenarios I've come up with to get around this problem:

1) I can wait until the weekend to rebuild the array during a time of relative quiescence. What if I have another drive fail during that time?

2) I can mirror everything across multiple arrays and when I need to rebuild one of them I can simply reduce the mirrors and take that array out of the equation.

3) I can scrap my three 7400s and multiple TBs of disk and buy a better product. This is an expensive solution, but it's the one that is most likely to happen at this point.

It's disturbing to think that a major player would sell a product this poor knowing that there are serious problems with the technology.

If any HP support engineers reading this think they have a solution they should contact me immediately. My company is close to ditching the HP storage and the support contracts that go with them.

My email address is klister@ccah-alliance.org. I'll try to fix the account login info to my correct info but if I can't get it fixed please feel free to send me an email directly.

kev
BG Jeong
Advisor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

Thanks Torsten.

I have not reconfigured the device paths yet.
It is hard to modify the paths because this is DB storage.

That frightens me. ^^

Could you show me the steps or a sample command?

Also,

if I modify the paths, how much will I/O performance increase? About 10% to 20%?

Thank you so much.

P.S. I have attached the ioscan log.
Tru64 from Korea
Anthony Cole
Occasional Visitor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

I have recently started having the same issues with our VA7410. When the array loses a disk, the rebuild slows the array to a crawl and the servers start sending out SCSI aborts.

I am working with HP support to solve the issue, but so far we have not come up with any answers. We do have mixed sizes of disks in our redundancy groups. I wonder how big of an impact this is having on the array.

I would consider replacing the smaller disks so that all disks are the same size. However, I don't think that is an option. Each disk replacement takes a rebuild time of 24 hours. With the performance problems we are seeing, the array would be completely unusable during each rebuild, and the whole process would take several days or weeks.

I am wondering if this could be firmware related. HP upgraded the firmware not too long before we started having issues.

M/C1 and M/C2 are at version 140
M/C1.B1 and M/C2.B1 are at version 5.0
All LCCs are at HP05
All disks have the latest firmware.
Kevin Lister
Frequent Advisor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

Anthony,

I have had this problem on my arrays and there is only one solution: buy a better product. The HP VAs are mid-range technology at best. They have some neat features and overall I like the arrays, but the performance degradation during rebuilds is a major problem.

I haven't been able to resolve the performance problem, but I have been able to reduce the rebuild time by doing the following:

1) Keep the number of drives in the array as low as you can.

2) Run the array in RAID1+0 mode only (not Autoraid or RAID5DP).

3) Keep the array in manual rebuild mode and rebuild it at times when I/O is at a minimum.

4) Before you start the rebuild, remove the failed drive and replace it with a good one. Only do this as long as the array is not giving you the "DO NOT ADD OR REMOVE ANY HARDWARE..." message!

5) HP recommends keeping some empty space in the RGs. Not sure that helps.

Although the performance problem is not resolved on my arrays, I have learned to perform the rebuilds on weekends when the I/O activity on my systems is minimal. I've reduced the array rebuild times from over 24 hours to 8 hours or less by doing the stuff listed above.

I also recommend you let HP perform their diagnostic check on the config of your array, as they may find some settings that need to be adjusted. They will send you a report on your array detailing any problems they find and recommended changes to correct them.

Wish I could be of more help to you!

kev
Anthony Cole
Occasional Visitor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

Thanks Kev,

I talked it over with my team and I think that is exactly what we are going to do. We had plans to replace the SAN in the next 3 or 4 months anyway because the end of support is approaching. Maybe this will get us by until we can get the replacement.

Anthony
Kevin Lister
Frequent Advisor

Re: When disk faild in VA7410, I/O throughput was terribly reduce

Hi Anthony,

You are welcome!

If you can remember to, it would be cool if you could let us know what hardware you decide to go with.

kev