HPE StoreVirtual Storage / LeftHand

P4500 Hard Drive Failure

UHSMike
Occasional Visitor

In the past two to three months I've replaced about 14 "bad" hard drives. Of those 14, only about 2 were flagged as faulty in the CMC. The rest came to light after we received 'cluster is overloaded' / unprotected LUN email alerts and opened a support ticket with HP. HP then looks at the logs, notices that drive X is failing (not badly enough to register in the CMC yet, I guess), and sends us a replacement. Is anyone else seeing multiple failures like this? Is this normal? Should a drive that isn't bad enough to register in the CMC be bad enough to overload the whole SAN and bring down a manager?

3 REPLIES
manadrain
Occasional Advisor

Re: P4500 Hard Drive Failure

 

We have a total of 5 P4500 G2 servers, and I have had 5 separate cases opened on what appears to be an issue with hard drive model MB2000FAMYV (I have been receiving model MB2000FBZPN, firmware HPD1). What I have found is that these units do not appear to have any issue if they are rebooted. Once they are powered off, however, I have had anywhere from 1 to all 12 drives in a system fail. I have had to replace almost 30 drives in the last 2 months.

The first case was opened on a different issue: a system board replacement for a bad integrated NIC. HP sent a tech on site to replace the system board, and when I powered the system back up I had 6 drives offline. We spent several days on the phone and were sent a spare RAID controller, cable, cache module, battery backup, and backplane. In the end they still had to send us 6 replacement drives. All of the drives were running firmware HPD4, and HP recommended that we update to HPD5.

About a month ago I started to do this. I powered off one of the other P4500 G2 servers to put in the HP Firmware Update Disc, and when I powered it on I had 7 failed drives (I didn't even get to the point of updating the firmware). In this case I was sent 7 drives, and I started to question whether powering the systems off was making the hard drives fail at a very rapid rate.

To test this, after I repaired that system I rebooted the next one instead of powering it off and updated the hard drive firmware to HPD5. The system rebooted without any issue. I then wanted to see what would happen when I powered the system off. I powered it off, and when I powered it back on all 12 drives had failed. At this point I believe the techs at HP/Lefthand started to take seriously that a problem exists with these drives. I continued our process and have had to reload 3 of our 5 systems.

 

The last time I spoke with an HP/Lefthand tech, they stated it may be a BIOS / hard drive firmware issue that has not been fixed yet. I have had 3 different sets of drives sent back to their lab for evaluation, but I have not received any other information. I have a similar post on this same forum; I believe its title is close to yours.

 

I hope the problem gets corrected, and if I find out anything I will let you know. In the meantime I would be very careful about taking down more than one system at a time, because this appears to be a serious problem.

Gediminas Vilutis
Frequent Advisor

Re: P4500 Hard Drive Failure

We have 21 P4x00 nodes with ~240 drives in total (both 15k rpm and 7.2k rpm) in production. Our failure rate is much lower, about 1 disk/month, so you have probably had bad luck with a buggy disk model. In our experience about half of disk failures start with 'cluster overloaded' as the first symptom; in the remaining cases either SMART first reports the disk as 'faulty' or the RAID controller removes the disk from the RAID group.
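
To put the two failure rates in this thread side by side, here is a rough back-of-the-envelope calculation. It assumes 12 drives per P4500 G2 node (inferred from the "all 12 drives" remark above) and rounds "almost 30" replacements to 30, so treat the second figure as an approximation only. The function name is just for illustration.

```python
def annualized_failure_rate(failures, drives, months):
    """Return failures per drive per year as a fraction (0.05 == 5%)."""
    if drives <= 0 or months <= 0:
        raise ValueError("drives and months must be positive")
    return failures / drives * (12 / months)


# Our installation: ~240 drives, about 1 failed disk per month.
ours = annualized_failure_rate(failures=1, drives=240, months=1)

# The 5-node site above. ASSUMPTIONS: 12 drives per node (60 total)
# and "almost 30" replacements rounded to 30 over 2 months.
theirs = annualized_failure_rate(failures=30, drives=5 * 12, months=2)

print(f"ours:   {ours:.0%} per year")
print(f"theirs: {theirs:.0%} per year")
```

Under those assumptions the difference is well over an order of magnitude, which looks more like a problem with one drive model than ordinary wear.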

 

Basically, I think the problem is that a 'near faulty' disk starts to slow down the whole RAID controller and the storage system's IO operations (and, as a consequence, the whole storage cluster's IO operations), but its error rate is not high enough for SMART or the CMC to mark it as bad. HP support suggested that we install HP SIM for P4x00 node monitoring (in theory SIM should monitor disk errors, so you should notice a failing disk earlier). I haven't done that yet.
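
As a minimal sketch of the idea (this is not how SIM or the CMC works, just an illustration), a slow but not yet "failed" disk could be spotted by comparing each disk's average latency against its peers in the same node. The latency numbers, the function name, and the 3x threshold below are all made up; how you would actually collect per-disk latency from a node is left open.

```python
from statistics import median


def flag_slow_disks(latency_ms, factor=3.0):
    """Return the bays whose average latency exceeds factor x the median.

    latency_ms maps a bay number to that disk's average latency in ms.
    The factor of 3.0 is an arbitrary illustrative threshold.
    """
    if not latency_ms:
        return []
    typical = median(latency_ms.values())
    return sorted(bay for bay, ms in latency_ms.items() if ms > factor * typical)


# Hypothetical sample data, not taken from a real node.
sample = {1: 8.1, 2: 7.9, 3: 8.4, 4: 61.0, 5: 8.0, 6: 8.3}
slow = flag_slow_disks(sample)
print(slow)
```

Comparing against the median rather than the mean keeps one very slow disk from dragging the baseline up and hiding itself.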

 

Gediminas