Question about MSA60 - Numerous drives fail as "Hot Removed"

 
StephenWagner7
Frequent Advisor

I was hoping someone can help me out...

 

We have an MSA60 filled with 12 x 500GB HP SATA MDL drives attached to a P800 controller inside a DL360 G6. We have configured 2 RAID 6 volumes, 6 disks each.

 

For the last 8 months or so, 1-4 disks have failed per month. The reported failure reason is generally "Hot Removed", yet no one is removing the disks. I've noticed that these failures are most likely to occur within 48 hours of restarting the server and storage unit (yes, I'm familiar with the proper shutdown and startup order).
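
As a rough back-of-the-envelope check, the figures above (12 drives, 1-4 failures per month, about 8 months) can be turned into an annualized replacement rate. This is only a sketch: it assumes the monthly rate stays constant over a year, and the function name is made up for illustration.

```python
# Figures taken from the post: 12 drives, 1-4 failures per month, ~8 months.
DRIVES = 12
MONTHS_OBSERVED = 8


def annualized_replacement_rate(failures_per_month: float, drives: int = DRIVES) -> float:
    """Fraction of the drive population replaced per year.

    Assumption: the monthly failure rate stays constant for 12 months.
    """
    return failures_per_month * 12 / drives


low = annualized_replacement_rate(1)
high = annualized_replacement_rate(4)

print(f"Failures over {MONTHS_OBSERVED} months: {1 * MONTHS_OBSERVED}-{4 * MONTHS_OBSERVED}")
print(f"Annualized replacement rate: {low:.0%} to {high:.0%}")
```

Even at the low end this works out to replacing the equivalent of the whole enclosure once a year. Drive vendors typically quote annualized failure rates in the low single-digit percent range, so the numbers point away from ordinary drive wear.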

 

The MSA60 and disks are under warranty. So generally I generate an ADU report, call HP, and initiate a warranty case. After reviewing the ADU reports, they issue replacements, we swap the drive, etc...

 

When I first noticed a trend, I mentioned it to the HP warranty tech on the phone and was told not to worry and just replace the disk. Since then, every 3 or 4 calls I ask how common these failures are and, given the specific type of error we are receiving, whether the cause could be something other than the drives. They usually say it's probably just the load we put on the drives: "It depends on the load, how many read/write errors there are, etc..."

 

Our load type is in my opinion light: Users read/write ~100GB a week. Differential backups are done daily (daily reads of 50GB), and a full backup is done weekly (~2TB reads weekly).
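
To put those load figures on a per-drive basis, here is a minimal sketch of the arithmetic. It assumes the differential backup runs all 7 days of the week, that I/O is spread evenly across all 12 drives, and it ignores RAID 6 parity overhead, so treat the result as an order-of-magnitude estimate only.

```python
# Figures taken from the post (GB); ~2TB is approximated as 2000 GB.
USER_IO_GB_PER_WEEK = 100
DIFFERENTIAL_GB_PER_DAY = 50
FULL_BACKUP_GB_PER_WEEK = 2000
DRIVES = 12


def weekly_io_gb(differential_days: int = 7) -> int:
    """Total weekly I/O in GB.

    Assumption: differential backups run on `differential_days` days per week.
    """
    return (
        USER_IO_GB_PER_WEEK
        + DIFFERENTIAL_GB_PER_DAY * differential_days
        + FULL_BACKUP_GB_PER_WEEK
    )


total = weekly_io_gb()
# Assumption: I/O is spread evenly over all drives; parity overhead is ignored.
per_drive = total / DRIVES

print(f"Total weekly I/O: {total} GB")
print(f"Per drive: {per_drive:.0f} GB per week")
```

Under these assumptions each 500GB drive handles roughly 200 GB of traffic per week, which is less than half of its own capacity.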

 

Should we be going through drives this much?

 

 

On a side note: 2 months ago, one disk failed. When I replaced it, the system didn't recognize the new disk, and after I swapped in another one the entire array became inaccessible (the Windows event log reported SCSI errors). I shut down the server, then powered off the MSA60. After powering everything back on, it reported something like 4 disks as failed: 3 in the first RAID 6 and 1 in the second. I restarted the unit, got the same result, and shut it off. I then opened it up, removed everything, and re-seated every component. After re-seating the I/O board, backplanes, and main board in the MSA60, I finally got it to power on reporting only 1 drive failure (and the data was good!). I initiated a case and got the drive replaced, and then within 72 hours the other disks that had been marked as failed but then cleared suddenly started failing. Thankfully it happened over time, so we could replace each one and let the rebuild finish before the next one failed.

I mentioned this to the HP warranty phone tech, and he said it could possibly be a backplane or something else; however, he said we should just replace the drive and see what happens.

 

Since then, it mostly behaved until 2 days ago, when I had to restart the server and MSA60. 20 hours after restarting both, a drive that we had replaced 2 months ago failed.

 

I keep an eye on the read/write error stats on the HP Management page and never see any. For the most part there are no logged bad reads or writes...

 

I'm scared to shut down the server and storage units now!

 

PS. Running latest firmware on everything...