ProLiant Servers (ML,DL,SL)

Excessive Failed Hard Drives

 
SOLVED
Rancher
Honored Contributor

Excessive Failed Hard Drives

We have two ML370 G5 servers, each with 4 MSA 60 storage shelves, running Server 2008, with the HP firmware and software for Insight Manager support. These servers have only been on line since the first of December and we have already had 5 failed hard drives. It is true that we are running a lot of drives, but this seems excessive to me. Any thoughts? Thanks!
35 REPLIES
Michal Kapalka (mikap)
Honored Contributor

Re: Excessive Failed Hard Drives

hi,

I think you have a lot of drives, so the number of failed drives is not so big. The question is whether the failed drives were all in one shelf, or whether these were random disk failures.

mikap
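One way to put a number on "is this excessive" is a rough Poisson estimate. This is only a sketch: the drive count, the annualized failure rate, and the time in service below are all assumed values for illustration, none of them come from this thread.

```python
import math

def prob_at_least(k: int, expected: float) -> float:
    """P(X >= k) for a Poisson-distributed failure count with the given mean."""
    below = sum(math.exp(-expected) * expected**i / math.factorial(i) for i in range(k))
    return 1.0 - below

# All three inputs are assumptions for illustration, not values from this thread.
drives = 96            # hypothetical total number of drives across both servers
afr = 0.03             # hypothetical annualized failure rate (3% per drive per year)
months_in_service = 2  # hypothetical time since the servers went online

# Expected number of failures over the period, assuming a constant failure rate.
expected = drives * afr * months_in_service / 12
print(f"expected failures: {expected:.2f}")
print(f"P(at least 5 failures): {prob_at_least(5, expected):.4f}")
```

With these assumed inputs the expected count is well under one failure, so five failures would be hard to explain by chance alone. Plug in your real drive count and a failure rate you trust before drawing conclusions, and note that early-life failures are usually more frequent than a constant rate suggests.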
Roy Main
Valued Contributor

Re: Excessive Failed Hard Drives

It does sound excessive. You'll want to troubleshoot them a little more the next time you have a failure. Start collecting ADU reports. Run the drive diagnosis in Insight Diagnostics before you replace the next one and see if it agrees with the failure. Also take a look at the environment, specifically heat, and make sure you are operating within the correct temperature range.
Rancher
Honored Contributor

Re: Excessive Failed Hard Drives

They are random disk failures. As a matter of fact, I just got an alert this morning that another drive is in the predictive failure mode. I thought maybe it might just be a problem with the HP management agents reporting false failures. This time I ran diagnostics and sure enough, it also showed the drive as failing.
cnb
Honored Contributor

Re: Excessive Failed Hard Drives

Can you run ADU and review or attach the report here?


Rgds,
Rancher
Honored Contributor

Re: Excessive Failed Hard Drives

Last night I updated the controller firmware as cnb advised me to do in a different thread.

The last time I ran ADU and it did fail. I have a screen shot but I don't know how to attach it.
Rancher
Honored Contributor

Re: Excessive Failed Hard Drives

I just had another failed hard drive. Here is what Diagnostics reported:

Failed
Error: 640001: Controller has reported a SMART error on this drive
Error: 640006: The Read and/or Write HARD error rate is above threshold
wobbe
Respected Contributor

Re: Excessive Failed Hard Drives

I guess most hard disk failures take place during the first few months, so I would expect failure rates to drop.
cnb
Honored Contributor

Re: Excessive Failed Hard Drives

Don't forget to upgrade the disk firmware. Some predictive failure issues were resolved with certain disk drive firmware updates.

Rgds,
cnb
Honored Contributor

Re: Excessive Failed Hard Drives

>The last time I ran ADU and it did fail. I have a screen shot but I don't know how to attach it.

To attach files:
https://forums13.itrc.hp.com/service/forums/helptips.do?#15


Rgds,
cnb
Honored Contributor

Re: Excessive Failed Hard Drives

Click on Save Report, which will save the file in ZIP format. Make sure to rename the file extension to .ZIP or .Zip (not .zip) so that the forum members can open it. Then use the attachment BROWSE button in the reply window to attach the file.

Rgds,
Rancher
Honored Contributor

Re: Excessive Failed Hard Drives

I finally found the hard drive matrix and my firmware is the latest. My last failed hard drive showed the following when I ran diagnostics:

Failed
Error: 640001: Controller has reported a SMART error on this drive
Error: 640006: The Read and/or Write HARD error rate is above threshold
This drive has experienced/recorded error conditions reported by diagnosis and requires replacement
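If you end up collecting these results for several drives, a small script can pull the error codes out of the pasted text so they are easier to compare. This is a minimal sketch that assumes every error line follows the "Error: <code>: <message>" layout shown above; the function name `parse_errors` is made up for this example.

```python
import re

# Matches lines such as "Error: 640001: Controller has reported ..."
ERROR_LINE = re.compile(r"^Error:\s*(\d+):\s*(.+)$")

def parse_errors(report_text: str) -> list[tuple[int, str]]:
    """Return (code, message) pairs for every 'Error: <code>: <message>' line."""
    found = []
    for line in report_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            found.append((int(match.group(1)), match.group(2).strip()))
    return found

# Sample text copied from the diagnostics result quoted in this post.
sample = """Failed
Error: 640001: Controller has reported a SMART error on this drive
Error: 640006: The Read and/or Write HARD error rate is above threshold
"""

for code, message in parse_errors(sample):
    print(code, message)
```

Lines that do not match the pattern, such as the "Failed" status line, are simply skipped.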
cnb
Honored Contributor

Re: Excessive Failed Hard Drives

Hmmm...

Which Version of Windows 2008 Server are you using exactly?
What Version of Firmware is the MSA60 at?
Is this MSA60 in a dual domain configuration?
What version of Firmware is the P800 controller that you just updated?
What are the drive Model & P/N's?
What version of ACU and ADU are you using (yes it makes a difference ;-))?
IMHO, ADU would be the better application to check and post errors, rather than the diagnostics.

Firmware CD 8.70 Support Guide Matrix:

ftp://ftp.hp.com/pub/c-products/servers/management/smartstart/FWServerSupportGuide8.70.pdf


If you can't post the ADU GUI report, then please try posting the CLI report:

Run the ADU CLI application:
ADU can run from the command-line to create a report text file.
The hpaducli executable is located in the directory where the ADU component was
installed, by default C:\Program Files\Compaq\hpadu\Bin.
"hpaducli -f [filename]" (filename being the file name the adu report
text file will be written to.)
For other command-line options just type:
"hpaducli -h"
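
If you want to collect the report on a schedule, the same command can be wrapped in a short script. This is a rough sketch, not an HP-supplied tool: it uses only the "-f" option described above, assumes the default install directory quoted above, and the helper names `build_adu_command` and `save_adu_report` are invented for this example.

```python
import subprocess
from pathlib import PureWindowsPath

# Default install directory as quoted above; adjust if ADU is installed elsewhere.
ADU_DIR = PureWindowsPath(r"C:\Program Files\Compaq\hpadu\Bin")

def build_adu_command(report_file: str) -> list[str]:
    """Build the argument list for 'hpaducli -f <report_file>'."""
    return [str(ADU_DIR / "hpaducli"), "-f", report_file]

def save_adu_report(report_file: str) -> int:
    """Run hpaducli and return its exit code (needs a host with ADU installed)."""
    return subprocess.run(build_adu_command(report_file), check=False).returncode

if __name__ == "__main__":
    # Only prints the command; call save_adu_report() on the server itself.
    print(build_adu_command("adu_report.txt"))
```

The command is built as a list rather than a single string so the space in "Program Files" does not need extra quoting.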

Rgds,



Rancher
Honored Contributor

Re: Excessive Failed Hard Drives

Here is some of the information you requested:
Which Version of Windows 2008 Server are you using exactly? Enterprise SP1
What Version of Firmware is the MSA60 at? 2.18
Is this MSA60 in a dual domain configuration? No
What version of Firmware is the P800 controller that you just updated? 7.08

What are the drive Model & P/N's?
146 GB SAS, 418367-B21; 450 GB SAS, 454232-B21

What version of ACU and ADU are you using (yes it makes a difference ;-))? ACU 8.28
I was under the impression that the ADU is not part of the Online Diagnostic Utility.
I do have a separate ADU on my 2003 servers, but not the 2008 boxes.
cnb
Honored Contributor

Re: Excessive Failed Hard Drives

The ADU CLI report will contain the internal error logs of the drives. Can you post the report?


Rgds,
Rancher
Honored Contributor

Re: Excessive Failed Hard Drives

I do not have this on my server: :\Program Files\Compaq\hpadu\
I do not see the ADU at all, only the ACU. I thought the ADU is not part of the ACU or the new diagnostics.
cnb
Honored Contributor

Re: Excessive Failed Hard Drives

Yep I just saw that.

According to the Release Notes it was integrated with ACU in 8.28:

Changes for ACU 8.28.X.X:

Diagnostics (ADU - Array Diagnostic Utility) is now integrated with ACU (Array Configuration Utility)
GUI interface and icon updates
Tabs control for major task categories...Configuration, Diagnostics, and Wizards
Controller/Device Dropdown control for selecting controllers and devices


Will check out 8.28 and let you know.


Rgds,
cnb
Honored Contributor

Re: Excessive Failed Hard Drives

Maybe you need to update your ACU version?

ACU (8.35) shows three tabs at the top. One of them is DIAGNOSTICS, and there the controllers are listed on the left side so you can select which one(s) to check. Once you've selected the controller(s), you should have two options: View and Generate Diagnostic Report. Click on Generate, SAVE the report as a zip file, and post it with the ZIP or Zip extension.

Rgds,
Lazarix
Occasional Advisor

Re: Excessive Failed Hard Drives

It is also possible that the backplane may need replacing on the MSA that is failing the hard drives. I have had an MSA-60 that would randomly fail 2x SATA 750GB drives in an array, yet pulling them out and putting them back in would fix it until a few months later, when they would 'fail' again.
When I called HP, they recommended replacing the backplane on the MSA-60.
Live by the sword
Rancher
Honored Contributor

Re: Excessive Failed Hard Drives

I am running ACU 8.28 on my servers and do have the diagnose tab. And, it appears that I have another drive failing. I ran Online diagnostics and this was the result:
Physical Hard Drive 61, Serial Number: D2A2P9601GPH0927, Controller Serial Number: PAFGF0N9SXK04A
Failed
Error: 640006: The Read and/or Write HARD error rate is above threshold

However, when you look at the summary for storage, it shows that everything is fine.

I also ran the diagnostics through the ACU and have attached the report.
Michal Kapalka (mikap)
Honored Contributor

Re: Excessive Failed Hard Drives

cnb
Honored Contributor

Re: Excessive Failed Hard Drives



You have mixed drive types in a RAID5 volume. Normally not a big issue but you're having problems logged on all drives in this volume, so you'll need to eliminate some possible causes.

One drive needs firmware upgraded from HPD6 to HPDA or replaced with the same drive model number as the rest of the RAID5 volume.

There is an advisory about false errors being reported by Online Diagnostics while ACU/ADU do not report any failures:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c01697688&lang=en&cc=us&taskId=101&prodSeriesId=1121474&prodTypeId=15351

IMHO I would swap out the DG146BAAJB HPD6 with another DG0146BARTP HPD0. Update the Windows Drivers, Online Diagnostics and ACU to the latest versions. Flash your P800 to 7.14. Make sure all drives are at the current firmware versions and monitor.
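
To spot an odd drive in a volume quickly, you can compare each member's model and firmware against the most common pair. This is just a sketch: the bay labels and the inventory list are invented for illustration, only the model and firmware strings are taken from this post, and `find_odd_drives` is a made-up helper name.

```python
from collections import Counter

def find_odd_drives(drives):
    """Return drives whose (model, firmware) differs from the most common pair."""
    if not drives:
        return []
    pairs = Counter((d["model"], d["firmware"]) for d in drives)
    majority, _ = pairs.most_common(1)[0]
    return [d for d in drives if (d["model"], d["firmware"]) != majority]

# Hypothetical inventory: bay labels are invented, model/firmware strings are from this post.
volume = [
    {"bay": "bay-1", "model": "DG0146BARTP", "firmware": "HPD0"},
    {"bay": "bay-2", "model": "DG0146BARTP", "firmware": "HPD0"},
    {"bay": "bay-3", "model": "DG146BAAJB", "firmware": "HPD6"},
    {"bay": "bay-4", "model": "DG0146BARTP", "firmware": "HPD0"},
]

for drive in find_odd_drives(volume):
    print(drive["bay"], drive["model"], drive["firmware"])
```

In practice you would fill the list from the physical drive section of your own ADU report rather than typing it by hand.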

Windows 2008 Drivers:
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=15351&prodSeriesId=1121474&prodNameId=3279719&swEnvOID=4022&swLang=8&mode=2&taskId=135&swItem=MTX-3336546bdc6842ca816e97d887

Update the Support Pack to version 8.30 and then update ACU & ACUCLI to 8.40. Update the Online Diags to 8.4:
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareIndex.jsp?lang=en&cc=us&prodNameId=3279719&prodTypeId=15351&prodSeriesId=1121474&swLang=8&taskId=135&swEnvOID=4022


Rgds,


cnb
Honored Contributor

Re: Excessive Failed Hard Drives

Hi,

One other thought is to swap or fail the HPD6 drive with your hot spare at 4I:16, which is already up to date and has the same part/model number as the other RAID5 members.

Rgds,

Rancher
Honored Contributor

Re: Excessive Failed Hard Drives

I will try all your suggestions as soon as our maintenance window comes up. Thank you all for the help. BTW, I had another drive fail this week in one of my Raid 6 arrays.
cnb
Honored Contributor

Re: Excessive Failed Hard Drives

Re: RAID6 failure...

You need someone to take a look at that configuration/diag report? Post it and I'll look at it if you need it.

Best Regards,

cnb