StoreVirtual Storage

JDohrmann
Occasional Advisor

Drive Predictive Failure

Hello everyone, I have an HP LeftHand node (P4500 G2) that is reporting a predictive failure on a drive that has just been replaced. The node started reporting the failure, so we replaced the drive with a new one. The rebuild completed and everything seemed fine until the system started reporting the predictive failure again on the new drive. Does anyone have experience with this situation, and if so, what is the solution?

9 REPLIES
hafeezy2j
Occasional Visitor

Re: Drive Predictive Failure

I'm experiencing the same issue and have replaced the same drive at least 2-3 times. We recently renewed our HPE support just because of this. I have opened a case with HPE support and will keep you posted once I get a reply from them.

Stor_Mort
HPE Pro

Re: Drive Predictive Failure

Hi JD, 

This can happen if another disk in the raid set is failing in a way that creates interference on the SAS bus, such as a transaction to disk A that consistently times out and results in a bus reset. If disk B often handles a transaction right after disk A, the disk B transaction will be aborted and increase the error count, even though it was not faulty.

Open an HPE support case and be sure to state that you have replaced the same disk on this node more than once. Ask for log analysis to identify other issues with the disk array.

I am an HPE employee - HPE SimpliVity Support


JDohrmann
Occasional Advisor

Re: Drive Predictive Failure

Thanks for the info, Stor_Mort. Unfortunately we haven't renewed support on this node, which is why I am looking for information here. If another failing disk is affecting the reported disk, why doesn't that disk get reported as failing too? Where would I look in the logs to isolate the problem? Is there any solution that doesn't involve engaging support?

Stor_Mort
HPE Pro

Re: Drive Predictive Failure

The algorithm that flags a predictive failure tends to look at media errors. Timeouts and resets are given less weight.

On the system management home page, in the Storage section, you should be able to generate and view an ADU log. Each disk's statistics are listed in the ADU report. Look particularly for 'other time outs' and hardware errors. You can also find the ADU report in a CMC support bundle.


JDohrmann
Occasional Advisor

Re: Drive Predictive Failure

Thank you very much for the information. I have looked through the hpadu logs and the only disk that is reporting any "hardware errors" is the newly replaced disk. I didn't find any disks with time outs. Am I better off just using spare disks and replacing each disk until I find the one actually causing the error?

Stor_Mort
HPE Pro

Re: Drive Predictive Failure

There's another log file in the support bundle that is rather inscrutable, but may be helpful. In var/log/slot-logs.tar.gz, pull out the slot.3 file with 7-Zip or a similar utility. Open the slot.3 file with WordPad or your favorite text editor. You may see a bunch of lines like these examples, or something else.

[11/03 19:20:18]PR 80b3f940h:D017 Op=28 PLErr=02 IopErr=04 S=02 KCQ=3:11:00
[02/08 12:53:57]SC-URE: p_orig=3 p_op=0 l_type=0 D014 block=0x00000000_420D94F8 count=8 bad=0x00000000_420D94F9

[02/17 20:17:33]Logging media error, D009 block=0x00000000_0051F64D info=0x00000044_00000001 count=0x8 flags=0x0

Don't get freaked out if the slot log is full of messages. Disk drives have enormous amounts of redundancy and fault tolerance built in. The above messages are ordinary media errors from which disk drives, in most cases, automatically recover. These messages can sometimes give helpful clues if a drive is behaving badly but not failing.

The tricky part is that the drive is identified by the SCSI ID as D017, D014 or D009. In systems with more than 8 drives like the P4500 G2, you need to subtract 7 to get the bay number, i.e. bay 10, bay 7 and bay 2 in this example. For 8-drive systems, add 1 to the D000 number to get the bay number.

In case anyone doesn't know, bay 2 is in the leftmost column,  middle row. Bay 7 is in the third column from the left, top row. Bay 10 is in the rightmost column, top row.
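The ID-to-bay rule above can be sketched in a few lines of Python. This is only an illustration based on the three example lines quoted in this post: the regular expression and the function names `bay_number` and `bays_in_line` are made up for the example, and other firmware versions may format slot log lines differently.

```python
import re

# Matches drive identifiers such as D017 or D009 in a slot log line.
# The word boundaries keep it from matching a "D" inside a hex block address.
DRIVE_ID = re.compile(r"\bD(\d{3})\b")

def bay_number(scsi_id: int, more_than_8_drives: bool = True) -> int:
    """Apply the rule from the post: subtract 7 on systems with more than
    8 drives (e.g. P4500 G2), add 1 on 8-drive systems."""
    return scsi_id - 7 if more_than_8_drives else scsi_id + 1

def bays_in_line(line: str, more_than_8_drives: bool = True) -> list[int]:
    """Return the bay number for every Dnnn identifier found in one log line."""
    return [bay_number(int(n), more_than_8_drives) for n in DRIVE_ID.findall(line)]

if __name__ == "__main__":
    examples = [
        "[11/03 19:20:18]PR 80b3f940h:D017 Op=28 PLErr=02 IopErr=04 S=02 KCQ=3:11:00",
        "[02/08 12:53:57]SC-URE: p_orig=3 p_op=0 l_type=0 D014 block=0x00000000_420D94F8 count=8 bad=0x00000000_420D94F9",
        "[02/17 20:17:33]Logging media error, D009 block=0x00000000_0051F64D info=0x00000044_00000001 count=0x8 flags=0x0",
    ]
    for line in examples:
        print(bays_in_line(line), line[:16])
```

Run against the three example lines, this gives bays 10, 7 and 2, matching the numbers worked out by hand above.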


JDohrmann
Occasional Advisor

Re: Drive Predictive Failure

This is really very good information, thank you so much for taking your time to help. 

I can see in the log where I replaced the drive and it started the rebuild process, but almost immediately it started throwing those media errors again, as if it didn't matter that the disk was new.

Like you said, that log file is full of those messages. These are the most recent:

[03/01 17:13:31]Clear media error D010 block=0x00000000_29E27496 info=0x00000044_00000001 count=0x8 flags=0x1
[03/01 17:13:31]Clear media error D010 block=0x00000000_2D4E0A09 info=0x00000044_00000001 count=0x8 flags=0x1
[03/01 18:13:38]Logging media error, D010 block=0x00000000_29E27496 info=0x00000044_00000001 count=0x8 flags=0x0
[03/01 18:13:38]Logging media error, D010 block=0x00000000_2D4E0A09 info=0x00000044_00000001 count=0x8 flags=0x0
[03/01 18:13:38]Clear media error D010 block=0x00000000_29E27496 info=0x00000044_00000001 count=0x8 flags=0x1
[03/01 18:13:38]Clear media error D010 block=0x00000000_2D4E0A09 info=0x00000044_00000001 count=0x8 flags=0x1
[03/01 19:13:49]Logging media error, D010 block=0x00000000_29E27496 info=0x00000044_00000001 count=0x8 flags=0x0
[03/01 19:13:49]Logging media error, D010 block=0x00000000_2D4E0A09 info=0x00000044_00000001 count=0x8 flags=0x0
[03/01 19:13:49]Clear media error D010 block=0x00000000_29E27496 info=0x00000044_00000001 count=0x8 flags=0x1
[03/01 19:13:49]Clear media error D010 block=0x00000000_2D4E0A09 info=0x00000044_00000001 count=0x8 flags=0x1
[03/01 20:10:31]Logical drive 1 has not completed a Surface Analysis pass in 182 days.
[03/01 20:13:56]Logging media error, D010 block=0x00000000_29E27496 info=0x00000044_00000001 count=0x8 flags=0x0
[03/01 20:13:56]Logging media error, D010 block=0x00000000_2D4E0A09 info=0x00000044_00000001 count=0x8 flags=0x0
[03/01 20:13:56]Clear media error D010 block=0x00000000_29E27496 info=0x00000044_00000001 count=0x8 flags=0x1
[03/01 20:13:56]Clear media error D010 block=0x00000000_2D4E0A09 info=0x00000044_00000001 count=0x8 flags=0x1

They are all referencing Drive 3 and the system seems to clear them as fast as it reports them.  If there is another drive causing problems it doesn't seem to be reporting it.
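For anyone wading through a long slot log, a small Python sketch like the one below can tally the "Logging media error" and "Clear media error" lines per drive and block, which makes it easier to see whether the same blocks keep coming back. The regular expression is derived only from the lines pasted above, and the function name `summarize` is made up for this example, so treat it as a starting point rather than a parser for every slot log format.

```python
import re
from collections import Counter

# Pattern based solely on the media error lines quoted in this thread.
MEDIA_EVENT = re.compile(
    r"^\[(?P<stamp>[^\]]+)\]"
    r"(?P<kind>Logging media error,|Clear media error)\s+"
    r"D(?P<drive>\d{3})\s+block=(?P<block>0x[0-9A-Fa-f_]+)"
)

def summarize(lines):
    """Count logged/cleared media error events per (Dnnn number, block)."""
    counts = Counter()
    for line in lines:
        m = MEDIA_EVENT.match(line.strip())
        if m is None:
            continue  # skip unrelated lines, e.g. Surface Analysis notices
        kind = "logged" if m["kind"].startswith("Logging") else "cleared"
        counts[(int(m["drive"]), m["block"], kind)] += 1
    return counts

if __name__ == "__main__":
    # "slot.3" is the file extracted from var/log/slot-logs.tar.gz
    with open("slot.3", encoding="utf-8", errors="replace") as fh:
        for (drive, block, kind), n in sorted(summarize(fh).items()):
            print(f"D{drive:03d} block={block} {kind}: {n}")
```

The drive number printed here is the raw D0xx value from the log, not the bay number. In the excerpt above, the same two blocks on D010 are logged and cleared about once an hour, and a tally like this shows that pattern at a glance.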

Stor_Mort
HPE Pro

Re: Drive Predictive Failure

Yeah, this is one of those situations where there is not a clear answer. Is it possible that the drive you replaced is actually faulty? The only way to find out would be to replace it again. But before you do that, pull it out and look at the backplane connector carefully with a flashlight. Any bent pins or suspicious wear?


hafeezy2j
Occasional Visitor

Re: Drive Predictive Failure

We had a similar issue where drive 7 was failing but HPE was also seeing errors on drive 6. We replaced drive 7 first and waited for the RAID rebuild to finish, then replaced drive 6, but as soon as the RAID rebuild completed for drive 6, we got a predictive failure for drive 2. According to HPE support, they were now seeing errors on drive 2 and drive 11. Rather than simply replacing the drives, they did something different this time.

They stopped the manager service and removed the affected node from the management group, replaced drive 2, reconstructed RAID 5 on it, and then did exactly the same for drive 11. Once that was done, they added the node back to the management group and started the manager service. It's been two days now and the volumes are striping from NODE-01 to NODE-02.