
PL5500, disk failure cause entire array to show failed

Andrew Horne
Frequent Advisor

Hi,

I have had this problem twice on one server and once on another over the last month.

Two servers both running NW6SP3, one has PSP700 the other is still at PSP630.

A disk in an array fails and is recorded as failed in the Integrated Management Log (IML), but the entire array, and any other array on the same bus, is also taken offline. The server can be restarted, and the array then either reports that the previously failed disk has recovered or been replaced, or it takes a couple of restarts before it treats the failed disk as OK. The failed disk should be flagged by the indicator light on its front, but this doesn't always happen.

The first time this occurred, the disk did show as failed by its indicator light, but when it was removed from the server all arrays on the same bus failed.

The second time was on the second machine. The array went offline and the IML showed a particular disk as failed, although no failed indicator light was seen. If the disk was moved to another slot, it then showed as failed.

This server could be restarted if the disk was in the original slot and would continue functioning but then would fail and go offline after a matter of hours. Replacing the indicated disk seemed to resolve the issue.

The third time was back on the first server to have this error. The arrays on the same bus as the failed disk all went offline. This failed disk was in a different array from the disk that failed earlier on this machine. This machine also had its array controller (SA3200) replaced after the first failure, along with the backplane and drive cage.

No changes have been made on these machines and they have previously not had this problem if a disk has failed. These are the only two servers of this model that we have and the only ones to have this issue, one is at a remote site and the other is in our main machine room with our other servers.

Has anyone seen issues similar to this? Why are the disks no longer failing properly but becoming intermittent failures and why do all arrays on the same bus dismount in sympathy?

Thanks,

Andy..
3 REPLIES
Jon Ward
Trusted Contributor

Re: PL5500, disk failure cause entire array to show failed

Consider:

- Run the Array Diagnostics Utility (ADU) from SmartStart, save a report, and view it in Notepad or another text viewer.

- Be sure to use the latest version of CPQARRAY.HAM and CPQSHD.CDM. CPQSHD.CDM should load prior to any HAM driver that may call disk CDMs.

- Check the ongoing status of the arrays using the hp Management Agents for Netware through Insight Manager or http://hostname:2301.
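
As a quick sanity check before relying on the web page mentioned above, something like the following could confirm that the management agent port answers at all. This is only a sketch: it tests plain TCP reachability of port 2301 and says nothing about array health, the function name is made up for illustration, and "hostname" is the same placeholder used in the URL above.

```python
import socket


def agent_port_reachable(host: str, port: int = 2301, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only checks that something is listening on the
    port; it does not query the management agents or read array status.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # "hostname" is a placeholder; substitute the real server names.
    for server in ["hostname"]:
        state = "reachable" if agent_port_reachable(server) else "NOT reachable"
        print(f"{server}:2301 {state}")
```

If the port is not reachable, the agents may not be loaded on that server, so the status page would not reflect the array state either way.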
SAKET_5
Honored Contributor

Re: PL5500, disk failure cause entire array to show failed

Hi Andy,

Before troubleshooting this issue further, I would confirm the following:

System BIOS is up-to-date.
Array Controller FW is up-to-date.
FW on the drives is up-to-date.
Proliant Support Pack is up-to-date.

After upgrading the ProLiant Support Pack, which contains the management agents, check again and look specifically for SCSI read/write errors on the suspected drives.

What happens when you reseat the affected drives? Do they come back good?

The next thing I would try is to apply M&P (Monitoring and Performance Patch) to the drives.

What do you see when you run ProLiant Server Diagnostics (the INSPECT utility on the SmartStart CD) and/or ADU (Array Diagnostics Utility)?

Let us know the results to the above.

Hope it helps and please don't forget to assign points:)

Regards,


Andrew Horne
Frequent Advisor

Re: PL5500, disk failure cause entire array to show failed

Hi Jon, Hi Saket,

BIOS on both machines is up to date.
Array f/w is the latest on both machines.
Drive f/w is not always up to date.
PSP is the latest on one but not on the other, as stated in the original message.

Status is always monitored with Insight Manager. We also monitor MIBs, so we are alerted as soon as an event happens. These errors give no warning, just a complete dismount of all arrays with an indication of the failed disk in the IML.

ADU was run when the first machine had errors but the arrays only went offline when the definitely failed disk was pulled. ADU showed that there was a disk missing from the array. No other errors were seen on the machine from the ADU or other diagnostics from Smart Start.

The last two failures were the ones where the drives would be found again after a restart. Whether the drives were reseated or just left in place, a prompt at startup reported that the original disk was now functioning or had been replaced. On the first of these two I moved two disks, including the supposedly failed one, from the lower cage to the upper. The supposedly failed one then showed as failed by its indicator light at startup. Moving the disks back to their original location showed no errors, and the machine restarted OK.

CPQARRAY and CPQSHD are the latest versions according to the respective PSPs that the servers are running.

These two servers have been running for four years and have behaved as expected during that time. When a disk has failed, the failure has been clearly indicated, and the disk has always been hot swapped and the array rebuilt without problem. It is only in these last three events that they have stopped behaving correctly.

I find it odd that out of thirty Compaq/HP servers we should suddenly start having very similar failures on two almost identical machines where no recent changes have been made. These are amongst the oldest machines we have that are still in production so the problem should age out eventually.

Thanks for your suggestions. I will troubleshoot further if the problem arises again and keep an eye on the r/w stats in Insight Manager. Replacing the disk which appears to have failed has removed the problem on all occasions. I am just concerned that a disk failure now breaks the service rather than allowing online replacement and recovery.

Thanks,

Andy..