
Predictive Failure - Actions to take?

SOLVED
Miriam Haber
Occasional Advisor

Predictive Failure - Actions to take?

We have a database serving our website that has a RAID 5 array. One of the drives is showing "Predictive Failure" status in Compaq Insight Manager.

Since we cannot take this server down, it seems the only thing to do is wait for the drive to fail and then hot-swap it.

Or is there a way to remove that drive from the RAID array before it fails? Thanks!
10 REPLIES
Mark Cloutier
Respected Contributor
Solution

Re: Predictive Failure - Actions to take?

Dear Miriam,
You are right. Even though the drive has been marked as Pre-Failure, you still have to shut down the server to replace it. There is no way around this until the drive has failed.
To protect your data you might want to look at having an on-line spare for the RAID.

Mark
We are here for a good time, not a long time!
Jim Colley_1
Occasional Contributor

Re: Predictive Failure - Actions to take?

If you have a warranty on the system, you should call for a new drive now. The predictive failure message is all you need to order a new drive. Then the drive is in your hand when the old one fails.
Doug de Werd
Honored Contributor

Re: Predictive Failure - Actions to take?

If the drive is connected to a Smart Array Controller, you should not have to shut down the system to make the change (such are the virtues of hot-plug drives!). Once you have the spare drive, simply pull the other drive out (which in effect will "fail" it) and plug in the replacement. The RAID set will automatically rebuild in the background.

However, a couple of things to remember. First, ALWAYS have a backup of your data before you do this. Second, remember that during the rebuild you do not have RAID protection until the rebuild is complete. For this reason, you may want to schedule the rebuild during off hours. Also, the speed of the rebuild depends on how much regular disk I/O is occurring, so if there is little or no disk I/O, the rebuild will complete faster.
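
To illustrate why a RAID 5 set can rebuild a pulled drive, and why there is no protection until it finishes, here is a minimal Python sketch of XOR parity on a single stripe. The four-drive layout and the block contents are made up for illustration; a real Smart Array controller does this in firmware across many stripes with rotating parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-drive RAID 5 set: three data blocks plus parity.
data = [b"web1", b"db02", b"log3"]
parity = xor_blocks(data)
stripe = data + [parity]

# Pull drive 1. Its block is recomputed from the three surviving blocks.
pulled = 1
survivors = [block for i, block in enumerate(stripe) if i != pulled]
rebuilt = xor_blocks(survivors)
print(rebuilt == stripe[pulled])

# If a second drive dropped out before the rebuild finished, only two blocks
# would remain and the missing two could not be recovered from them.
```

The rebuild has to read every surviving drive in full, which is why competing disk I/O slows it down.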

This applies to Smart Array Controllers - if you have another type of controller, then it might not work the same way.

Thanks,
Doug
Expert in ProLiant Clusters

Re: Predictive Failure - Actions to take?

Could Mark describe the circumstances that make his answer correct? As Doug mentions, I have changed drives on Smart Array Controllers using RAID 5 that were still working. I have not had to shut down the system, wait for the drive to fail, or even reboot the server when swapping the drive. But if there are times I need to shut it down or wait for the failure, I want to know when.
Miriam Haber
Occasional Advisor

Re: Predictive Failure - Actions to take?

Thank you for the feedback. Since this is a production server for our company's website, I have taken the cautious route and let it continue running with the "failing" drive in place. Oddly, though the drive is in "predictive failure" mode, it has not yet failed.

I called Compaq tech support and they said I may be getting a false alert. Usually, the drive would fail within 24 hours of the "Predictive Failure" status message. There is a patch I need to add to prevent false alarms. However, since the patch also requires a reboot, it will have to wait until the next scheduled maintenance.
Mark Cloutier
Respected Contributor

Re: Predictive Failure - Actions to take?

For clarification:
If Insight Manager shows a drive to be in Pre-failure, the server has to be shut down before the drive can be removed/replaced. The Array Controller still sees the drive as a good working drive and will continue to access it. If you remove the drive while the Array Controller is striping data to it, you might end up with corrupt data. Therefore, if the light on the drive has NOT changed to show it as bad, shut down the server to remove it.
Many people do just remove the drive when Insight Manager reports it as Pre-failure. This could corrupt data or even the RAID set.
Does this explanation help?

Mark
We are here for a good time, not a long time!

Re: Predictive Failure - Actions to take?

Mark's clarification helps. Thanks. If the RAID is set up with an on-line spare, will the spare become active when one drive shows "Predictive Failure", or will it wait until the drive fails?
Mark Cloutier
Respected Contributor

Re: Predictive Failure - Actions to take?

Unfortunately no.
The on-line spare will wait for the drive to fail before it becomes active.
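
As a rough sketch of that rule (the status names and the function are hypothetical, not anything the controller or Insight Manager exposes), the spare's behavior can be written as:

```python
from enum import Enum

class DriveStatus(Enum):
    OK = "ok"
    PREDICTIVE_FAILURE = "predictive failure"
    FAILED = "failed"

def spare_should_activate(status: DriveStatus) -> bool:
    """Per the rule stated above: the spare takes over only on an actual failure."""
    return status is DriveStatus.FAILED

for status in DriveStatus:
    print(f"{status.value}: spare active = {spare_should_activate(status)}")
```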

Mark
We are here for a good time, not a long time!

Re: Predictive Failure - Actions to take?

No reason why you shouldn't be able to unplug the pre-failure drive and replace it.

As a matter of fact, we've done this on quite a few servers to replace 18GB 7.2K drives with 36GB 10K drives. After all drives had been replaced, we wound up with lots of extra space on the array. We then either created extra partitions, spanned partitions, or resized partitions with Partition Magic (all require reboots, unfortunately).
John Marinov
Frequent Advisor

Re: Predictive Failure - Actions to take?

Um. With hot swap drives? And array controller?

Provided you have hot swap drives and an array controller, you can most definitely remove the drive which is showing "Predictive Failure" while the server is running. That is what you paid extra for.

Removing the drive will fail the array (shows up as yellow), and putting a new drive in will trigger a rebuild event. You should see the RAID controller start to rebuild the RAID set.
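
The sequence just described can be sketched as a tiny state table in Python. The state and event names are my own labels for illustration ("degraded" stands for the yellow, interim-recovery condition), not terms reported by the controller:

```python
# Hypothetical state table for the hot-swap sequence described above.
TRANSITIONS = {
    ("ok", "pull_drive"): "degraded",            # array shows yellow
    ("degraded", "insert_replacement"): "rebuilding",
    ("rebuilding", "rebuild_done"): "ok",
}

def next_state(state, event):
    """Return the new array state; events that do not apply leave it unchanged."""
    return TRANSITIONS.get((state, event), state)

state = "ok"
history = [state]
for event in ["pull_drive", "insert_replacement", "rebuild_done"]:
    state = next_state(state, event)
    history.append(state)

print(" -> ".join(history))
```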

If you get the wrong drive, you're still OK because predictive failure is exactly that: Insight thinks the drive will fail because of some measurements. But for now it is still working.

If you pull a drive and then put it back (in the same slot), hot, the array will still be OK. There will be no net effect.

It is possible that the predictive failure is caused by an old drive BIOS level. There is a customer advisory out about this.

Things to watch out for:
Make sure the drive was previously erased. Not too much of a problem when doing hot swaps, but likely to cause much angst if you shut down first.

Make sure that the drive BIOS revisions are up to date.

If the server shuts down when the array is failed, be careful when starting the system. The two prompts will be "Fail the array" and "Fail the drive and continue with interim recovery". Be sure to choose "Continue with interim recovery".

How do I know? Experience.