ProLiant Servers (ML,DL,SL)

DL380 G6 with P410i controller and a drive in predictive failure status


...and I was going to yank and replace this drive once I verified that all of the other drives were good. However, another support person I work with claims that, since this drive is still actively being written to (unlike a failed drive, which would force the array to use the parity info), hot-swapping it may cause data loss. I was further urged to cold-swap the drive and then boot into the array boot utility to rebuild the array. I don't mind doing this, but I reeaaaaallly want to reduce downtime. A quick search online yields mixed opinions on what to do, which brings me here. I'd love to hear your opinion. Thanks!

waaronb
Respected Contributor

Re: DL380 G6 with P410i controller and a drive in predictive failure status

Nah... it's actually better (if you ask me) to replace the drive when it's running, rather than in a power-down situation.

In my experience, powering down and then replacing the drive can *sometimes* confuse the system, especially if the new drive came from some other system and still has array config info on it.

I've seen a few recent reports of people pulling a drive to replace it and having the whole array go offline, but I'm not convinced their arrays were healthy to start with.

I've never had an array conk out when pulling a drive for replacement, unless it was a stupid mistake I made like maybe I replaced one drive for some reason and then didn't wait until the drive rebuilt before pulling out another one. In a situation like that, yeah, until it's rebuilt, you do NOT have redundancy.

So just make sure that an array with a drive in predictive failure is still fully redundant, with no other drives out. And whether the drive is predicted to fail or has already failed, make sure you grab the right one. If you're not sure, start up the array config utility (ACU or SSA) and turn on the failing drive's light. Make it blink. Then you'll know you're getting the right one.

After all, if you can't pull a drive to replace it in a nice, controlled way, then how can you trust the array to protect you if that same drive failed on its own in the middle of the night?
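The pre-pull check described above (no other drive already failed or still rebuilding) can be sketched as a small function. This is a minimal illustration under assumptions: the drive states and bay labels are hypothetical strings, not output from ACU/SSA or any HP tool, and the caller supplies the fault tolerance (assumed 0 for RAID 0, 1 for RAID 1/5, 2 for RAID 6).

```python
from dataclasses import dataclass

# Hypothetical drive states for illustration only; not ACU/SSA output strings.
OK = "ok"
PREDICTIVE = "predictive_failure"
FAILED = "failed"
REBUILDING = "rebuilding"


@dataclass
class Drive:
    bay: str    # hypothetical bay label, e.g. "2"
    state: str  # one of the states above


def safe_to_pull(drives, target_bay, fault_tolerance=1):
    """Return True if pulling target_bay keeps the array within its fault tolerance.

    fault_tolerance is how many drives the array can lose at once
    (assumed: 0 for RAID 0, 1 for RAID 1/5, 2 for RAID 6).
    """
    if not any(d.bay == target_bay for d in drives):
        raise ValueError(f"no drive in bay {target_bay}")
    # Other drives that already cost redundancy: failed, or not finished rebuilding.
    lost = [d for d in drives
            if d.bay != target_bay and d.state in (FAILED, REBUILDING)]
    # Pulling the target loses one more drive.
    return len(lost) + 1 <= fault_tolerance


array = [Drive("1", OK), Drive("2", PREDICTIVE), Drive("3", OK)]
print(safe_to_pull(array, "2"))  # True: the other drives are healthy
```

A predictive-failure drive elsewhere in the array is not counted against the tolerance here, since it is still online; a drive that is still rebuilding is, which matches the warning about pulling a second drive before the first has finished rebuilding.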
RbBsmn
Regular Advisor

Re: DL380 G6 with P410i controller and a drive in predictive failure status

HP advises replacing predictive-failure drives while the server is down. There is an advisory about that too; I'll try to dig it up for you when I'm on my PC.
Nine times out of ten a hot-plug swap works too, but of course HP has to advise customers of the safest way.
By the way, if you have more than one predictive failure in the same array, please do not shut down the server. Instead, hot-swap the drives one at a time, and only for the last predictive failure should you shut down the server and replace that last drive.

Had a client with 5 out of 5 hard drives on predictive failure a few weeks ago and we replaced all of them without any problems with the logical drive. :)
Did my post help? Thank me with kudos! :)

Re: DL380 G6 with P410i controller and a drive in predictive failure status

Thanks for the reply.

Just to make sure I understand:

With only one drive in predictive failure, we should power down the server and replace the drive. Once we reboot, it should automatically rebuild even while the system is in use, right? Or should we boot into some RAID utility to perform the rebuild? If it's the latter, since the server will be down, how long does this typically take?

 

If there are multiple drives in predictive failure, first hot-swap any failed drives, one by one, and let them rebuild. Then do the same for all but the last drive in predictive failure status. Finally, power down the server and replace that last predictive-failure drive as above.

 

Sound right?
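The replacement order in this summary, which RbBsmn confirms in the next reply, can be written as a small planning function. It is a sketch of the thread's advice, not an HP procedure; the action names and bay labels are hypothetical.

```python
def replacement_plan(failed_bays, predictive_bays):
    """Return an ordered list of (action, bay) steps; a sketch, not an HP procedure."""
    steps = []
    # Failed drives first: the array is already running without them, so hot swap
    # each one and let it finish rebuilding before touching the next drive.
    for bay in failed_bays:
        steps += [("hot_swap", bay), ("wait_rebuild", bay)]
    # Every predictive-failure drive except the last: same one-at-a-time hot swap.
    for bay in predictive_bays[:-1]:
        steps += [("hot_swap", bay), ("wait_rebuild", bay)]
    # Last predictive-failure drive: shut down, swap, boot; the rebuild starts by itself.
    if predictive_bays:
        last = predictive_bays[-1]
        steps += [("shutdown", None), ("cold_swap", last),
                  ("power_on", None), ("wait_rebuild", last)]
    return steps


for step in replacement_plan(failed_bays=["1"], predictive_bays=["2", "3"]):
    print(step)
```

With a single predictive-failure drive and nothing failed, the plan reduces to the cold swap alone; with five of five drives predicted to fail, it hot-swaps four and cold-swaps the fifth.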

RbBsmn
Regular Advisor

Re: DL380 G6 with P410i controller and a drive in predictive failure status

Yes, that is indeed correct. The drive will start rebuilding by itself; however, during POST you might get a message that data was found in the cache module.
RbBsmn
Regular Advisor

Re: DL380 G6 with P410i controller and a drive in predictive failure status

The before mentioned advisory:
Title: HP ProLiant Servers/c-Class BladeSystems - Predictive Drive Failure Replacement

mmr_kc-0117321 (Support Information, KCS - ProLiant Servers, Public)

Environment

FACT: HP ProLiant Servers
FACT: C-class blade systems

Questions/Symptoms

SYMPTOM: Drive shows a flashing amber LED
SYMPTOM: ADU shows: Message: Physical Drive State: Predictive failure. This physical drive is predicted to fail soon.
SYMPTOM: Predictive Drive Failure Replacement

Cause

CAUSE:The hard drive is in predictive failure state.

Answer/Solution

FIX:
1. Gracefully shut down the server.
2. Remove the failed disk.
3. Insert a new disk in the same slot.
4. Boot the server.

NOTE: If an online hard disk drive is pulled for replacement while the server is powered on, there is a chance of losing data. "Online" means that the HDD has not fully failed. A failed HDD is indicated by the solid amber/red FAULT LED being on and ACU showing that the physical drive has failed.

© Copyright 2014 Hewlett-Packard Development Company, L.P.
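The advisory's key distinction (a drive is only "failed" when the FAULT LED is solid and ACU reports it failed; otherwise it is still online) can be expressed as a tiny decision function. This is a hedged sketch: the input strings are hypothetical, and the advisory only warns against pulling online drives, so treating a fully failed drive as hot-swappable is an inference from its NOTE rather than an explicit instruction.

```python
def swap_method(fault_led, acu_state):
    """Choose a swap method following the advisory's failed-vs-online distinction.

    Inputs are hypothetical strings, e.g. fault_led="solid" or "flashing",
    acu_state="failed" or "predictive_failure".
    """
    # Per the advisory, a drive counts as failed only when the FAULT LED is solid
    # AND ACU reports the physical drive as failed.
    if fault_led == "solid" and acu_state == "failed":
        # Not online any more, so the NOTE's data-loss warning does not apply.
        return "hot swap"
    # Anything else (flashing LED, predictive failure, mismatched signals) is treated
    # as still online: follow the FIX steps and shut down first.
    return "graceful shutdown, then swap"


print(swap_method("flashing", "predictive_failure"))  # graceful shutdown, then swap
```

Falling back to the shutdown path whenever the two signals disagree is the conservative choice: it only takes the hot-swap path when both indicators say the drive is already out of the array.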

lightman
Advisor

Re: DL380 G6 with P410i controller and a drive in predictive failure status

RbBsmn thank you for this info.
I was reading the thread and didn't know about the difference between pulling an online (predictive failure) disk from an array and pulling a failed one. Very important information.
thank you

Re: DL380 G6 with P410i controller and a drive in predictive failure status

Thanks for the replies!

 

I have one more question which is actually implied in the responses but I want to drive it home:

 When I reboot the server after replacing the drive, I need not do anything else, right? It will boot (partially??) and then rebuild offline without any input by me, correct?  I am actually sending a drive to the server site and plan on walking someone through the process and I'm just trying to cover my bases. Thanks!

PZel
Valued Contributor

Re: DL380 G6 with P410i controller and a drive in predictive failure status

The way i handle a predictive failure:

1) First check whether it is part of a RAID1/RAID5/6 array. In case it's RAID0, back up as much data as possible.

2) When it's redundant (RAID1/5/6), locate the drive (via Start in the System Management Homepage).

3) Power Down the server

4) Physically remove the faulty drive

5) Power up the server. It will come up with a POST error offering either to disable the array (F1) or to fail the drive and run in Interim Recovery Mode (F2). Choose <F2>, because <F1> will disable the array and then you cannot start the O/S.

Because a redundant array can start up with one disk missing, the O/S will start up.

6) Wait a moment (until the SCSI bus is less busy: approx. 2 minutes).

7) Physically insert the new disk in the exact same location you pulled the faulty drive from.

8) Sometimes the new drive is initialized first (red for a short time); then wait until the 'middle' light is flashing. At that point it's rebuilding from the good disk(s) in the array.

PZ
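PZel's procedure can be captured as an ordered checklist, which may help when walking someone on site through the swap. This is a sketch with paraphrased step wording; the function name is hypothetical, and the set of redundant RAID levels is limited to the ones PZel lists.

```python
# RAID levels PZel treats as redundant; anything else (e.g. RAID 0) is not.
REDUNDANT_LEVELS = {1, 5, 6}


def predictive_failure_checklist(raid_level):
    """PZel's steps as an ordered checklist (a sketch; wording paraphrased)."""
    if raid_level not in REDUNDANT_LEVELS:
        # No redundancy: removing any drive takes the logical drive down.
        return ["back up as much data as possible"]
    return [
        "locate the drive (Start in the System Management Homepage)",
        "power down the server",
        "remove the faulty drive",
        "power up; at the POST prompt press F2 (interim recovery), not F1",
        "wait about 2 minutes for SCSI bus activity to settle",
        "insert the new drive in the same bay",
        "wait until the middle LED flashes (rebuilding)",
    ]


for number, step in enumerate(predictive_failure_checklist(5), start=1):
    print(f"{number}) {step}")
```

Returning a plain list keeps the checklist easy to print or paste into instructions for whoever is at the server.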
waaronb
Respected Contributor

Re: DL380 G6 with P410i controller and a drive in predictive failure status

Hmm... in all my years of replacing predicted (or actually) failed drives, I've always hot-swapped in a new drive while the system was running. Never had a problem.

I guess the HP advice to shut down probably is just to make extra sure, but like I said, if it can't handle a drive being pulled out to exchange with a new one while the system is running, why would I trust it to protect me if one of the drives just plain died while the system was running?

It wouldn't make me feel good about the reliability of the controller if HP is saying they won't guarantee it'll work if you hot swap a drive, and fortunately I don't think HP really thinks that. :)

The whole selling point of hot-swappable drives is the reduction of down-time to replace failed parts. It'd be like HP saying you couldn't hot-plug redundant power supplies either and that you should shut the system down before doing it. Why make it hot pluggable in the first place?