
Replacing a failed drive in a BL460 running ESXi 5.0

 
Nick Salty
Occasional Advisor


I am running ESXi 5.0 on a BL460 G1 with a RAID 1 array.

Yesterday I saw in the vCenter client that one of the disks had failed.

 

I replaced this disk with a spare from a server used for testing. This may have been a mistake.

 

Although the hardware status for the disk in ESXi no longer showed an error, the logical drive still did. Without the ACU that I am used to, I hoped that the automatic rebuild would occur.

I left it overnight, but nothing appeared to have happened in the morning.

 

I thought that if I could access the ACU, I could add the replacement disk to the RAID 1 array, and then the rebuild would happen.

So I moved the load off this production server and rebooted with the SmartStart CD mounted so that I could access the ACU.

On boot, POST offered me two options: F1 to use the logical drive and risk data loss, or F2 to disable the logical drive and continue. It had a 30-second timer, with F2 as the default option.

 

I accepted the default option as that sounded safest...

 

However, the ACU no longer shows my GOOD disk as being part of an array, and shows the replacement disk as part of a failed logical drive!!

 

So the crux of my issue is: how can I put my GOOD disk back into an array without losing the data on it?

How do I then add the replacement disk to that array and rebuild?

 

I really really don't want to have to rebuild this server!

 

ANY advice appreciated!

gregersenj
Honored Contributor

Re: Replacing a failed drive in a BL460 running ESXi 5.0

Question is: How did you do this drive replacement?

 

Normal replacement procedure for Smart Array with hot-swap drives.

Done on-line.

1. Identify the failed drive - it will have an amber failure LED lit.

2. Remove the failed drive.

3. Wait for 30 seconds.

4. Insert the replacement drive.

Automatic rebuild will begin within a few seconds (in some cases I have seen it take 3 minutes, when the spare drive was inserted too soon).

 

Some don't like to do the on-line replacement. So you can do an off-line replacement.

If you choose to shut down and replace the drive with the server powered off - do take care, especially when you re-use drives!

The Smart Array controller uses metadata to store the configuration and status. The metadata is stored in the RIS area of the disks, and there is a copy on every disk! The metadata also carries a time stamp!

 

If you boot a server (ProLiant with Smart Array controller), with 2 disks from 2 different arrays, it will use the disk with the latest time stamp as source drive.
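The time stamp rule above can be illustrated with a toy model. This is only a sketch of the selection rule as described in this post: the `DiskMetadata` class, its fields, and the numeric stamps are hypothetical and do not reflect the real on-disk metadata format.

```python
from dataclasses import dataclass


@dataclass
class DiskMetadata:
    bay: int
    array_id: str    # hypothetical label for the array the disk came from
    timestamp: int   # hypothetical stamp; larger means written more recently


def pick_source(disks):
    """Return the disk whose metadata carries the latest time stamp."""
    return max(disks, key=lambda d: d.timestamp)


# A surviving production disk and a re-used disk that still carries
# newer metadata from another server.
good = DiskMetadata(bay=1, array_id="production-raid1", timestamp=100)
reused = DiskMetadata(bay=2, array_id="test-server-array", timestamp=250)

source = pick_source([good, reused])
print(source.array_id)
```

In this model the re-used disk wins, which is why a re-used drive should have no old array configuration on it before it goes into a degraded server that is powered off.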

 

If you boot a server with a degraded Array, you will get the F1 / F2 prompt.

F1 = boot in interim recovery mode (that's what we want).

F2 = disable failed (degraded) logical drives; mostly we don't want that.

F1 is the default, except with a few FW versions. If you upgrade your FW, it should be corrected and give you F1 as the default.

 

What happens when its disabled?

It disables the logical drive(s), so no changes will happen and no rebuild will start until you enable the drive.
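As a small sketch of the F1 / F2 behaviour described above (a toy model only; the function name and the returned fields are made up for illustration and are not controller output):

```python
from enum import Enum


class Choice(Enum):
    F1 = "F1"
    F2 = "F2"


def degraded_boot_outcome(choice):
    """Model what happens to a degraded logical drive after the POST prompt."""
    if choice is Choice.F1:
        # Interim recovery mode: the drive stays usable and can rebuild.
        return {"state": "interim recovery", "rebuild_can_start": True}
    # F2: the drive is disabled until it is enabled again.
    return {"state": "disabled", "rebuild_can_start": False}


for choice in Choice:
    print(choice.value, degraded_boot_outcome(choice))
```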

 

In your case, nothing bad has happened yet (unless you have done otherwise).

 

1. Power off server.

2. Insert the original good drive in its original position.

3. Power on server.

4. Hit F1 when prompted.

5. Let the server complete the OS boot.

6. Insert the spare drive.

Unless you have other issues, it will rebuild automatically.

 

I recommend that you read the Smart Array Technology Brief.

 

BR

/jag


Nick Salty
Occasional Advisor

Re: Replacing a failed drive in a BL460 running ESXi 5.0

Thanks very much for your reply.

Somewhat sadly, I had to take some action before you had done so, but I am probably the wiser for it!

 

The BL460G1 servers only have E200i Raid Controllers, which don't appear to be very Smart, and it appears to me that auto-rebuild does not happen with ESXi installed. 

I would welcome comments on this idea.

 

 

I am comfortable with on-line rebuilds, and using Windows Server 2003/2008 that has always been ok.  Replace the failed drive, wait a moment and the drive lights (and ACU) show that the rebuild is happening.

However, in ESXi 5, I replaced the failed drive and the hardware status showed that the Disk was ok, but the logical drive was not.  There did not appear to be any disk activity on the replacement drive, and after 12 hrs the logical drive was still shown as an error.  The array is only a Raid1 72GB, and only about 20% used, so it shouldn't take too long.

 

So in order to access an ACU, I rebooted the server with a SmartStart CD mounted, and sadly, F2 was the option selected by default.  I probably should have hit Pause/Break to give me time to look up what each option meant, but instead, 30 seconds slipped by and it disabled a logical drive that it had discovered.

 

The ACU did show that A logical drive was disabled, but it was not the logical drive on the Good disk, it was a logical drive it had detected on the Replacement disk.  This should not manage to have a newer timestamp than the Good disk that was never removed from the server, and I know that the replacement disk was blank as I removed all logical drives from it previously.

The Good disk was listed in ACU as Unassigned Disk.

No matter what I did, I was not able to get the ACU to recognise that there should have been a logical drive and data on the Good disk.

After much reading of Smart Array manuals, and some reboots (no OS found), I ended up having to delete the logical drive that ACU believed was there, which erased all the data on the Good disk, and start again.

 

Happily, I did have a config backup of ESXi, and the production environment was able to run on 1 server while I rebuilt this server.  It took me about a day all-up by the time I had recovered the config, rejoined the cluster and balanced some load.

 

I did lose some ISOs and a virtual machine template, but nothing live or very important, so I count it as a lesson learned!

 

I think that the next time a disk fails in ESXi (and it will), I will move the load off the server, check whether I have anything useful stored on the local disk, reboot, select F1 and launch the ACU BEFORE replacing the failed drive, and then replace it while online in the ACU.

 

Many Thanks for your contribution

gregersenj
Honored Contributor

Re: Replacing a failed drive in a BL460 running ESXi 5.0

Sorry for the late answer.

 

First of all!

The E200 is just as intelligent as all other Smart Array controllers.

The Smart Array controllers do not care whatsoever about the operating system.

During a rebuild the controller performs a block-by-block reconstruction, even with an empty logical drive.
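To put rough numbers on the block-by-block point, here is a minimal sketch. It assumes a steady rebuild rate, and the 20 MB/s figure is a hypothetical value chosen for the arithmetic, not a measured E200 rate.

```python
def rebuild_seconds(logical_drive_gb, rate_mb_per_s):
    """Time to copy every block of the logical drive at a steady rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    total_mb = logical_drive_gb * 1000  # decimal GB to MB
    return total_mb / rate_mb_per_s


ASSUMED_RATE_MB_S = 20  # hypothetical rate, for illustration only

# The 72 GB RAID 1 from this thread. How full the datastore is does not
# appear anywhere in the calculation, because every block is copied.
full = rebuild_seconds(72, ASSUMED_RATE_MB_S)
print(f"{full / 3600:.1f} hours")
```

The space used on the volume is not an input: under this model a 20% full drive takes as long to rebuild as a full one.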

 

It seems to me that the failed drive was running as a RAID 0 with only that one drive.

 

BR

/jag
