ProLiant Servers (ML,DL,SL)

Michael Oberholtzer
Occasional Advisor

Single drive failure causes logical RAID5 failure

Unfortunately the server is not under a service contract, so I thought I would post to see if anyone has thoughts or suggestions.

The question in short:
Is it possible to convince/force an SA6400 to attempt to boot using RAID set members that it thinks are bad?

Server: DL380 G3
Cntrl: SA6400
Phys. Drives: 6x146GB
Logical Drvs: 5-drive RAID5 (ID 0-4),
1 hot spare (ID 5)

It appears that drive ID 3 failed and caused some error on the SCSI bus that confused the controller or corrupted info on some of the other drives.

The server was running fine until a reboot attempt following the quarterly Microsoft patch installation. No errors had ever been observed on the drive array. The server never came back up after the Windows restart. I was not physically watching the box during the reboot, so it's unknown what errors, if any, were displayed during shutdown and the initial reboot attempt.

After the reboot, the status of the lone logical drive was failed. In fact, with all drives connected, the SA6400 showed 0 logical drives available. With the apparently failed drive ID 3 removed, the SA6400 showed the other drives as present: ID 2 was labeled OK, but IDs 0, 1, and 4 were shown as requiring replacement.

We've already restored the data to another box but thought this might be a valuable learning experience.

Regards
2 REPLIES
Mark Matthews
Respected Contributor

Re: Single drive failure causes logical RAID5 failure

Hi Michael,

I wonder why the online spare didn't kick in?

Unless it happened like this...

ID3 failed, so ID5 kicked in
Another ID failed while ID5 was rebuilding, or perhaps when you rebooted...
It's RAID5, so you can't lose 2 drives
When it rebooted (and you weren't watching), the array controller flagged up that horrible F1 or F2 prompt (I've killed someone's server by accident by not fully understanding this!)

And it DEFAULTS to F2 within 30 secs, which is "Fail logical drives" or something similar. I wish they would change that!
(F1 is "Continue with logical drives disabled")
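
To illustrate why RAID5 can lose one member but not two, here is a minimal Python sketch of XOR parity on a single stripe. It is a toy model with invented block contents, not the Smart Array's actual on-disk layout (real controllers rotate parity across the members); the names xor_blocks and rebuild are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a 5-member RAID5 set: 4 data blocks + 1 parity block.
# Block contents are invented for illustration.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stripe = data + [xor_blocks(data)]

def rebuild(stripe, missing):
    """Rebuild the single missing member from the surviving ones."""
    survivors = [blk for i, blk in enumerate(stripe) if i != missing]
    return xor_blocks(survivors)

# Lose member 3 only: XOR of the other four returns the lost block.
print(rebuild(stripe, 3) == stripe[3])

# Lose members 3 and 4: the survivors only yield the XOR of the two
# lost blocks, which is not enough to separate them again.
both = xor_blocks(stripe[:3])
print(both == xor_blocks([stripe[3], stripe[4]]))
```

With one member gone the missing block is fully determined; with two gone there is one equation and two unknowns, which is why a second failure during a rebuild takes the logical drive down.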

Not much help, I'm afraid, but it could be an explanation.

Thanks
Mark...


gregersenj
Honored Contributor

Re: Single drive failure causes logical RAID5 failure

The old SCSI bus is a parallel bus, and if you are very unlucky, a single drive can cause a disaster.

You could have a bad backplane, SCSI cable, or array controller.

If you get a multiple drive failure, the Smart Array controller (SA) will stop!
You can power down the box and reseat the drives. If that brings them back, the SA will prompt you, saying that drives previously marked bad appear to be back online, and ask whether you want to re-enable the LUN(s).
If that succeeds, you can then replace the drives one by one, as quickly as possible.

If you have a bad backplane, cable, or SA, then those can be replaced.

If you really do have a multiple disk failure, then there is only your backup.
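
As a rough sketch of the rule above, the state of a RAID5 logical drive can be classified by how many members are unavailable. This is a simplified model of the behaviour described in this thread, not the controller's firmware logic; the function name raid5_status is hypothetical and hot spares are ignored.

```python
def raid5_status(members, unavailable):
    """Classify a RAID5 logical drive by its count of unavailable members.

    Simplified model: ignores hot spares and rebuild progress.
    """
    if members < 3:
        raise ValueError("RAID5 needs at least 3 members")
    if not 0 <= unavailable <= members:
        raise ValueError("unavailable must be between 0 and members")
    if unavailable == 0:
        return "OK"
    if unavailable == 1:
        return "interim recovery"  # still running, no redundancy left
    return "failed"                # two or more members gone

# The 5-member set from the first post:
print(raid5_status(5, 1))  # only ID 3 lost
print(raid5_status(5, 4))  # ID 3 pulled, IDs 0, 1 and 4 flagged bad
```

If reseating the drives brings the count of unavailable members back down to one, the logical drive can run again in interim recovery mode until the bad drive is replaced.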

When hard drives have been running for many years, a power cycle may cause drives and power supplies to fail.

> Mark, I think you may have misread the prompt.
If you reboot or power-cycle an SA with a failed or missing drive:
It will prompt you to press F1 or F2, and F2 is the default after 30 sec.

F1 = Disable affected LUNs: they will only be disabled until the next boot, and you will be prompted again.

F2 = Remain in interim recovery mode: the LUNs will be active/running, and as soon as you install a new drive, the rebuild will start automatically.

If you incorrectly select F1 and you can't boot, then simply reboot the server and select F2.
If you select F1 and only data LUNs are disabled, then you can re-enable them online using the ACU.
