04-16-2013 05:32 PM
We have a DL320 G5 with embedded "softraid" (ICH7R) on two 160GB SATA disks in RAID1. This is the default setup for Cisco Call Manager (CUCM - 7800 series) running a flavour of *nix (ugly softraid!). One of the disks failed and, surprisingly, the whole system went down in flames (kernel stuck in an endless loop retrying a write operation). Upon forced reboot, even more surprisingly, the BIOS did not install INT13 (!), so there was no way for the machine to boot from either drive. We ordered a replacement 160GB drive, swapped it for the failed one and assigned it as a spare in the BIOS, thinking that the system would boot and start the rebuild automatically. We were wrong: the BIOS still doesn't install INT13, so the system fails to boot. Pressing CTRL+R inside the SATA RAID BIOS to initiate the rebuild fails with an error saying to check spare drives and array type. It looks like the BIOS won't install INT13 unless the RAID1 volume is sound!
The system BIOS is updated to the latest version (W04) and no other updates are available.
Booting with a Knoppix LiveCD detects the drive and all partitions correctly - all data is there on the remaining drive.
The BIOS will only install INT13 if the new drive is not assigned as a spare but left as is: the BIOS then adds it automatically as JBOD at boot and installs INT13. However, as the drive is empty, it will of course fail to boot. I cloned the good RAID1 drive to the new empty drive using ddrescue. After the clone, the controller naturally sees the new drive as identical to the source, i.e. as a member of a failed RAID1 volume, and it fails to boot. The only way to boot this drive is to disable SATA RAID in the BIOS. This way, with just this drive in the bay, the system loads GRUB and boots the kernel; however, it then kernel panics as it can't find /dev/sda1 (which is the correct root partition to be mounted, btw). A few lines earlier it complains that the adpahci.o module won't load. I suspect this is because the RAID option has been disabled in the BIOS: without this driver the kernel, as compiled, cannot detect the root partition, so it cannot mount it and panics. It's curious that the Knoppix LiveCD detects the partitions correctly without any particular parameters, but I suspect that is because it has broader hardware support.
Now my question is: with a failed drive in a RAID1 volume, why doesn't the BIOS install INT13 to begin with? Of course the volume is degraded, but it is not dead: one drive is perfectly functional! Is this a major bug? I have used softraid setups before (reluctantly, as they are pretty unstable) and the rebuild of the degraded volume usually starts AFTER the system has rebooted, either through Intel's Matrix Storage utility or automatically, depending on the version. In Linux this is usually handled by the mdraid utility, but still, the SYSTEM MUST BOOT before one can do anything. At worst, the system won't boot if one forgets to install GRUB on the second disk in the RAID1 volume, which is typical - but that's another story, as in that case INT13 is installed and the BIOS handles the first cold-boot calls to load the loader.
At the moment, the only way out I see is to get the second disk cloned completely (including the RAID metadata of the first) and edit that metadata with a hex editor to fix the failed flags and add the correct disk ID - not an easy task without knowing exactly which bytes to modify and how to correct the checksum. Alternatively, if I delete the RAID1 volume, all the RAID metadata will be wiped together with the partition table, but I can back up the partition table with dd to a file on a USB key, so it would be possible to wipe the volume, recreate it from the BIOS and then restore the partition table to both disks. Theoretically this would work, as the data itself should be intact (partition info/MBR reside in the first 512 bytes of the disk). I have already tried wiping the RAID volume on the cloned disk and restoring the partition info successfully - the SATA RAID BIOS sees the disk as a simple disk with no metadata and all data is intact.
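For anyone wanting to look at what actually sits in those first 512 bytes before touching anything, here is a minimal sketch. It assumes the classic MBR layout (boot code in bytes 0-445, four 16-byte partition entries in bytes 446-509, the 55 aa boot signature in bytes 510-511). The file name mbr-sample.bin is hypothetical, and the sketch builds a blank sample sector itself so it can be run without going near a real disk.

```shell
#!/bin/sh
set -eu

# Hypothetical file name. On a real system this would be a sector-0 backup,
# e.g. one made with: dd if=/dev/sda of=mbr-sample.bin bs=512 count=1
MBR=mbr-sample.bin

# Build a blank 512-byte sector with only the 0x55 0xAA boot signature set,
# so the sketch runs without touching a real disk.
dd if=/dev/zero of="$MBR" bs=512 count=1 2>/dev/null
printf '\125\252' | dd of="$MBR" bs=1 seek=510 conv=notrunc 2>/dev/null

# Bytes 510-511: boot signature, expected to read 55 aa.
sig=$(dd if="$MBR" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
if [ "$sig" = "55aa" ]; then
    echo "boot signature present"
else
    echo "boot signature missing: $sig"
fi

# Bytes 446-509: the four primary partition entries, 16 bytes per output line.
# -v stops od from collapsing identical (e.g. all-zero) lines into "*".
dd if="$MBR" bs=1 skip=446 count=64 2>/dev/null | od -An -v -tx1
```

On a real backup, the four lines of the partition dump should not all be zero; if they are, the saved sector does not contain a usable partition table.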
Still, I think this is a major pain in the ass, as I would have thought that someone at HP had tested a failed RAID1 situation on this server (Linux or Windows doesn't matter: if INT13 doesn't get installed, nothing short of a miracle will make the system boot). Before I go ahead and wipe the metadata/RAID1 volume and recreate it, restoring the partition info, I'd like a confirmation, any other suggested method, or feedback. I see this as a last-resort hack to bypass this major problem.
Thanks in advance for any feedback.
04-16-2013 05:53 PM
Re: DL320 G5 integrated Intel ICH7R SATA RAID - Fails to install INT13 on degraded RAID1 volume!
INT 13 is an interrupt call, it doesn't get installed on anything.
When using "fakeraid" / "softraid" and *nix, you need to manually install the bootloader on the 2nd disk in the mirror. If this isn't done you pretty much end up in the situation you're in now. You might try using your liveCD to install the bootloader on the good disk.
If your other disk hadn't failed, your liveCD would see 2 disks with the same information; the special driver is what makes the RAID function.
My standard disclaimer:
Always have a good backup before you start working with disks and arrays
No support by private messages. Please ask the forum! I work for HPE
04-23-2013 03:05 PM - edited 04-23-2013 03:11 PM
Solution
Firstly, thank you for taking the time to reply. I have recovered the situation using a hack or workaround which I have used before on softraids and some hardraids (I loved the NetRAID utils), though I did not want to risk trying it here - but it worked. Let me explain myself better.
Naturally INT13 doesn't physically get installed: it is the controller's BIOS that installs itself in memory in order to handle the INT13 disk calls. I phrased that wrongly because most BIOSes, such as those of typical Adaptec SCSI HBAs, state "INT13 not installed" instead of "BIOS not installed on INT13" - sorry.
I must say that I have worked extensively with HBAs and RAID volumes (even the first HBAs on ISA buses), so I am privy to how these things work.
Having cleared this up, if the HBA's BIOS does not get loaded to handle INT13 ("installed on INT13") there is no way the volume can boot - whether you have any loader on any device or not. The controller in my case did see the degraded RAID1 volume but it did not "install" itself to handle the system BOOT. Legacy IDE adapters are handled slightly differently, but we are talking about a SATA (similar to SCSI handling regarding the calls by "additional" code) adapter.
GRUB was installed on both disks: I have handled many md/lvm volumes on Linux installations with the menu.conf file (or lilo) correctly configured with both root mount devices specified in order to quickly reboot with the good drive using the menu (all raid handled by the OS, no hybrid solution). I have also used pure hardware RAID (smartarray cciss and Netraid, Promise, etc) setups without problems.
This HP server is a pre-prepared "appliance" that comes shipped from Cisco for its Unified Communications software (Cisco Call Manager). This can come in different flavours, such as on Windows Server or *nix, on HP or IBM servers, depending on the required specs. Specifically, this model came with a Red Hat based *nix with a custom kernel built by HP and the Cisco CCM software on a RAID1 softraid using the embedded ICH7 (pseudo) RAID support. HP decided to use the "adpahci" (Adaptec) AHCI module to handle the softraid calls to the ICH7 RAID controller in this case.
My client had a twin configuration with 2 of these identical servers in a CCM cluster, hence no downtime or data loss ever experienced. The cluster worked fine and the other machine took over all the 90+ phone registrations when the first one failed.
For community informational purposes I will state my workaround on how I got the RAID1 up and running again.
DISCLAIMER - Use this information at your own risk. Several factors can differ between systems, and/or you can make mistakes that render any further recovery impossible. If you have no familiarity with these commands, ask an experienced tech for help.
In my case, I did this using a Knoppix LiveCD, which I often use for booting "generic" systems where I have no specific rescue disk, and which works very well (it works fine even through the graphical ILO, but remember to disable DRM on boot if you prefer GUI environments).
The steps I took are:
- cloned the good disk on the new disk (I suggest using ddrescue, very fast and effective in handling possible read errors - however the source disk was intact and clean - keep this in mind before cloning)
- backed up the MBR of the good disk on a USB key (not going into technical details of loader sectors and partition sectors, just save the first 512 bytes, i.e. track 0 sector 0, using dd with the count and bs options - plenty of references on how to do this on the net)
- *destroyed* the RAID1 Metadata by deleting the RAID1 config from the BIOS
- recreated the exact same RAID1 config from the BIOS selecting QUICKINIT - MOST IMPORTANT! This way the previous good disk and the new disk get the RAID metadata (checksum and signature information) correctly rewritten to the metadata sectors, but NO DATA SECTORS are overwritten IF the RAID configuration is identical and the DISK size does not change (beware - this may not work for all controllers, for reasons that go beyond the scope of this post!)
- booted back to the LiveCD and, while doing so, checked that the BIOS successfully installed on INT13, which it now did, reporting a healthy RAID1 volume
- restored the previously saved MBR to BOTH disks and checked (fdisk -l) that the partitions were alive and well (gparted) by mounting and fscking them. All clean. If any errors occur DO NOT proceed to boot. Fix the errors on the source (good) disk and redo the clone.
- Reinstalled GRUB on both disks (grub-install) specifying correct root devices which in my case were /dev/sda and /dev/sdb (check grub documentation)
- Removed live CD and rebooted. System went up as if it never had crashed.
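The MBR save/restore part of the steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a transcript of what I typed: the image files demo-good.img and demo-new.img are hypothetical stand-ins for the real devices (/dev/sda and /dev/sdb in my case), cp stands in for the ddrescue clone, and zeroing sector 0 stands in for the partition table being lost when the RAID1 volume is deleted and recreated. Running it against image files means nothing real gets touched.

```shell
#!/bin/sh
set -eu

# Hypothetical names: image files stand in for the real disks.
GOOD=demo-good.img
NEW=demo-new.img
MBR=demo-mbr.bin      # on the real system this file went to a USB key

# Fake 1 MiB "good disk": random sector 0, zeroed remainder.
dd if=/dev/urandom of="$GOOD" bs=512 count=1 2>/dev/null
dd if=/dev/zero of="$GOOD" bs=512 seek=1 count=2047 conv=notrunc 2>/dev/null

# Clone good -> new (cp stands in for the ddrescue clone).
cp "$GOOD" "$NEW"

# Save the first 512-byte sector of the good disk.
dd if="$GOOD" of="$MBR" bs=512 count=1 2>/dev/null

# Stand-in for deleting/recreating the RAID1 volume: sector 0 is lost.
for d in "$GOOD" "$NEW"; do
    dd if=/dev/zero of="$d" bs=512 count=1 conv=notrunc 2>/dev/null
done

# Write the saved sector back to BOTH disks and compare against the backup.
for d in "$GOOD" "$NEW"; do
    dd if="$MBR" of="$d" bs=512 count=1 conv=notrunc 2>/dev/null
    dd if="$d" bs=512 count=1 2>/dev/null | cmp - "$MBR"
    echo "$d: sector 0 restored"
done
```

On real hardware the if=/of= arguments would be the block devices, and a mistake in either one overwrites the wrong disk, so double-check them (and the backup file size, which should be exactly 512 bytes) before pressing enter.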
Additionally, I suggest booting with the root=LABEL=/ option to select the root partition to mount in the GRUB menu (there is already a kernel entry with this config, though it is not the default, depending also on the version of your CCM). This way you are certain to mount the correct root partition, as the label is the same on both disks (the device names differ, as one is /dev/sda1 and the other is /dev/sdb1, though both are labelled as root).
If you ever need to recover a RAID1 that fails to boot for the above reasons, so that a rebuild is impossible, and you have no other option (and have your data backed up and want to cut the recovery/restore or reinstall time), then do this at your own risk and keep your fingers crossed.