
Dualing SmartArray controllers and bootloader (RHEL 4)

 
Jared Middleton
Frequent Advisor

Problem: Boot displays "Attempting boot From Hard Drive (C:)" and hangs.

HP ProLiant DL580 G5
Smart Array E200
Smart Array P600

The E200 is, I think, the 'included' (integrated) controller. It (PCI-Express, in Slot 3) is intended to control the two internal SAS drives (RAID1)... where the Linux OS will be installed.

The P600 (64-Bit PCI, in Slot 4) was an add-on to control the external 10-disk SAS HDD array, various RAID1 and RAID1+0 combos... where databases and applications will be installed.

First symptom: noticed within SmartStart CD's Array Configuration Utility ("More Information") that the Logical Device Name for the E200 was: /dev/cciss/c1d0 and the P600 had: /dev/cciss/c0d0, /dev/cciss/c0d1, /dev/cciss/c0d2. This was different, because the other 5 Linux boxes I admin all had the lower number (c0d0) assigned to the internal controller. But, I said to self 'whateveeeer'.

In BIOS, changed controller boot order so that the E200 was before the P600.

Ran a quick "test" Linux install (GUI interactive) just as far as Disk Druid to see how it viewed the logical disks. Sure enough, it saw the E200 as c1d0.

I adjusted my Kickstart script partitioning specs accordingly, such that the OS file systems (/boot, swap, /, etc.) would all use the logical disk on controller c1d0.

Now, skipping forward: reboot, install via kickstart, reboot... encountered problem noted at start.

Post-mortem diagnosis: examination of the system via boot DVD and "linux rescue" mode reveals the GRUB config referencing "(hd3,0)", with "cciss/c0d0" in the comments.

Long story short... If RHEL installed a bootloader (GRUB) at all, I believe it was on the (P600) external drive array controller, the one set as 2nd in BIOS boot order but which had the c0d0 logical device name... and maybe even on the 3rd logical drive of that?!?!

I did lots of Google searching, experimentation, and hacking to try to get this system to boot, all with no success. I finally called my on-site guy (another country) and asked him to remove the P600 controller.

Rebooting into SmartStart now showed the E200 array with the more normal device ID: /dev/cciss/c0d0. So, after revising my Kickstart cfg back to use c0d0, I proceeded to re-install RHEL Linux from DVD and the install went perfectly. The system now reboots and loads the OS off the hard disk successfully.

NOW: What is going to happen when I put the P600 back? Is the system going to re-scan the PCI slots and renumber the controllers (reversing the device IDs) and muck up everything? Will I need to hack OS config files (e.g. /boot/grub/grub.conf, /etc/fstab, etc.) to point to c1d0?
I need a plan of attack! :-)

Thanks in advance,
Jared Middleton
Matti_Kurkela
Honored Contributor
Solution

Re: Dualing SmartArray controllers and bootloader (RHEL 4)

Looks like you're not using LVM.

My experience has been that LVM is not fazed by any changes of device names: if the controller drivers are available, LVM will auto-identify the correct disks/partitions, wherever they might be located. That leaves only the bootloader, and the /boot partition.

There are ways to make a traditionally-partitioned system behave as robustly: do a "man fstab" and search for words "LABEL" and "UUID". You can use these to identify the partitions in a device-agnostic way.
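
As a rough illustration of the LABEL approach, here is a small Python sketch that rewrites device-name entries in fstab text into LABEL= entries. The label names and the device-to-label mapping are made up for this example, and the file systems would need to carry those labels first (for ext3, e2label can set them). Treat it as a sketch of the idea, not a tested migration tool.

```python
def use_labels(fstab_text, labels):
    """Return fstab text with device paths replaced by LABEL=<name>.

    labels maps a device path to a file system label (hypothetical values).
    Comments, blank lines, and devices without a label are left untouched.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        if fields and not fields[0].startswith("#") and fields[0] in labels:
            # Replace only the first field (the device); keep the rest as-is.
            line = line.replace(fields[0], "LABEL=" + labels[fields[0]], 1)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical fstab: OS partitions on the E200 while it is seen as c1d0.
FSTAB = """\
/dev/cciss/c1d0p1  /boot  ext3  defaults  1 2
/dev/cciss/c1d0p3  /      ext3  defaults  1 1
/dev/cciss/c1d0p2  swap   swap  defaults  0 0
"""
# Assumed labels; the swap line has none here and is therefore left alone.
LABELS = {"/dev/cciss/c1d0p1": "/boot", "/dev/cciss/c1d0p3": "/"}

print(use_labels(FSTAB, LABELS), end="")
```

With entries like these, /etc/fstab no longer cares whether the controller shows up as c0 or c1.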

To understand the behavior of the bootloader, you'll need to be aware of some traditional features of the BIOS. The bootloader is not aware of controllers: it just sees a list of disks. The traditional behavior is to assume that the boot disk is the first one on that list. Any deviation from this makes things more complicated to handle.

In fact, the standard way for choosing the disk to boot from at the BIOS level is to manipulate this list, so that the desired boot disk goes at the top of the list. So, changing the boot controller options at the BIOS level is likely to change the way GRUB sees things... and to thoroughly confuse an unprepared sysadmin.

On the other hand, Linux won't necessarily have any information about the disk order as seen by the BIOS and the bootloader. Each driver can choose how to number the devices it handles, but usually the detection of storage devices happens in the PCI bus order. Check the output of the "lspci" command.
(This means the ordering between the P600 and the E200 might be changed by sticking the P600 into a different slot.)

So the GRUB installation program must essentially make some educated guesses. You saw "(hd3,0)" with the comment "cciss/c0d0". This is the installer's documentation about the guesses it made.

The guesses made by the installer:
1.) The /boot partition is located on the first partition of the fourth disk in the BIOS list (GRUB uses zero-based counting).
2.) This disk is known by Linux as /dev/cciss/c0d0.

In a system with multiple disk controllers, the installer can easily get these guesses wrong. In such a system, you may have to help GRUB out: use the "grub --device-map=/boot/grub/device.map" command to enter the GRUB shell. The first time you do this, the GRUB shell will create the /boot/grub/device.map file if it does not already exist. That file allows you to verify and/or correct the guesswork made by the installer. Each line in that file has a GRUB disk identifier and the corresponding Linux device name. If the initial guesses are wrong, edit the device.map file: if the file exists, the GRUB installer will use the information in it instead of guessing.
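
To make the device.map format concrete, here is a small Python sketch that parses such a file and renumbers it so that a chosen Linux device becomes (hd0). The file contents below are a hypothetical example (P600 logical drives as c0d0..c0d2, E200 as c1d0), not taken from the actual system, and the helper names are invented for illustration.

```python
import re


def parse_device_map(text):
    """Parse device.map text into {grub_disk: linux_device}.

    Only (hdN) entries are read; floppy entries, comments, and blank
    lines are ignored.
    """
    mapping = {}
    for line in text.splitlines():
        m = re.match(r"\((hd\d+)\)\s+(\S+)$", line.strip())
        if m:
            mapping[m.group(1)] = m.group(2)
    return mapping


def put_first(mapping, boot_device):
    """Return a new mapping in which boot_device is hd0.

    The other disks keep their relative order. Raises ValueError if
    boot_device is not in the mapping.
    """
    ordered = [mapping[k] for k in sorted(mapping, key=lambda k: int(k[2:]))]
    ordered.remove(boot_device)
    ordered.insert(0, boot_device)
    return {"hd%d" % i: dev for i, dev in enumerate(ordered)}


# Hypothetical device.map as an installer might have guessed it.
DEVICE_MAP = """\
(hd0)     /dev/cciss/c0d0
(hd1)     /dev/cciss/c0d1
(hd2)     /dev/cciss/c0d2
(hd3)     /dev/cciss/c1d0
"""

# Assumption: the BIOS boots from the E200 first, so c1d0 should be (hd0).
fixed = put_first(parse_device_map(DEVICE_MAP), "/dev/cciss/c1d0")
for disk, dev in fixed.items():
    print("(%s)     %s" % (disk, dev))
```

Whether the E200 really belongs at (hd0) depends on the BIOS boot controller order, so verify against the actual boot behavior before editing the real file.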

As you see, the handling of multiple disk controllers in a PC architecture can be a bit of a dark art - and at the moment, I'm getting too tired for a coherent explanation. Please ask for more details if necessary: I'll try to look at this thread again tomorrow.

MK
Jared Middleton
Frequent Advisor

Re: Dualing SmartArray controllers and bootloader (RHEL 4)

Matti, you confirmed some things I'd read or assumed. It might be a day or two before I add the P600 controller back in and report my status/results.

I didn't see any BIOS option for changing order of logical disk/devices (e.g. at the /dev/cciss/cXdX level), only the order of the controllers themselves. It's set to something like:
1) E200 controller <-- internal disks (OS)
2) Integrated IDE controller <-- for DVD-ROM?
3) P600 controller <-- external disk array
4) SCSI controller <-- for tape drive

Note: The other BIOS option is for ordering: CDROM, hard drive, USB, NIC, etc.

My wish was/is for the E200 to boot first, show as cciss/c0XXXX in Linux, and thus match my other systems to avoid potential mistakes down the line. With the P600 present, the P600 got the c0 designation (reverse of what I wanted/expected)... probably based on the slot it's in and/or the PCI bus scan order. I wrongly assumed Linux would reflect the BIOS order (shown above).

At the moment, the system is working fine on just the E200 (as cciss/c0XXXX), but once the P600 is added back (assume: same slot), I expect it might steal the c0 assignment and force me into some device-map tweaking so that GRUB knows where the bootloader is.
Fun Fun. :-)

-Jared
Jimmy Vance
HPE Pro

Re: Dualing SmartArray controllers and bootloader (RHEL 4)

You're running into a PCI enumeration issue that showed up in the 2.6 kernel. The 2.4 kernel did a breadth-first sort of the PCI bus; the 2.6 kernel does a depth-first sort of the PCI bus. An option was added in RHEL4U5 to address the issue. On your kernel boot line add pci=bfsort and the E200 should show up as c0d0. Most people see the bus enumeration issue on the network cards: what they think should be eth0 shows up as eth1. Even with the pci=bfsort option, if at a later date you add a 3rd controller, what was c1d0 might become c2d0, depending on where the new controller shows up on the bus. If you use labels instead of device names, this one shouldn't bite you.
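
To show why the two scan orders can disagree, here is a toy Python model. The bus layout is a made-up assumption (the P600 behind a bridge that is scanned before the E200), not the real DL580 G5 topology, and it does not reproduce the kernel's actual code; it only shows how depth-first and breadth-first walks of the same tree can list the controllers in opposite order.

```python
from collections import deque

# Toy bus tree (hypothetical topology): each node is (name, [children]),
# with children listed in scan order.
BUS = ("root", [
    ("bridge", [("P600", [])]),
    ("E200", []),
])


def depth_first(node):
    """Visit a node, then everything below it, before moving to its sibling."""
    name, children = node
    order = [name]
    for child in children:
        order.extend(depth_first(child))
    return order


def breadth_first(node):
    """Visit all nodes on one level before descending to the next level."""
    order, queue = [], deque([node])
    while queue:
        name, children = queue.popleft()
        order.append(name)
        queue.extend(children)
    return order


def controllers(order):
    """Keep only the two array controllers from a visit order."""
    return [n for n in order if n in ("E200", "P600")]


print("depth-first:  ", controllers(depth_first(BUS)))
print("breadth-first:", controllers(breadth_first(BUS)))
```

In this toy layout the depth-first walk reaches the P600 first, while the breadth-first walk reaches the E200 first, which is the kind of swap described above.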

Red Hat has "whitelisted" some of the systems that this affects, with a patch in pci.c, but the DL580 G5 isn't in the list yet. The patch basically forces the listed systems to use the pci=bfsort option.

To fix your current situation, after you add the P600 back into the mix, try the pci=bfsort option if you're running RHEL4U5 or later. Worst case, you have to boot into rescue mode and edit the files you mentioned.
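
For what it's worth, adding the option amounts to appending it to each "kernel" line in grub.conf. A minimal Python sketch of that edit follows; the grub.conf fragment is a hypothetical example with the kernel version left as a placeholder, and the function name is invented for illustration.

```python
def add_kernel_option(grub_conf, option="pci=bfsort"):
    """Append option to every 'kernel' line that does not already have it.

    All other lines (title, root, initrd, comments) are returned unchanged.
    """
    out = []
    for line in grub_conf.splitlines():
        words = line.split()
        if words and words[0] == "kernel" and option not in words[1:]:
            line = line.rstrip() + " " + option
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical grub.conf fragment; "XX" is a placeholder, not a real version.
GRUB_CONF = """\
default=0
timeout=5
title Red Hat Enterprise Linux AS
        root (hd0,0)
        kernel /vmlinuz-2.6.9-XX.ELsmp ro root=LABEL=/
        initrd /initrd-2.6.9-XX.ELsmp.img
"""

print(add_kernel_option(GRUB_CONF), end="")
```

Running it twice gives the same result as running it once, so the option is never duplicated on a line.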
