You might want to run these commands now, while the system is working OK, to gather information:
ethtool -i eth0
ethtool -i eth1
lsmod | grep -e netxen_ -e nx_
Save the output of these commands to a file, e.g. "works.txt".
If the problem reappears, run the commands again, and save the results to another file, e.g. "fails.txt".
Comparing these two files might give a lot of clues about what is going wrong. The appropriate parts of the "dmesg" command output and /var/log/messages in the failing case wouldn't hurt either...
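If you want to script the steps above, something like this might do as a rough sketch. The function name snapshot_nic_state is just something I made up for illustration; it runs the same three commands and labels each section so the two files line up nicely when compared.

```shell
# Hypothetical helper: collect NIC driver information into one file.
# Runs the same commands as listed above; errors are captured in the
# file too, so a missing interface or tool still leaves a visible trace.
snapshot_nic_state() {
    out="$1"
    {
        for nic in eth0 eth1; do
            echo "== ethtool -i $nic =="
            ethtool -i "$nic" 2>&1 || true
        done
        echo "== lsmod | grep -e netxen_ -e nx_ =="
        lsmod 2>&1 | grep -e netxen_ -e nx_ || true
    } > "$out"
}
```

While things work, run "snapshot_nic_state works.txt". When the problem is back, run "snapshot_nic_state fails.txt" and then "diff -u works.txt fails.txt" to see what changed.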
Based on what I've seen so far, I might *guess* the following:
- The driver for the 10GbE network adapter is apparently built up from several kernel modules, not just one. This is common for more complex hardware.
- Your system seems to have two sets of drivers for your 10GbE network adapter: the netxen_* set of modules, and the nx_* set. If the system tries to mix modules from these two sets, it will probably fail: it should load modules exclusively from either the netxen_* set or the nx_* set.
As far as I can determine, the netxen_* set is the standard version included in the RHEL 5 distribution, and the nx_* set is provided by the HP driver RPM.
- The automatic configuration tools (like kudzu in RHEL) might have a built-in preference for the drivers included in the RHEL distribution, even though the HP-provided drivers might be better.
The "preference" might also be an accidental effect, caused by the loading order of things:
- When you install an updated kernel RPM, the modules in the HP driver RPM will need to be recompiled to match the updated kernel. Fortunately, the driver RPM probably includes a script that will do this automatically as necessary while the system is booting. But...
- If kudzu runs in the boot-up sequence *before* the module-recompilation script, it might "think" that your current NIC driver configuration is wrong (because the correct set of NIC drivers has not been recompiled yet), and start adjusting it... causing the configuration to break.
- Once the modules have been successfully recompiled and the sysadmin has fixed the configuration, the system will again be able to reboot without issues... until the next kernel upgrade happens.
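To test my guess about mixed module sets, you could check which of the two sets is actually loaded. This is only a sketch: the function name check_driver_mix is made up, and it looks at nothing but the netxen_ and nx_ name prefixes mentioned above.

```shell
# Hypothetical helper: read "lsmod" output on stdin and report which
# driver set(s) are loaded, judging by module name prefix only.
check_driver_mix() {
    has_netxen=no
    has_nx=no
    # The first column of lsmod output is the module name.
    while read -r name _rest; do
        case "$name" in
            netxen_*) has_netxen=yes ;;
            nx_*)     has_nx=yes ;;
        esac
    done
    if [ "$has_netxen" = yes ] && [ "$has_nx" = yes ]; then
        echo "mixed"
    elif [ "$has_netxen" = yes ]; then
        echo "netxen"
    elif [ "$has_nx" = yes ]; then
        echo "nx"
    else
        echo "none"
    fi
}
```

Run it as "lsmod | check_driver_mix". If it prints "mixed" in the failing case, that would support the theory that modules from both sets are being loaded together.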
Kudzu tries to add a bit of Artificial Intelligence to the hardware configuration of RHEL, but I've found it sometimes turns into Artificial Stupidity instead :)
If you want to use the HP-provided drivers, you may have to disable kudzu to stop it from changing the configuration on its own:
chkconfig kudzu off
A more appropriate fix might be to tweak the start-up order of kudzu vs. the HP RPM recompilation script, but I would want to know more about the situation before doing that.
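Before changing anything, you can at least look at the current start-up order. The sketch below only lists things and changes nothing. It assumes the usual SysV layout where start scripts are the S* entries in the runlevel directory, and it assumes runlevel 3 (/etc/rc3.d) unless you give another directory. The function name show_start_order is made up.

```shell
# Hypothetical helper: print the start scripts of one runlevel in the
# order the shell expands S* (sorted by name, so by the two-digit number).
# Argument: runlevel directory; /etc/rc3.d is an assumed default.
show_start_order() {
    dir="${1:-/etc/rc3.d}"
    for f in "$dir"/S*; do
        # Skip the unexpanded pattern if the directory has no S* entries.
        [ -e "$f" ] || [ -L "$f" ] || continue
        basename "$f"
    done
}
```

Run "show_start_order" and find kudzu in the list, then look for whatever script the HP driver RPM installed (I don't know its name). If kudzu comes first, that would fit the theory above. "chkconfig --list kudzu" shows in which runlevels kudzu is currently enabled.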
MK