
DL580 G5 disconnecting a NIC causes system reset

 
John McNulty_2
Frequent Advisor

We have a strange problem on at least two DL580 servers, possibly two more as well but I've not confirmed it yet.

All systems are configured the same, with 2 x HP NC360T PCI Express Dual Port Gigabit Server Adapters in slots 6 and 8. The systems are running Red Hat 5.3. The adapter part number is: 412648-B2.

All 6 NIC ports are in use (the four NC360T ports plus two bnx2 ports, eth4 and eth5), configured into three bonded (teamed) pairs. Due to the scan order at boot time (right to left), NIC 6A is the first port seen, so that's assigned to eth0. 8A is assigned to eth2.

When performing resilience tests we discovered that disconnecting eth0 sometimes causes the system to reset. There's one message on the console, the usual one indicating the NIC has gone down, and then the console freezes instantly and the system resets and reboots. There are no operating system crash, panic or oops messages; nothing. It's really strange. If it were a driver bug of some kind I would expect to see at least something, but this looks like it's happening at a very low level.

It's not 100% repeatable. Sometimes it works fine and the active NIC in the bonded pair transfers to the other NIC, but sometimes it happens a couple of times in a row after the system has rebooted.

Red Hat sees the cards as:

description: Ethernet interface
product: 82571EB Gigabit Ethernet Controller
vendor: Intel Corporation
physical id: 0
bus info: pci@0000:13:00.0
logical name: eth0
version: 06
serial: 00:24:81:7d:fd:32
size: 1GB/s
capacity: 1GB/s
width: 32 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list rom ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=e1000e driverversion=0.4.1.12-NAPI duplex=full firmware=5.1 1-2 latency=0 link=yes multicast=yes port=twisted pair slave=yes speed=1GB/s
resources: irq:147 memory:fdae0000-fdafffff memory:fdac0000-fdadffff ioport:7000(size=32) memory:d1500000-d151ffff (prefetchable)

An example output from ethtool is:

# ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pumbag
        Wake-on: g
        Current message level: 0x00000001 (1)
        Link detected: yes


They are bonded together in Red Hat thus:

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: transmit load balancing
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:24:81:7d:fd:32

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:24:81:7d:fb:80
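
For anyone repeating this kind of cable-pull test, the status file above can be boiled down to the few fields that matter for failover. The following is only a minimal sketch: the function name bond_summary is hypothetical, and it assumes the file layout matches the bonding driver output shown above (other driver versions may add or rename fields).

```shell
#!/bin/sh
# Summarise a bonding status file: mode, active slave, per-slave state.
# bond_summary is a hypothetical helper name, not part of any package.
# Usage: bond_summary [file]   (defaults to /proc/net/bonding/bond0)
bond_summary() {
    awk -F': ' '
        /^Bonding Mode:/           { print "mode=" $2 }
        /^Currently Active Slave:/ { print "active=" $2 }
        /^Slave Interface:/        { slave = $2 }
        # Only lines that follow a Slave Interface header describe a slave;
        # the bond-level MII Status line appears before any slave is seen.
        slave != "" && /^MII Status:/         { status = $2 }
        slave != "" && /^Link Failure Count:/ { print slave " status=" status " failures=" $2 }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Calling bond_summary before and after pulling a cable shows whether the active slave moved to the other NIC and whether the link failure count went up.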

The cards are configured into the kernel as follows (from /etc/modprobe.conf):

alias eth0 e1000e
alias eth1 e1000e
alias eth2 e1000e
alias eth3 e1000e
alias eth4 bnx2
alias eth5 bnx2
alias bond0 bonding
options bond0 miimon=100 mode=5
alias bond1 bonding
options bond1 miimon=100 mode=5
alias bond2 bonding
options bond2 miimon=100 mode=5

7 REPLIES
Viveki
Trusted Contributor

Re: DL580 G5 disconnecting a NIC causes system reset

Hi John,

I may be wrong, but you should check the health of the power sources on site. We saw this kind of issue once, when there was a problem with the earth voltage on either the system side or the switch side.

This is just for your information; it may be worth a try.
John McNulty_2
Frequent Advisor

Re: DL580 G5 disconnecting a NIC causes system reset

Ok. Thanks for the tip, will look into that.

Some more information: HP sent a replacement NIC card for one of the systems. I didn't think it would make any difference, but tried replacing it anyway. No change.

We tried disconnecting the NIC cables on all the other ports eth1-eth5 repeatedly and were not able to reproduce it, but each time we pulled eth0 the system reset.

So it only seems to manifest on the first port the system scans at boot time: slot 6, port A.

John McNulty_2
Frequent Advisor

Re: DL580 G5 disconnecting a NIC causes system reset

We've had the cabinet earthing checked and it wasn't that good, so we had earth straps fitted to all the cabinets. No change.

We've tested with just the mains power connected, and also with just the UPS power connected. No change.

We've switched from using bonding mode=5 (transmit load balancing) to mode=1 (active-backup). No change.

However, if we first disable eth0 in software (ifdown eth0) before popping the cable out, then the system stays up. And it seems you can pop the cable in and out till Christmas with no issues.

I've tried disabling bond0 and running just eth0 on its own, and the issue goes away too. So on the face of it, it's looking more like a Red Hat kernel/driver issue that manifests with bonding and with this particular mix of hardware.
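
The software-disable workaround described above can be wrapped so that the interface is always taken down before anyone touches the cable. This is just a rough sketch under stated assumptions: the function names prepare_unplug and restore_after_replug and the RUN variable are hypothetical, and bringing the port back with ifup after reconnecting is an assumption (only ifdown before unplugging was actually tested in this thread). It is dry-run by default, so it only prints the commands.

```shell
#!/bin/sh
# Dry-run by default: RUN=echo prints each command instead of executing it.
# Set RUN="" (as root) to really run ifdown/ifup.
# RUN and both function names are hypothetical, for illustration only.
RUN=${RUN-echo}

# Take the interface down in software before the cable is pulled.
prepare_unplug() {
    $RUN ifdown "$1"
}

# Bring the interface back up once the cable is reconnected (assumed step).
restore_after_replug() {
    $RUN ifup "$1"
}
```

For example, prepare_unplug eth0 would be run before pulling the cable on slot 6 port A, and restore_after_replug eth0 once it is plugged back in.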
John McNulty_2
Frequent Advisor

Re: DL580 G5 disconnecting a NIC causes system reset


I've just noticed there's an HP e1000e driver update dated 14th Sept available for download. I'm installing that now.
John McNulty_2
Frequent Advisor

Re: DL580 G5 disconnecting a NIC causes system reset

The e1000e driver has been updated to 1.0.2.3-NAPI. Repeated the test this morning: no change.
John McNulty_2
Frequent Advisor

Re: DL580 G5 disconnecting a NIC causes system reset


Today we've moved the card in slot 8 to slot 10, and the card in slot 6 to slot 8. This puts the offending NIC port (6A) into a known-working position (8A) and also eliminates one PCI-E bus completely.

Re-ran the test. No change.

The only thing I can think of now is to try to source a different dual port card with a different chipset that requires a different driver.

Has anyone got any other ideas? I'm running out of them quickly.

John McNulty_2
Frequent Advisor

Re: DL580 G5 disconnecting a NIC causes system reset


Good news.

Today we got an NC382T replacement to try out. So we put the cards back in their original slots with the first NC360T replaced with the NC382T Broadcom card. I installed the latest netxtreme2-5.0.17-1 driver and re-ran the tests (lots of times).

The problem has completely gone away.