Integrity Servers

Steve5
Regular Visitor

BL890c i2 running 11.31 and HPVM 4.20 with 8 VMs, loses network connectivity and then crashes

I thought I had solved this problem by upgrading the patch level to the September 2015 bundle and updating the firmware, but no.

Here's the sequence of events:

  1. The following appears on the console and in syslog:

Jan  2 20:20:19 molhpi24 vmunix: iexgbe4/1689, Microcode assert 00000100 00000020 00000040 00000080 00000100
Jan  2 20:20:19 molhpi24 vmunix: iexgbe4/1701, Microcode assert 0x100
Jan  2 20:22:56 molhpi24 vmunix: Dead gateway detection can't ping the last remaining default gateway at 0xa082001 .See ndd -h ip_ire_gw_probe for more info
Jan  2 20:23:16 molhpi24 xntpd[2545]: synchronisation lost
Jan  2 20:37:56 molhpi24 vmunix: Dead gateway detection can't ping the last remaining default gateway at 0xa082001 .See ndd -h ip_ire_gw_probe for more info
Jan  2 20:38:16 molhpi24  above message repeats 5 times
Jan  2 20:38:16 molhpi24  above message repeats 7 times

This has happened probably a dozen times in the past couple of years.  Of course, when the VM host loses network connectivity, so do the VMs.  I've tried everything I can think of, but within maybe 15 minutes of my trying to shut down the VMs, the blade server crashes.  And sometimes it leaves a mess.

Thanks for reading this.  Anyone have any ideas?

Thanks,

Steve

6 REPLIES
Bill Hassell
Honored Contributor

Re: BL890c i2 running 11.31 and HPVM 4.20 with 8 VMs, loses network connectivity and then crashes

This is a well-known (but very bad) default for HP-UX.

The network code regularly pings routers to see if they are alive (even though ping is a primitive and unreliable test). When a router fails to respond, the network code assumes that the router is dead and stops using that route (an even more useless action). It is not unusual for the network team to disable ICMP responses (i.e., ping), but with this gateway setting, all HP-UX routed traffic is halted because of a missed ping. Rebooting restores the connection again.

You need to set the dead gateway detect to off on *every* HP-UX server you have.
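For an immediate fix on a running system (no reboot required), you can turn the probe off on the fly with `ndd -set`; this uses the same `/dev/ip` device and parameter name as the permanent nddconf entry, but the setting is lost at the next reboot:

```shell
# Disable dead gateway detection immediately (runtime only --
# does not survive a reboot; use nddconf for a permanent change)
ndd -set /dev/ip ip_ire_gw_probe 0
```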

To make the change permanent, edit the file /etc/rc.config.d/nddconf and add this:

 

TRANSPORT_NAME[0]=ip
NDD_NAME[0]=ip_ire_gw_probe
NDD_VALUE[0]=0

The above assumes that there are no [0] entries already in use in this script. If there are, use the next available array reference such as [1] or [2].

Then run: 

ndd -c

which reads the file and performs the settings.

This sets the value to 0 and also validates that the file is of the proper format.
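To confirm the setting actually took, query it back; a sketch (assuming the standard `ndd -get` syntax on 11.31):

```shell
# Query the current value of the probe tunable -- it should be 0
# once dead gateway detection has been disabled
ndd -get /dev/ip ip_ire_gw_probe
```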

(Did I mention that *every* HP-UX server including vPars and VMs (any OS version) needs this fix?)



Bill Hassell, sysadmin
Steve5
Regular Visitor

Re: BL890c i2 running 11.31 and HPVM 4.20 with 8 VMs, loses network connectivity and then crashes

Hi Bill,

Thanks, I had never heard of that fix before, but it makes sense.  I have implemented it on all systems.

Any ideas on what is causing the port to go offline?

Thanks,

Steve

Bill Hassell
Honored Contributor

Re: BL890c i2 running 11.31 and HPVM 4.20 with 8 VMs, loses network connectivity and then crashes

As mentioned, HP-UX will stop using all routes when all the routers fail to respond to ping. Technically, the system is not offline, as it will still respond to other systems on the same subnet. Systems on other subnets require a router (gateway) to communicate, and that will fail once routing has been disabled.

As for why this feature even exists: that has never been explained, to my knowledge.



Bill Hassell, sysadmin
Steve5
Regular Visitor

Re: BL890c i2 running 11.31 and HPVM 4.20 with 8 VMs, loses network connectivity and then crashes

Sorry, I wasn't clear.

Neither the VMs nor the host is pingable from another system on the same network.  I believe the network port itself is down.

Bill Hassell
Honored Contributor

Re: BL890c i2 running 11.31 and HPVM 4.20 with 8 VMs, loses network connectivity and then crashes

So the first question is: can you connect to your console port? This is a separate network connection with a separate IP address, and it is almost impossible to troubleshoot the problem without it. Since it is an embedded microcomputer, it is unaffected by HP-UX problems such as dead gateway detect. It is often referred to as the iLO port. When you connect to the port (telnet), you can view hardware status and logs and also connect to the HP-UX console. From there you can verify the state of the network connection. The OA (Onboard Administrator) can set up the console IP addresses for each blade. You can also connect to the console through the blade's KVM/iLO port (special dongle required); this is a serial connection.
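Once you are on the console, a few standard HP-UX commands will show whether the physical link itself is down. A sketch (the PPA number below is an example -- substitute the PPA of the iexgbe port reporting the microcode asserts):

```shell
# List all LAN interfaces with hardware path, PPA and MAC address
lanscan

# On 11.31, nwmgr summarizes the current state of each interface
nwmgr

# Per-interface detail (speed, duplex, link state) for PPA 0
lanadmin -x 0
```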



Bill Hassell, sysadmin
Steve5
Regular Visitor

Re: BL890c i2 running 11.31 and HPVM 4.20 with 8 VMs, loses network connectivity and then crashes

Hi Bill,

I was able to access the blade's console and observed a recurring error message that the system could not reach its NIS server.  This server is on the same network as the blade.  I then tried pinging other hosts on the same network, with the same result: no response.

Here are the relevant logs from the Virtual Connect:

2018-01-03T07:03:21-05:00 VCEFTW20120152 vcmd: [SVR:enc0:dev1:5016:Minor] Server state DEGRADED : Component partially operational, but capacity lost, Previous: Server state OK
2018-01-03T07:03:21-05:00 VCEFTW20120152 vcmd: [ENC:enc0:2014:Minor] Enclosure state DEGRADED : Some Enet modules & servers not OK, Previous: Enclosure state OK
2018-01-03T07:03:21-05:00 VCEFTW20120152 vcmd: [VCD:HPBC1_vc_domain:1024:Minor] Domain state DEGRADED : 1+ enclosures & profiles OK, DEGRADED, UNKNOWN, NOT-MAPPED, Previous: Domain state OK
2018-01-03T07:03:28-05:00 VCEFTW20120152 vcmd: [VCD:HPBC1_vc_domain:1032:Warning] VCM remote session is invalid or has expired : hpvcd:showManagedObjects ([UNKNOWN]@[LOCAL])
2018-01-03T07:03:28-05:00 VCEFTW20120152 vcmd: [VCD:HPBC1_vc_domain:1032:Warning] VCM remote session is invalid or has expired : hpvcm:retrieveStateChangeCounters ([UNKNOWN]@[LOCAL])
2018-01-03T07:04:24-05:00 VCEFTW20120152 vcmd: [SVR:enc0:dev1:5012:Critical] Server state FAILED : Component is not operational due to an error, Previous: Server state DEGRADED
2018-01-03T07:04:24-05:00 VCEFTW20120152 vcmd: [PRO:molhpi24-BL890c-i2:6012:Critical] Profile state FAILED : Server [enc0:devbay1] state not OK: [VCM_OP_STATE_FAILED], Previous: Enet Network state OK
2018-01-03T07:12:20-05:00 VCEFTW20120152 vcmd: [SVR:enc0:dev1:5010:Info] Server state OK : Component fully operational, Previous: Server state FAILED
2018-01-03T07:12:20-05:00 VCEFTW20120152 vcmd: [NET:molhpi24-BL890c-i2:7010:Info] Enet Network state OK : All connections, PhysicalServer OK, Previous: Profile state FAILED
2018-01-03T07:12:20-05:00 VCEFTW20120152 vcmd: [ENC:enc0:2010:Info] Enclosure state OK : All modules & servers OK, Previous: Enclosure state DEGRADED
2018-01-03T07:12:20-05:00 VCEFTW20120152 vcmd: [VCD:HPBC1_vc_domain:1020:Info] Domain state OK : All enclosures & profiles OK, Previous: Domain state DEGRADED

Thanks,

Steve