Operating System - HP-UX


 
SOLVED
Jeromejay
Advisor

VM loses network connection

Hi all,

 

We built two VMs on a BL860C blade, and everything works fine: both are fully configured and running, with no issues there.

However, after some time (anywhere between 2 hours and a few weeks), one of the two VMs loses its network connection completely.

 

Things we have seen:

- from the VM console, running /sbin/init.d/net stop and then /sbin/init.d/net start does not solve the issue

- rebooting the VM solves the issue

- restarting the virtual switch solves the issue

 

 

I don't think this issue can be solved on the spot with this information (plus the details below), but my question is then:

=> what more can we check?

We've checked the logs (see below), NIC status, IP status... and we can't find anything relevant.

i.e.: do you have specific commands for network troubleshooting we could use?

 

Thanks for your help !

 

 

Some more info:

 

There is almost nothing in the logs on either side, except for this entry in the VM syslog, which seems to be a result of the issue:

Jan 30 09:54:21 soem2 vmunix: Dead gateway detection can't ping the last remaining default gateway at 0xa63xxc01 .See ndd -h ip_ire_gw_probe for more info
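
The gateway in that message is printed as a 32-bit hexadecimal value rather than in dotted-decimal form. A minimal sketch for converting such a value, assuming a POSIX shell and a full 8-digit address; hex_to_ip is a hypothetical helper name, and the sample address is made up because the one in the log is masked:

```shell
#!/bin/sh
# hex_to_ip: convert a 32-bit hex address (e.g. 0x0a000001) to dotted decimal.
# Hypothetical helper, not part of HP-UX. Assumes exactly 8 hex digits.
hex_to_ip() {
    h=${1#0x}                      # drop an optional 0x prefix
    h=${h#0X}
    # Each pair of hex digits is one octet, most significant first.
    printf '%d.%d.%d.%d\n' \
        "0x$(printf '%s' "$h" | cut -c1-2)" \
        "0x$(printf '%s' "$h" | cut -c3-4)" \
        "0x$(printf '%s' "$h" | cut -c5-6)" \
        "0x$(printf '%s' "$h" | cut -c7-8)"
}

hex_to_ip 0x0a000001    # made-up example address, prints 10.0.0.1
```

Comparing the result with the default route shown by netstat -rn confirms which gateway the message refers to.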

 

 

bash-4.0# nwmgr -c lan0 -v
lan0:
   Interface State =UP
   MAC Address = 0xCA5xxx75B9A
   Subsystem = igssn
   Interface Type = 1000Base-T
   Hardware Path = 0/0/1/0
   NMID = 1
   Feature Capabilities = Physical Interface
                          IPV4 Recv CKO
                          IPV4 Send CKO
                          VLAN Tag Offload
                          64Bit MIB Support
                          IPV4 TCP Segmentation Offload
                          UDP Multifrag CKO
   Feature Settings = Physical Interface
                      IPV4 Recv CKO
                      IPV4 Send CKO
                      VLAN Tag Offload
                      64Bit MIB Support
                      IPV4 TCP Segmentation Offload
                      UDP Multifrag CKO
   MTU = 1500
   Speed = 1 Gbps Full Duplex (Autonegotiation : On)

15 REPLIES
Stan_M
HPE Pro

Re: VM loses network connection

You did not provide the HPVM version, the AVIO driver version, or any details about the interface to which the vswitch is connected.

So we can only speak at a generic level: make sure you have the latest AVIO driver on both host and guest, as well as an up-to-date driver for the underlying physical NIC on the host.

I work for HPE

Re: VM loses network connection


@Jeromejay wrote:

Jan 30 09:54:21 soem2 vmunix: Dead gateway detection can't ping the last remaining default gateway at 0xa63xxc01 .See ndd -h ip_ire_gw_probe for more info

 

 

Looks like you need the following in your /etc/rc.config.d/nddconf :

 

TRANSPORT_NAME[3]=ip
NDD_NAME[3]=ip_ire_gw_probe
NDD_VALUE[3]=0

 

Adjust the index "[3]" so that it follows your existing entries.
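
To find out which index to use, something like the following could work. It is only a sketch, assuming a POSIX shell and that the existing entries use the NDD_NAME[n]= form shown above; next_ndd_index is a hypothetical helper name, and the file is passed as an argument so it can be tried on a copy first:

```shell
#!/bin/sh
# next_ndd_index: print the next unused array index in an nddconf-style file.
# Hypothetical helper; pass the file path so it can be tested on a copy.
next_ndd_index() {
    # Extract n from every uncommented NDD_NAME[n]= line and keep the largest.
    max=$(sed -n 's/^[[:space:]]*NDD_NAME\[\([0-9][0-9]*\)\]=.*/\1/p' "$1" \
          | sort -n | tail -n 1)
    if [ -z "$max" ]; then
        echo 0                 # no entries yet: indices start at 0
    else
        echo $((max + 1))
    fi
}

# Usage (on the system itself):
#   next_ndd_index /etc/rc.config.d/nddconf
```

Commented-out lines are ignored because the pattern only matches lines that begin with NDD_NAME.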

 

Cheers,

Jeromejay
Advisor

Re: VM loses network connection

"make sure to have the latest AVIO driver on both host and guest as well

as up to date driver for the underlying physical NIC on the host"
Thanks: I'm checking now...
That's exactly the kind of advice I needed ;)
Jeromejay
Advisor

Re: VM loses network connection

TRANSPORT_NAME[3]=ip
NDD_NAME[3]=ip_ire_gw_probe
NDD_VALUE[3]=0

From what I understand, the log error message is more a consequence than a cause...
Changing this will only turn off the detection.
Bill Hassell
Honored Contributor
Solution

Re: VM loses network connection

Dead gateway detection...

 

I have found this on dozens of 'hung' systems causing hours of downtime and unnecessary reboots.

 

Turn it OFF!

 

What is happening is that HP-UX pings each of the gateways about every 3-4 minutes, and if the gateway fails to respond (or, more likely, the ICMP packet gets lost), the network is immediately disabled, a very bad thing for any production system. Some network administrators may also decide to turn off ping responses from gateways as a security measure, which means that every HP-UX system with dead gateway detection enabled will disappear from the network. That usually results in mass panic among the end users, and the desperate system administrator will reset (crash) the system to reboot it.

 

This is yet another reason to verify that 100% of your systems have GSP/MP network access, a known-working LAN connection that is *NOT* affected by the dead gateway mess. By logging in over the console, you can determine that the system is NOT hung, but simply off the network.



Bill Hassell, sysadmin
Jeromejay
Advisor

Re: VM loses network connection

Thanks for the full info !

that's appreciated ;)

 

Also: I really thought the error message was a consequence, whereas it's actually the cause...

 

So now, I'm on to re-configuring all our servers :/

Patrick Wallek
Honored Contributor

Re: VM loses network connection

The dead gateway detection turning off the network will definitely cause you problems.

 

I have seen cases where the network was REALLY REALLY busy (in one case, a backup over the network of an NFS-mounted filesystem through a single interface), which likely caused the dead gateway detection ping to fail and thus the network to go down.

 

Basically, this is a heads-up: when you turn off dead gateway detection, you may start seeing other symptoms on this VM that were masked because the network was disabled.

Patrick Wallek
Honored Contributor

Re: VM loses network connection

Additionally, you can manually set the ip_ire_gw_probe value from the command line:

 

# ndd -set /dev/ip ip_ire_gw_probe 0

 

The above will set the value to '0' (disabled).  To check the value:

 

# ndd -get /dev/ip ip_ire_gw_probe

0

 

The nddconf entries given above only take effect when the system is rebooted, which is desirable because the setting is then applied at every boot. But if you can't reboot the system, use the ndd command to set the value now.
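
The two ndd commands can be combined so that the change is verified right after it is made. This is only a sketch, assuming a POSIX shell; disable_gw_probe and the NDD override variable are hypothetical names added for illustration (the override allows a dry run with a stand-in command):

```shell
#!/bin/sh
# Disable dead gateway detection on the running system and confirm the result.
# NDD can be overridden for a dry run; by default the real ndd command is used.
NDD=${NDD:-ndd}

disable_gw_probe() {
    "$NDD" -set /dev/ip ip_ire_gw_probe 0 || return 1
    current=$("$NDD" -get /dev/ip ip_ire_gw_probe) || return 1
    if [ "$current" = "0" ]; then
        echo "ip_ire_gw_probe is now 0 (disabled)"
    else
        echo "ip_ire_gw_probe is still $current" >&2
        return 1
    fi
}

# Usage (as root):
#   disable_gw_probe
```

This only changes the running kernel; the nddconf entry is still needed for the setting to survive a reboot.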

Jeromejay
Advisor

Re: VM loses network connection

Hi again,

 

So before making any changes across all servers, and because I have some time, I thought I'd go for a quick test first:

 

- I blocked outgoing ICMP on the server (using the firewall).

- After the expected ~3 min, I started getting the error messages about the dead gateway...

- But the network connectivity was still there (I could still SSH and HTTP to the server).

 

So:

- either my quick test is flawed

- or the error message is a consequence of the server dropping its network connectivity (i.e. something else fails, and then the server can't ping the GW and logs the message).

 

If it's the second option, could you give me an exhaustive list of checks I can do for network investigation? (My knowledge stops at ping, basic nwmgr commands, netstat, lsof, log investigation, and lanscan.)

 

thanks again for all the tips and explanations !

Patrick Wallek
Honored Contributor

Re: VM loses network connection

>>but the network connectivity was still there (I can still SSH, and HTTP to the server).

 

Where were you SSH'ing from?  Were you on the same network segment as the VM (where you DO NOT have to go through the router)?  If so, the fact that you can SSH and HTTP makes sense.

 

Things to check:

 

netstat -in

netstat -rn 

ping a server on the same subnet

ping  the router

ping something on a different network subnet
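
The three ping checks narrow down where connectivity stops. A rough sketch of that reasoning, assuming a POSIX shell; probe and diagnose are hypothetical helper names, and probe uses the HP-UX argument order "ping host -n count", which differs on other systems:

```shell
#!/bin/sh
# probe: print "ok" if the host answers ping, "fail" otherwise.
# Uses the HP-UX form "ping host -n count" (assumption: adjust elsewhere).
probe() {
    if ping "$1" -n 3 >/dev/null 2>&1; then echo ok; else echo fail; fi
}

# diagnose: turn the three results into a rough hint about where to look.
# Arguments, each "ok" or "fail": same-subnet host, router, remote-subnet host.
diagnose() {
    case "$1/$2/$3" in
        ok/ok/ok)   echo "all reachable: no outage at the IP level" ;;
        fail/*)     echo "same subnet unreachable: check the NIC, vswitch or VLAN" ;;
        ok/fail/*)  echo "router unreachable: check the gateway or ICMP filtering" ;;
        ok/ok/fail) echo "remote subnet unreachable: check routing beyond the gateway" ;;
        *)          echo "unexpected input" ;;
    esac
}

# Usage, with placeholder host names to be replaced by real ones:
#   diagnose "$(probe neighbour)" "$(probe router)" "$(probe remote)"
```

The hints are only starting points; a filtered ping looks the same as a dead host, so the netstat output is still worth checking alongside.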

 

 

Jeromejay
Advisor

Re: VM loses network connection

I was SSHing from another network, and HTTP from yet another one... which, in short, implies that the network connectivity was still there (as opposed to the original outage, where everything was down).

 

Also: I could not ping anything, since I blocked outbound ping (maybe I should have blocked only the GW IP).

 

 

Thinking back on it:

As mentioned in my original post, restarting the virtual switch on the physical host solved the issue for the guest... would that tell us that the dead GW detection was a consequence, and not the issue?

 

Moreover, we know for sure that the GW is fine (a reliable server, used by many other servers). If our faulty server failed one ping to the GW and activated the infamous dead GW detection by no longer using this route, would it not come back on the next successful ping? (I assume it keeps on trying, since we get error messages every 183 seconds.)

 

 

All in all: the more I think about it, the more I think the dead GW detection error message is a consequence of another failure...

 

note: still in the process of updating the AVIO drivers

 

Patrick Wallek
Honored Contributor

Re: VM loses network connection

>>restarting the virtual switch on the physical host solved the issue for the guest... would that tell us that the dead GW detection was a consequence, and not the issue?

 

I would think so, yes.  

 

Is there a way to check statistics for the virtual switch?  Things like packets in, packets out, number of errors, etc?

 

>>would it not come back on the next successful ping?

 

I don't think it does.  I think once it is disabled, it stays that way.  I could be wrong though...

 

>>the more I think the Dead GW detection error message is a consequence...

 

I tend to agree.  A ping is a pretty low level check.  If the network is so busy that a ping is dropped, then I would think there are other issues.

Jeromejay
Advisor

Re: VM loses network connection

>>>>would it not come back on the next successful ping?

 >>I don't think it does.  I think once it is disabled, it stays that way.  I could be wrong though...

 

No offence, but I hope you're wrong :) (I can't conceive that HP would have done something that stupid.)

Also: since the error repeats every 183 seconds in the log file, I guess it keeps on trying.

 

 

As for checking the virtual switch: I should have done it before the restart ... (like any other investigation ...).

 

As usual in this kind of case: I'm not sure whether I hope it happens again so I can investigate, or whether I hope it never happens again.
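
One way to avoid losing the evidence next time is to capture the state from the VM console before restarting anything. A minimal sketch, assuming a POSIX shell; snapshot is a hypothetical helper name, the command list only reuses commands already mentioned in this thread, and lan0 is the interface from the original post:

```shell
#!/bin/sh
# snapshot: save the output of a few read-only network commands into a
# timestamped directory, so the state during an outage can be compared later.
# Hypothetical helper; prints the directory it created.
snapshot() {
    dir="$1/netsnap.$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dir" || return 1
    n=0
    while read -r cmd; do
        n=$((n + 1))
        set -- $cmd                       # split the line into words
        if command -v "$1" >/dev/null 2>&1; then
            "$@" >"$dir/$n.out" 2>&1 </dev/null
        else
            echo "$1: not found" >"$dir/$n.out"
        fi
        echo "$n: $cmd" >>"$dir/index"    # map file number to command
    done <<EOF
netstat -in
netstat -rn
lanscan
nwmgr -c lan0 -v
ndd -get /dev/ip ip_ire_status
EOF
    echo "$dir"
}

# Usage:
#   snapshot /var/tmp
```

Taking one snapshot while everything works gives a baseline to diff against the one taken during the outage.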

Patrick Wallek
Honored Contributor

Re: VM loses network connection

I have been looking into the dead gateway detection some more and found something I was not aware of.

 

Supposedly, if the dead gateway is the last default gateway, it will remain enabled, but a message will still be logged.

 

To check the status of a gateway:

 

# ndd -get /dev/ip ip_ire_status | grep -e IRE_GATEWAY -e flag

 

I cannot find anything definitive about re-enabling a gateway, but the following from 'ndd -h' indicates that a dead gateway should be re-enabled once it responds again:

 

# ndd -h ip_ire_gw_probe_interval

ip_ire_gw_probe_interval:

Controls the probe interval for Dead Gateway Detection.
IP periodically probes active and dead gateways.
ip_ire_gw_probe_interval controls the frequency of probing.
With retries, the maximum time to detect a dead gateway is ip_ire_gw_probe_interval + 10000 milliseconds. 
Maximum time to detect that a dead gateway has come back to life is ip_ire_gw_probe_interval. [15000,- ] Default: 180000 (3 minutes)
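
As a rough check of those numbers, assuming the help text above describes the running system, the worst-case detection time can be worked out from the interval. max_detect_seconds is a hypothetical helper name:

```shell
#!/bin/sh
# max_detect_seconds: worst-case time to notice a dead gateway, following the
# "ndd -h ip_ire_gw_probe_interval" text: interval + 10000 ms of retries.
# Hypothetical helper; the argument is the probe interval in milliseconds.
max_detect_seconds() {
    echo $(( ($1 + 10000) / 1000 ))
}

max_detect_seconds 180000    # default interval, prints 190
```

With the default interval that gives 190 seconds, so a log message roughly every 183 seconds looks consistent with the gateway being re-probed once per interval.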

 

 

Jeromejay
Advisor

Re: VM loses network connection

Thank you so much for the additional information!

I'll keep the command to check the dead gateway status... although, since the error message is logged, I can already guess the results.

 

Note: too bad, the server is still up and running.

 

 

PS: I forgot to add: the other VM on the same host has the same IP settings as the failing one... if one detects the GW as down, the other "should" probably do the same.

 

Thanks again !