serviceguard marks nic down after subnet failure

 
Robert Meredith
Occasional Contributor

When testing our SG cluster (SG v10.12, HP-UX 10.20, patched to June 2001), we pulled both network connections to test package failure (i.e. the heartbeat lan was still alive).

After reconnecting the network cables, the primary lan is OK, but cmviewcl shows the second lan card marked down, so lan failover will no longer work.

However, I have just run cmscancl -n nodename and it is now marked up. Any ideas why this is? It means that after a subnet failure we have to intervene manually to reset the nic.

Cheers,

Rob
2 REPLIES
James R. Ferguson
Acclaimed Contributor

Re: serviceguard marks nic down after subnet failure

Hi Rob:

In the event of a subnet outage, where none of the servers can run a package, the package will remain down. The first node to see the subnet again will adopt or run the package. If all nodes see the subnet return simultaneously, then the primary node will (re)start the package. This scenario seems to be the one you described.

I suspect that there is/was nothing wrong with the cluster failover capabilities in the case you describe. Rather, the 'cmscancl' command simply caused a refresh of the network status.
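
One way to see whether the status refreshes on its own is to filter the cmviewcl output for standby interfaces still shown as down. This is only a minimal sketch: the function name down_standby_lans is made up for illustration, and it assumes that cmviewcl -v prints one row per interface in the form "STANDBY down 60/6 lan1" (role, status, path, name).

```shell
# Sketch: list standby LANs that cmviewcl reports as down.
# Assumption: "cmviewcl -v" prints rows such as
#   STANDBY      down         60/6         lan1
# Reads that output on stdin and prints the interface name (last column).
down_standby_lans() {
    awk '$1 == "STANDBY" && $2 == "down" { print $NF }'
}

# Usage on a cluster node (not run here):
#   cmviewcl -v | down_standby_lans
```

If this prints nothing a short while after the cables are reconnected, the standby lan has been marked up again without manual intervention.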

...JRF...
Carsten Krege
Honored Contributor

Re: serviceguard marks nic down after subnet failure

If I understand correctly, you pulled the lan cables of both the primary and the failover lan to force a package failover. After you reconnected the cables, the primary lan was successfully marked "UP", whereas the standby lan remained "DOWN".

SG has two means to detect lan failures:

1) the lan driver reports an error
2) no increase of network statistics for 7*NETWORK_POLLING_INTERVAL (for ethernet) for the interface

With NETWORK_POLLING_INTERVAL=2 seconds (default value), it can take up to 16 seconds before the standby lan is marked UP again. Check the NETWORK_POLLING_INTERVAL value on your cluster.
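
To check the value, the arithmetic from point 2 can be applied to a saved copy of the cluster ASCII configuration file. The sketch below is an illustration under assumptions: the function name inactivity_window_secs is hypothetical, the file is assumed to have been written beforehand (for example with cmgetconf), and NETWORK_POLLING_INTERVAL is assumed to be stored in microseconds. It computes only the 7-interval window, so treat the result as a lower bound on the delay quoted above.

```shell
# Sketch: estimate the inactivity window from a cluster ASCII file.
# Assumptions: the file already exists, and NETWORK_POLLING_INTERVAL
# is given in microseconds (e.g. 2000000 for 2 seconds).
inactivity_window_secs() {
    conf_file="$1"
    # Take the first uncommented NETWORK_POLLING_INTERVAL line.
    usecs=$(awk '$1 == "NETWORK_POLLING_INTERVAL" { print $2; exit }' "$conf_file")
    # Fail if the parameter is not present in the file.
    [ -n "$usecs" ] || return 1
    # 7 polling intervals, converted from microseconds to whole seconds.
    echo $(( 7 * usecs / 1000000 ))
}

# Usage (file path is a placeholder):
#   inactivity_window_secs /path/to/cluster.ascii
```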

It could also be that the lan driver does not report the interface as UP again after the cable is reconnected. Check that you have the latest patches for your lan driver installed (plus any dependent patches).

Carsten