
Matt Hearn
Regular Advisor

Funny heartbeat situation

Hi all! About a week ago, we had a rather unique situation occur on a cluster we maintain. The box that was currently running the production application had some kind of network issue in that the NIC couldn't communicate. (A reboot brought it back to life.)

While the NIC was wigging out, access to the production app was completely cut off. The problem is, the secondary box still had an active crossover-cable heartbeat, so it thought the primary was fine and made no effort to take the package away. So, what should have been a 15-minute outage to fail over the package turned into a 3-hour affair while the escalations reached us and we logged in to see what was going on.

My question is: is there any way to make Serviceguard recognize a network failure on one box and move the package to the other? I'd imagine that to coordinate, the boxes would still need their heartbeat connection (otherwise the package wouldn't know to stop on the primary and the secondary box would never get the SCSI lock). I think there's a way to do it, but the docs are a bit hazy on the subject (or I'm a bit hazy in understanding them).

Any ideas? THANKS!
A. Clay Stephenson
Acclaimed Contributor

Re: Funny heartbeat situation

Actually this should not have been a package failover to another node but rather simply a switch to a standby LAN on the same node. Did you have a standby LAN connection?
If it ain't broke, I can fix that.
Stephen Doud
Honored Contributor
Solution

Re: Funny heartbeat situation

Some customers ignore the SUBNET parameter in the package configuration file. This makes Serviceguard blind to the relationship between a failed network and the package.
To remedy this, include network dependencies (SUBNET references) in the package configuration file and cmapplyconf the config file with the package down.

Additional info:
Serviceguard checks NIC status at the link level. On-the-fly TCP/IP configuration changes are not seen as a "down" NIC by Serviceguard.
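
As a rough illustration of the SUBNET remedy described above, a legacy-style package configuration file might carry a line like the one below. The package name pkg1, the file name pkg1.conf, and the subnet address are hypothetical, and modular-style packages use differently named parameters, so check the template that cmmakepkg generates for your release.

```
# pkg1.conf (excerpt) - hypothetical package name and subnet
PACKAGE_NAME    pkg1
SUBNET          192.168.10.0

# Applying the change with the package down:
#   cmhaltpkg pkg1
#   cmcheckconf -P pkg1.conf
#   cmapplyconf -P pkg1.conf
#   cmrunpkg pkg1
#   cmmodpkg -e pkg1     (re-enable package switching)
```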
John Bigg
Esteemed Contributor

Re: Funny heartbeat situation

Unfortunately it really depends on exactly how the lan failed so it's difficult to advise.

If the lan failed at the hardware level so that it was unable to send and receive link level data then Serviceguard would have detected this and would have marked the lan down. If there was a standby lan then the IP stack would have been moved over and you would not have noticed any downtime at all. If there was no standby then the subnet would have been marked down since there would have been no working lan to run the subnet. In this case if the package had been defined to monitor this subnet, with the SUBNET keyword in the package files, then Serviceguard would have automatically moved the package over to a node which had this subnet available.

I would suggest that you ensure both of these are set up to minimise downtime: firstly, a standby lan to take over in case of trouble, and secondly, a monitored subnet so the package is moved should the subnet totally fail.
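
For the standby lan half of that advice, a sketch of the relevant part of the cluster ASCII configuration file is shown below. The node name, interface names, and addresses are all hypothetical. The key assumption is that a standby interface is listed as a NETWORK_INTERFACE with no IP entry beneath it and is cabled to the same bridged network as the primary it protects.

```
# cluster ASCII file (excerpt) - hypothetical node, interfaces, addresses
NODE_NAME               node1
  # primary data lan
  NETWORK_INTERFACE     lan0
    HEARTBEAT_IP        192.168.10.11
  # standby for lan0: no IP entry underneath
  NETWORK_INTERFACE     lan1
  # crossover-cable heartbeat
  NETWORK_INTERFACE     lan2
    HEARTBEAT_IP        10.0.0.11
```

With a layout like this, a link-level failure of lan0 would be expected to trigger a local switch to lan1 before any package failover is considered.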

However, there are situations where this still is not enough. It is possible for problems to occur such that the card appears to be working, i.e. it can send and receive link-level messages, but is unable to send or receive IP-level messages. This could happen if there were a problem at the transport driver level rather than at the hardware level. This situation is much harder to handle, since Serviceguard thinks the lan is working and so neither performs a lan switch (which might make no difference anyway if the problem is at the driver level) nor marks the subnet down. Therefore the package continues to run. There is no easy way to configure things to handle this situation, since Serviceguard is not designed to monitor IP-level connectivity. Although this situation is rare, unless you manually add services to your packages to do IP-level checking, the best way to handle it is for an operator to move the package manually should it occur.
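
One way to picture the "manually add services to do IP level checking" option is a small monitor script run as a package service: it probes an address that should always be reachable (for example the default gateway) and exits non-zero after several consecutive failures, so that Serviceguard sees the service as failed. This is only a sketch. The function names, the threshold, and the interval are made up for illustration, and the ping syntax assumes the HP-UX form where the count follows the host.

```shell
#!/usr/bin/sh
# Hypothetical IP-level monitor for use as a Serviceguard package service.
# All names and values here are illustrative assumptions.

# probe TARGET: succeed if TARGET answers one ICMP echo.
# Assumes HP-UX ping syntax ("ping host -n count"); adjust for other systems.
probe() {
    ping "$1" -n 1 >/dev/null 2>&1
}

# monitor TARGET MAX_FAILS INTERVAL
# Loops while the target is reachable; returns 1 once MAX_FAILS
# consecutive probes have failed. A successful probe resets the count.
monitor() {
    target=$1
    max_fails=$2
    interval=$3
    fails=0
    while [ "$fails" -lt "$max_fails" ]; do
        if probe "$target"; then
            fails=0
        else
            fails=$((fails + 1))
        fi
        if [ "$fails" -lt "$max_fails" ]; then
            sleep "$interval"
        fi
    done
    return 1
}
```

A wrapper would call something like `monitor 192.168.10.1 3 10` and exit with its return code. The script would then be registered as a service of the package, with the package set to fail over rather than restart the service in place. A single gateway is itself a single point of failure for the check, so probing two independent addresses and failing only when both are unreachable may be safer.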