Funny heartbeat situation
Operating System - HP-UX
04-04-2006 04:36 AM
Hi all! About a week ago, we had a rather unique situation occur on a cluster we maintain. The box that was currently running the production application had some kind of network issue in that the NIC couldn't communicate. (A reboot brought it back to life.)
While the NIC was wigging out, access to the production app was completely cut off. The problem is, the secondary box still had an active crossover-cable heartbeat, so it thought the primary was fine and made no effort to take the package away. So what should have been a 15-minute outage to fail over the package turned into a 3-hour affair while the escalations reached us and we logged in to see what was going on.
My question is: is there any way to make Serviceguard recognize a network failure on one box and move the package to the other? I'd imagine that to coordinate, the boxes would still need their heartbeat connection (otherwise the package wouldn't know to stop on the primary and the secondary box would never get the SCSI lock). I think there's a way to do it, but the docs are a bit hazy on the subject (or I'm a bit hazy in understanding them).
Any ideas? THANKS!
3 REPLIES
04-04-2006 04:44 AM
Re: Funny heartbeat situation
Actually this should not have been a package failover to another node but rather simply a switch to a standby LAN on the same node. Did you have a standby LAN connection?
If it ain't broke, I can fix that.
04-04-2006 05:01 AM
Solution
Some customers ignore the SUBNET parameter in the package configuration file. This makes Serviceguard blind to the relationship between a failed network and the package.
To remedy this, include network dependencies (SUBNET references) in the package configuration file and cmapplyconf the config file with the package down.
Additional info:
Serviceguard checks NIC status at the link level. On-the-fly tcp/ip configuration changes are not seen as a "down" NIC by Serviceguard.
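As a rough way to audit for the missing-SUBNET problem described above, here is a minimal Python sketch that scans the text of a package configuration file for SUBNET lines. It assumes a simple keyword/value layout with `#` comments; the function name `monitored_subnets` and every value in the sample (package name, node names, addresses) are made up for illustration and are not taken from any real cluster.

```python
def monitored_subnets(config_text):
    """Return the addresses named on SUBNET lines of a package config file."""
    subnets = []
    for raw in config_text.splitlines():
        # Assumption: '#' starts a comment, so ignore everything after it.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "SUBNET":
            subnets.append(parts[1])
    return subnets


# Hypothetical package file contents, for illustration only.
sample = """\
PACKAGE_NAME   prodpkg
NODE_NAME      node1
NODE_NAME      node2
# SUBNET 10.0.0.0   (commented out, so not monitored)
SUBNET         192.168.10.0
"""

found = monitored_subnets(sample)
if found:
    print("Package monitors subnets:", ", ".join(found))
else:
    print("WARNING: no SUBNET entries; a subnet failure will not fail the package")
```

An empty result for a production package would be the case the reply warns about: Serviceguard has nothing tying the package to the network it depends on.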
04-04-2006 08:20 PM
Re: Funny heartbeat situation
Unfortunately it really depends on exactly how the lan failed so it's difficult to advise.
If the lan failed at the hardware level so that it was unable to send and receive link level data then Serviceguard would have detected this and would have marked the lan down. If there was a standby lan then the IP stack would have been moved over and you would not have noticed any downtime at all. If there was no standby then the subnet would have been marked down since there would have been no working lan to run the subnet. In this case if the package had been defined to monitor this subnet, with the SUBNET keyword in the package files, then Serviceguard would have automatically moved the package over to a node which had this subnet available.
I would suggest that you ensure both of these are set up to minimise downtime: firstly a standby lan to take over in case of trouble, and secondly a monitored subnet so the package is moved should the subnet totally fail.
However, there are situations where this still is not enough. It is possible for problems to occur such that the card appears to be working, i.e. it can send and receive link level messages, but is unable to send or receive IP level messages. This could happen if there were a problem at the transport driver level rather than at the hardware level. This situation is much harder to handle, since Serviceguard thinks the lan is working and so neither performs a lan switch (which might make no difference anyway if the problem is at the driver level) nor marks the subnet down. Therefore the package continues to run. There is no easy way to configure things to handle this situation, since Serviceguard is not designed to monitor IP level connectivity. Although this situation is rare, unless you are going to manually add services to your packages to do IP level checking, the best way to handle it is for an operator to manually move the package should it occur.
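For anyone who does want to add a service for IP level checking, a minimal Python sketch of the idea follows. It is not a Serviceguard feature or a tested production monitor. It assumes the script runs as a package service and that the service exiting is treated as a service failure, which is how I understand package services to behave. The names `tcp_probe` and `monitor`, the thresholds, and the target address are all hypothetical, and a TCP connect to a host on the monitored subnet stands in for whatever reachability test suits your network.

```python
import socket
import time


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def monitor(probe, max_failures=3, interval=10.0, sleep=time.sleep, max_checks=None):
    """Call probe() repeatedly; return 1 after max_failures consecutive failures.

    max_checks bounds the loop for testing; None means run until failure.
    Returns 0 if max_checks is reached without hitting the failure threshold.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0  # any success resets the consecutive-failure count
        else:
            failures += 1
            if failures >= max_failures:
                return 1
        sleep(interval)
    return 0


# Hypothetical invocation as the service command (address and port are placeholders):
#   import sys
#   sys.exit(monitor(lambda: tcp_probe("192.0.2.1", 22)))
```

Requiring several consecutive failures keeps one dropped connection from bouncing the package. Pick a probe target that does not depend on the node itself, such as a router or another always-up host on the same subnet, otherwise the check passes even when clients cannot reach the application.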