- Re: ideal behaviour of the cluster system
04-29-2002 08:59 AM
Any views will be highly appreciated.
04-29-2002 09:06 AM
Sandip
04-29-2002 09:11 AM
Generally this behavior can be avoided, or its chances of causing a cluster reformation lessened, if you have a heartbeat over a serial (RS232) connection, too.
Regards!
...JRF...
04-29-2002 09:12 AM
04-29-2002 09:22 AM
In a 2-node cluster, I always prefer direct-connected heartbeats if at all feasible, i.e. when the nodes are in the same room or building. They can be LAN or serial, but the key is direct connection.
This way, trouble elsewhere on the network will never "lose" the heartbeat.
If that is not possible, then increasing the timeout is the only option, but note that this will slow down the failover.
Rgds,
Jeff
04-29-2002 09:50 AM
Some points:
1) Apply the patch PHSS_26338 (s700_800 11.X MC/ServiceGuard and SG-OPS Edition A.11.09). It includes fixes for many issues with network cards, heartbeat and MC/SG errors. Read the patch documentation for details. Read the patch warnings too.
2) Check the network card performance, switches and other devices.
3) Check the network polling, intervals, load and time-out values.
4) If you have another cluster in the same network, then compare the MC/SG parameters.
5) Apply the latest patches from Custom patch manager.
HTH,
Shiju
04-29-2002 12:19 PM
If the heartbeats were lost and/or the configured timeout settings are too low, you could see this situation.
If the node that stayed up logged a message saying "obtaining Cluster Lock", then SG did what it is designed to do.
As for a serial heartbeat: it is very unreliable and is NOT a full heartbeat, so I generally recommend against it.
04-30-2002 05:00 AM
Solution
From what I've read, your nodes are interconnected with 2 heartbeat LANs.
Verify this by inspecting the cluster ASCII configuration file, and verify that at least two LANs (per node) are described as HEARTBEAT_IP. This ensures a redundant path for heartbeat.
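For anyone who wants to do that check mechanically, here is a minimal sketch in Python. It assumes the cluster ASCII file follows the usual layout of NODE_NAME entries followed by NETWORK_INTERFACE and HEARTBEAT_IP lines. The sample text, node names, addresses and function names are all made up for illustration and are not taken from the poster's cluster.

```python
# Hypothetical excerpt of a cluster ASCII file; node names and addresses are invented.
SAMPLE = """\
NODE_NAME nodeA
  NETWORK_INTERFACE lan0
    HEARTBEAT_IP 192.168.10.1
  NETWORK_INTERFACE lan1
    HEARTBEAT_IP 192.168.20.1
NODE_NAME nodeB
  NETWORK_INTERFACE lan0
    HEARTBEAT_IP 192.168.10.2
  NETWORK_INTERFACE lan1
    STATIONARY_IP 10.0.0.2
"""


def heartbeat_ips_per_node(text):
    """Return {node_name: [heartbeat IPs]} parsed from cluster ASCII file text."""
    result = {}
    node = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        parts = line.split()
        if len(parts) < 2:
            continue
        key, value = parts[0], parts[1]
        if key == "NODE_NAME":
            node = value
            result.setdefault(node, [])  # list the node even with no heartbeat LAN
        elif key == "HEARTBEAT_IP" and node is not None:
            result[node].append(value)
    return result


def nodes_lacking_redundancy(text, minimum=2):
    """Return the nodes that have fewer than `minimum` HEARTBEAT_IP entries."""
    counts = heartbeat_ips_per_node(text)
    return sorted(name for name, ips in counts.items() if len(ips) < minimum)


if __name__ == "__main__":
    print(heartbeat_ips_per_node(SAMPLE))
    print(nodes_lacking_redundancy(SAMPLE))
```

In the invented sample, nodeB has only one HEARTBEAT_IP, so it is the node that would be reported as lacking a redundant heartbeat path.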
James recommended implementing a serial HB cable. It is only supported in a 2-node cluster. It won't prevent a node from TOC'ing (dumping core and rebooting), but it ensures that the node with viable LANs becomes the new cluster coordinator when all of the other node's HB LANs cease to operate.
Typically, the cause of this undesirable occurrence is leaving NODE_TIMEOUT at its default of 2 seconds (2,000,000 microseconds) in the cluster ASCII file. Though 2 seconds is supported, more often than not kernel tuning and load let a node do kernel-intensive work long enough to delay heartbeat generation, which causes a NODE_TIMEOUT and a cluster reformation.
syslog.log will report these. Severe enough delays can also result in a node rebooting due to failure to join the newly formed cluster.
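Because the timing values in the cluster ASCII file are given in microseconds, it is easy to misread them. The sketch below is plain arithmetic, not ServiceGuard's internal logic: it converts the values to seconds and shows how many heartbeat intervals fit inside the timeout. The 1-second heartbeat interval is an assumption for illustration, and the function name is hypothetical.

```python
US_PER_SECOND = 1_000_000  # the cluster ASCII file expresses timing in microseconds


def timing_summary(heartbeat_interval_us, node_timeout_us):
    """Convert the two timing values to seconds and relate them to each other."""
    return {
        "heartbeat_interval_s": heartbeat_interval_us / US_PER_SECOND,
        "node_timeout_s": node_timeout_us / US_PER_SECOND,
        # Whole heartbeat intervals that fit inside one NODE_TIMEOUT period.
        "intervals_per_timeout": node_timeout_us // heartbeat_interval_us,
    }


if __name__ == "__main__":
    # Default NODE_TIMEOUT of 2 seconds with an assumed 1-second heartbeat interval.
    print(timing_summary(1_000_000, 2_000_000))
    # A larger timeout leaves far more room for a delayed heartbeat.
    print(timing_summary(1_000_000, 12_000_000))
```

The trade-off is the one already mentioned in this thread: a larger NODE_TIMEOUT tolerates longer heartbeat delays, but a genuinely failed node takes correspondingly longer to be detected.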
Please read this article for more information:
UXSGLVKBAN00000010
Finally, as a matter of courtesy, please consider giving points to the correspondents on this issue.
04-30-2002 05:36 AM
# Cluster Timing Parameters (microseconds).
HEARTBEAT_INTERVAL 1000000
NODE_TIMEOUT 12000000
With all the replies, I think the best thing to do at present is to increase the NODE_TIMEOUT value.
But one thing I don't understand: since I had a dedicated heartbeat, why did the cluster perform a TOC? I can understand that the servers got really busy on the LAN cards, but the heartbeat is of top importance and should not have timed out because of that.
05-02-2002 05:27 AM
Since your cluster appears to be configured to handle HB traffic redundantly and within the NODE_TIMEOUT period, a system hang may have occurred.
ServiceGuard features a "safety timer" that has the ability to TOC a hung server. When a server hangs, HB transmission from that server ceases, causing a cluster reformation on the remaining active nodes. Since it is likely that another server is configured to take over the hung server's packages (and volume groups), it is necessary to TOC the hung server to prevent data corruption in case it becomes "unhung" later, allowing it to write to disks activated on the failover node.
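To illustrate the idea only, here is a toy watchdog model in Python. It is not ServiceGuard's actual implementation: the class, the function and the numbers are hypothetical. A healthy node keeps re-arming the timer; if it stops doing so (a hang), the deadline passes and the node would be halted.

```python
class SafetyTimer:
    """Toy watchdog: it must be re-armed before its deadline or it expires."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s

    def rearm(self, now):
        """Called while the node is healthy; pushes the deadline out."""
        self.deadline = now + self.timeout_s

    def expired(self, now):
        return now >= self.deadline


def first_expiry(timeout_s, rearm_times, end_time):
    """Return the time at which the timer expires, or None if it never does.

    rearm_times lists the moments at which the node managed to re-arm the
    timer; a gap longer than timeout_s models a hung node.
    """
    timer = SafetyTimer(timeout_s, now=0.0)
    for t in sorted(rearm_times):
        if t > end_time:
            break
        if timer.expired(t):
            return timer.deadline  # the re-arm came too late
        timer.rearm(t)
    return timer.deadline if timer.expired(end_time) else None


if __name__ == "__main__":
    # Healthy node: re-arms every second for 10 seconds, never expires.
    print(first_expiry(3.0, [float(t) for t in range(1, 11)], end_time=10.0))
    # Node hangs after t=3: the last deadline (t=6) passes and the timer expires.
    print(first_expiry(3.0, [1.0, 2.0, 3.0], end_time=10.0))
```

The point of the model is that the timer fires on the hung node itself, without any help from the rest of the cluster, which is what keeps a node that later becomes "unhung" from writing to disks already taken over elsewhere.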
Check /var/adm/crash for a recent core dump. If one exists, use this document OZBEKBRC00000611 to run the "q4" utility and prepare files for HP to review to determine the nature of the hang.
-s.