Operating System - HP-UX
ashish nanjiani
Frequent Advisor

ideal behaviour of the cluster system

We have two Superdomes connected via ServiceGuard. Last week, due to a network outage, the servers lost their connectivity to the outside world as both of the redundant LAN cards in each server failed to receive any signals. Apparently this caused one of the servers to panic. Support said this happened because the system was too busy with the LAN cards, failed to keep the heartbeat intact, and the heartbeat timed out. In my opinion this is not normal behaviour; under these circumstances the servers should have remained up and simply flashed error messages about the cards.

Any views will be highly appreciated.

9 REPLIES
Sandip Ghosh
Honored Contributor

Re: ideal behaviour of the cluster system

Do you not have a separate LAN card for the heartbeat? You can increase the heartbeat timeout (NODE_TIMEOUT) to 10 seconds to avoid this type of problem. By default it is only 2 seconds.
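For example, the relevant timing parameters live in the cluster ASCII configuration file (the values below are only illustrative, not a recommendation for your environment):

# Cluster Timing Parameters (microseconds)
HEARTBEAT_INTERVAL    1000000     # 1 second between heartbeats
NODE_TIMEOUT         10000000     # 10 seconds before a node is declared failed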

Sandip
Good Luck!!!
James R. Ferguson
Acclaimed Contributor

Re: ideal behaviour of the cluster system

Hi:

Generally this behavior can be avoided, or its chances of causing a cluster reformation lessened, if you have a heartbeat over a serial (RS232) connection, too.
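If you go that route, the serial link is declared per node in the cluster ASCII file with a SERIAL_DEVICE_FILE entry, roughly as sketched below (node names and the tty device file are placeholders; use whichever port you actually cable between the two systems):

NODE_NAME node1
  SERIAL_DEVICE_FILE /dev/tty0p0
NODE_NAME node2
  SERIAL_DEVICE_FILE /dev/tty0p0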

Regards!

...JRF...
ashish nanjiani
Frequent Advisor

Re: ideal behaviour of the cluster system

I have two separate LAN cards for the heartbeat.
Jeff Schussele
Honored Contributor

Re: ideal behaviour of the cluster system

Hi ashish,

In a 2-node cluster, I always prefer to have directly connected heartbeats - if at all feasible, i.e. in the same room/building etc. They can be LAN or serial - but the key is the direct connection.
That way network trouble can never "lose" the heartbeat.

If that is not possible, then increasing the timeout is the only option - but note that this will slow down failover.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Helen French
Honored Contributor

Re: ideal behaviour of the cluster system

Hi Ashish:

Some points:

1) Apply the patch - PHSS_26338 (s700_800 11.X MC/ServiceGuard and SG-OPS Edition A.11.09). This has fixes for a lot of issues with network cards/heartbeat/MC/SG errors (see the swlist sketch after this list). Read the patch documentation for details, and read the patch warnings too.

2) Check the network card performance, switches and other devices.

3) Check the network polling, intervals, load and time-out values.

4) If you have another cluster in the same network, then compare the MC/SG parameters.

5) Apply the latest patches from Custom patch manager.
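
As a rough sketch for point 1, you can check whether the patch (or something that supersedes it) is already installed with swlist:

# list installed products/patches and look for PHSS_26338
swlist -l product | grep PHSS_26338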

HTH,
Shiju
Life is a promise, fulfill it!
melvyn burnard
Honored Contributor

Re: ideal behaviour of the cluster system

Without knowing your configuration, and without the exact details of what lost contact, it is not easy to give a concrete answer, but I suspect that the cluster behaved exactly as planned.
If the heartbeats were lost and/or the configured settings are too low, you could see this situation.
If the node that stayed up has a logged message saying "obtaining Cluster Lock", then SG did what it is designed to do.
As for a serial heartbeat: it is very unreliable and is NOT a full heartbeat, so I generally recommend against it.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Stephen Doud
Honored Contributor
Solution

Re: ideal behaviour of the cluster system

Ashish,
From what I've read, your nodes are interconnected with 2 heartbeat LANs.
Verify this by inspecting the cluster ASCII configuration file and confirming that at least two LANs (per node) are described as HEARTBEAT_IP. This ensures a redundant path for the heartbeat.
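For example (the cluster name and file path are placeholders), you can dump the running configuration with cmgetconf and look for the heartbeat entries:

# write the running cluster configuration to a file and inspect it
cmgetconf -c <clustername> /tmp/cluster.ascii
egrep 'NODE_NAME|HEARTBEAT_IP' /tmp/cluster.ascii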

James recommended implementation of the serial HB cable. It is only supported in a 2-node cluster. It won't prevent a node from TOC'ing (dumping core and rebooting), but it ensures that the node with viable LANs becomes the new cluster coordinator when the other node's HB LANs all cease to operate.

Typically, the cause of this undesirable occurrence is leaving NODE_TIMEOUT set to the default 2 seconds (2 million microseconds) in the cluster ASCII file. Though 2 seconds is supported, more often than not kernel tuning and load allow a node to do kernel-intensive work long enough to delay heartbeat generation, causing a NODE_TIMEOUT and a cluster reformation.
syslog.log will report these. Severe enough delays can also result in a node rebooting due to failure to join the newly formed cluster.
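
For instance, a quick way to spot these events on each node (the path is the standard HP-UX syslog location):

# look for Serviceguard (cmcld) heartbeat/timeout and reformation messages
grep cmcld /var/adm/syslog/syslog.log | more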

Please read this article for more information:
UXSGLVKBAN00000010

Finally, as a matter of courtesy, please consider giving points to the correspondents on this issue.
ashish nanjiani
Frequent Advisor

Re: ideal behaviour of the cluster system

Thanks for all your help. The exact configuration I have has one dedicated LAN card for the heartbeat and another, redundant one shared with the data LAN. The timeout values are:
# Cluster Timing Parameters (microseconds).

HEARTBEAT_INTERVAL 1000000
NODE_TIMEOUT 12000000

With all the replies, I think the best possible thing to do at present is to increase the NODE_TIMEOUT value.
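
If I understand correctly, the change is applied by editing that value in the cluster ASCII file and re-applying the configuration, roughly like this (depending on the ServiceGuard version, the cluster may need to be halted before the timing change is accepted):

# after editing NODE_TIMEOUT in the cluster ASCII file
cmcheckconf -C /tmp/cluster.ascii
cmapplyconf -C /tmp/cluster.ascii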

But one thing I don't understand: since I had a dedicated heartbeat, why did the cluster perform a TOC? I can understand that the servers got really busy with the LAN cards, but the heartbeat is of top importance and it should not have timed out.


Stephen Doud
Honored Contributor

Re: ideal behaviour of the cluster system

Hello Ashish,

Since your cluster appears to be configured to handle HB traffic redundantly and within the NODE_TIMEOUT period, a system hang may have occurred.

ServiceGuard features a "safety timer" that has the ability to TOC a hung server. When a server hangs, HB transmission from that server ceases, causing a cluster reformation on the remaining active nodes. Since it is likely that another server is configured to take over the hung server's packages (and volume groups), it is necessary to TOC the hung server to prevent data corruption in case it becomes "unhung" later, allowing it to write to disks activated on the failover node.

Check /var/adm/crash for a recent core dump. If one exists, use this document OZBEKBRC00000611 to run the "q4" utility and prepare files for HP to review to determine the nature of the hang.
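
For example (the directory numbering will vary), check for a saved dump and note the panic/hang information recorded in the INDEX file; q4 itself normally lives in /usr/contrib/bin:

# look for a crash dump saved around the time of the failure
ls -l /var/adm/crash
cat /var/adm/crash/crash.0/INDEX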

-s.