
Jonathan H.
Occasional Advisor

NODE_TIMEOUT

This weekend one side of my 2-node cluster took a TOC. In q4, the analysis suggested that the cause might be that the NODE_TIMEOUT period is too low, and that I should set it to 8 seconds. I currently have 4 x 875 CPUs in the same cell on one side and 4 x 650 CPUs in the same cell on the other side.

Is the suggested setting correct for my system configuration? Also, should I increase the HEARTBEAT_INTERVAL?
John Poff
Honored Contributor

Re: NODE_TIMEOUT

Hi,

We have our NODE_TIMEOUT set for 8 seconds and our HEARTBEAT_INTERVAL set for 2 seconds. Those values seem to work well and we haven't had any random TOCs when the network was busy.

JP
A. Clay Stephenson
Acclaimed Contributor

Re: NODE_TIMEOUT

Well, unless I use The Force I have no way of knowing what your current settings are so that makes it a little difficult to make intelligent comments.

I can say that I use a HEARTBEAT_INTERVAL of 1000000 (1 s) and a NODE_TIMEOUT of 8000000 (8 s) and have never had a TOC; of course, I've never had a MC/SG failover in over 5 years that was not manually (and intentionally) triggered.

If you are using the default NODE_TIMEOUT of 2 s, you are really asking for incidents like yours. I do assume you have multiple HEARTBEAT_IPs defined.

If it ain't broke, I can fix that.
Jonathan H.
Occasional Advisor

Re: NODE_TIMEOUT

I have the HEARTBEAT_INTERVAL set at 3000000
and the NODE_TIMEOUT set at 6000000

We are currently running several clusters throughout the country and had never had this problem until we upgraded the CPUs on one side. Do your systems have the same size CPUs?
John Poff
Honored Contributor

Re: NODE_TIMEOUT

I don't think it is so much a function of how fast your CPUs are, but the combination of your settings. With HB at 3 seconds and TO at 6 seconds, that means you only have to miss two heartbeats and it is TOC time. Our settings of HB at 2 and TO at 8 means you have to miss 4 heartbeats. With Clay's settings you have to miss 8 heartbeats.
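
A quick sketch of that arithmetic, using the three sets of values quoted in this thread (in seconds). The function name and the dictionary are just for illustration; nothing here is part of Serviceguard itself.

```python
# Heartbeats that fit inside one NODE_TIMEOUT window, for the settings
# mentioned in this thread. Values are in seconds.
def heartbeats_per_timeout(heartbeat_s, timeout_s):
    """Return how many heartbeat intervals fit in the timeout period."""
    return timeout_s // heartbeat_s

# (HEARTBEAT_INTERVAL, NODE_TIMEOUT) as quoted by each poster
settings = {
    "Jonathan": (3, 6),
    "JP": (2, 8),
    "Clay": (1, 8),
}

for name, (hb, to) in settings.items():
    print(f"{name}: HB={hb}s TO={to}s -> {heartbeats_per_timeout(hb, to)} missed heartbeats before timeout")
```

The total timeout is similar in all three cases (6 to 8 seconds); what differs is how many consecutive heartbeats can be lost before the node is declared dead.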

JP
A. Clay Stephenson
Acclaimed Contributor

Re: NODE_TIMEOUT

In your case, you are running the minimum allowed value for NODE_TIMEOUT of 2 x HEARTBEAT_INTERVAL, which puts you on the hairy edge even though your total timeout (6 seconds) seems reasonable. You are essentially as vulnerable as someone running the absolute minimum of NODE_TIMEOUT = 2 s and HEARTBEAT_INTERVAL = 1 s. The speed of the CPUs should have little to do with this; indeed, it is quite common in MC/SG land to have very asymmetrical servers making up a cluster, especially if old klunkers are used for failover.

My rule (and it's just mine) is to never go below 3 heartbeat misses. I prefer more frequent heartbeats and a tolerance for more misses.
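
As a rough illustration of that rule, here is a small check that could be run against the two timing lines of a cluster configuration file. It assumes the lines look like `HEARTBEAT_INTERVAL 3000000` with values in microseconds, as quoted earlier in this thread; `parse_timing`, `check_timing`, and the "marginal"/"ok" labels are made-up names for this sketch, not Serviceguard tools.

```python
def parse_timing(config_text):
    """Pull HEARTBEAT_INTERVAL and NODE_TIMEOUT (microseconds) out of config text."""
    values = {}
    for line in config_text.splitlines():
        # Drop anything after a '#' (assumed comment marker), then split on whitespace
        parts = line.split("#", 1)[0].split()
        if len(parts) == 2 and parts[0] in ("HEARTBEAT_INTERVAL", "NODE_TIMEOUT"):
            values[parts[0]] = int(parts[1])
    return values

def check_timing(values, min_misses=3):
    """Classify the timeout/heartbeat ratio against the thresholds from this thread."""
    ratio = values["NODE_TIMEOUT"] / values["HEARTBEAT_INTERVAL"]
    if ratio < 2:
        return "invalid"   # below the 2 x HEARTBEAT_INTERVAL minimum
    if ratio < min_misses:
        return "marginal"  # allowed, but fewer misses tolerated than the rule of thumb
    return "ok"

sample = """
HEARTBEAT_INTERVAL 3000000
NODE_TIMEOUT 6000000
"""
print(check_timing(parse_timing(sample)))
```

The `min_misses` default of 3 encodes the personal rule of thumb above, not a documented limit, so adjust it to taste.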

Finally, just because you (and q4) think this is the reason for the TOC doesn't mean that it is. For example, an operator might have pushed the little button.
If it ain't broke, I can fix that.
Stephen Doud
Honored Contributor

Re: NODE_TIMEOUT

Suggest:
NODE_TIMEOUT = 8 seconds
HEARTBEAT_INTERVAL = 1 second
(sends up to 8 heartbeat packets before NODE_TIMEOUT expires)

Consider:
Create redundant heartbeat paths:
Review the cluster configuration file and look for STATIONARY_IP entries. If such an entry belongs to an Ethernet NIC, change it to HEARTBEAT_IP.
Then, with the cluster down, perform
# cmapplyconf -C

-StephenD.