topic Re: NODE_TIMEOUT in Operating System - HP-UX

NODE_TIMEOUT

Jonathan H. — Wed, 01 Dec 2004 11:13:41 GMT

This weekend one side of my 2-node cluster TOC'd. q4 suggested that the cause might be that the NODE_TIMEOUT period is too low, and that I set it to 8 seconds. I currently have 4-875 CPUs in the same cell on one side and 4-650 CPUs in the same cell on the other side.

Is the above setting correct for my system's configuration? Also, should I increase the HEARTBEAT_INTERVAL?

Re: NODE_TIMEOUT

John Poff — Wed, 01 Dec 2004 11:21:14 GMT

Hi,

We have our NODE_TIMEOUT set for 8 seconds and our HEARTBEAT_INTERVAL set for 2 seconds. Those values seem to work well and we haven't had any random TOCs when the network was busy.

JP

Re: NODE_TIMEOUT

A. Clay Stephenson — Wed, 01 Dec 2004 11:23:50 GMT

Well, unless I use The Force I have no way of knowing what your current settings are so that makes it a little difficult to make intelligent comments.

I can say that I use a HEARTBEAT_INTERVAL of 1000000 (1 s) and a NODE_TIMEOUT of 8000000 (8 s) and have never had a TOC; of course, I've never had a MC/SG failover in over 5 years that was not manually (and intentionally) triggered.

If you are using the default NODE_TIMEOUT of 2 s, you are really asking for incidents like yours. I do assume you have multiple HEARTBEAT_IPs defined.

Re: NODE_TIMEOUT

Jonathan H. — Wed, 01 Dec 2004 11:34:19 GMT

I have the HEARTBEAT_INTERVAL set at 3000000
and the NODE_TIMEOUT set at 6000000

We are currently running several clusters throughout the country and never had this problem until we upgraded the CPUs on one side. Do your systems have the same size CPUs?

Re: NODE_TIMEOUT

John Poff — Wed, 01 Dec 2004 11:37:59 GMT

I don't think it is so much a function of how fast your CPUs are as of the combination of your settings. With HB at 3 seconds and TO at 6 seconds, you only have to miss two heartbeats and it is TOC time. Our settings of HB at 2 and TO at 8 mean you have to miss 4 heartbeats. With Clay's settings you have to miss 8 heartbeats.

JP
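
The miss counts above are just NODE_TIMEOUT divided by HEARTBEAT_INTERVAL. A minimal Python sketch of that arithmetic, using the microsecond values quoted in this thread (the function name and the dictionary are illustrative only, not part of Serviceguard):

```python
def heartbeats_per_timeout(heartbeat_interval_us, node_timeout_us):
    """Whole heartbeat intervals that fit inside one NODE_TIMEOUT period."""
    return node_timeout_us // heartbeat_interval_us


# (HEARTBEAT_INTERVAL, NODE_TIMEOUT) in microseconds, as posted in the thread.
settings = {
    "Jonathan": (3_000_000, 6_000_000),
    "John": (2_000_000, 8_000_000),
    "Clay": (1_000_000, 8_000_000),
}

for name, (hb_us, timeout_us) in settings.items():
    misses = heartbeats_per_timeout(hb_us, timeout_us)
    print(f"{name}: {misses} missed heartbeats before NODE_TIMEOUT expires")
```

This prints 2, 4 and 8 for the three configurations, matching the figures in the post.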

Re: NODE_TIMEOUT

A. Clay Stephenson — Wed, 01 Dec 2004 12:00:06 GMT

In your case, you are running the minimum allowed NODE_TIMEOUT of 2 x HEARTBEAT_INTERVAL, which puts you on the hairy edge even though your total timeout (6 seconds) seems reasonable. You are essentially as vulnerable as someone running the absolute minimum of NODE_TIMEOUT = 2 s and HEARTBEAT_INTERVAL = 1 s. The speed of the CPUs should have little to do with this; indeed, it is quite common in MC/SG land to have very asymmetrical servers making up a cluster, especially if old klunkers are used for failover.

My rule (and it's just mine) is to never go below 3 heartbeat misses; as my settings show, I prefer more frequent heartbeats and tolerating more misses.

Finally, just because you (and q4) think this is the reason for the TOC doesn't mean that it is. For example, an operator might have pushed the little button.
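
The two thresholds mentioned in this post (the 2 x HEARTBEAT_INTERVAL minimum and the personal rule of at least 3 misses) can be turned into a quick sanity check. This is only a sketch: `check_timing`, its labels, and the `min_misses` default are hypothetical names chosen for illustration, and the 3-miss threshold is a rule of thumb from the thread, not a Serviceguard requirement.

```python
def check_timing(heartbeat_interval_us, node_timeout_us, min_misses=3):
    """Classify a HEARTBEAT_INTERVAL / NODE_TIMEOUT pair (both in microseconds).

    "invalid"  : below the 2 x HEARTBEAT_INTERVAL minimum cited in the thread
    "marginal" : allowed, but tolerates fewer misses than the rule of thumb
    "ok"       : tolerates at least min_misses missed heartbeats
    """
    misses = node_timeout_us / heartbeat_interval_us
    if misses < 2:
        return "invalid"
    if misses < min_misses:
        return "marginal"
    return "ok"


print(check_timing(3_000_000, 6_000_000))  # the original poster's settings
print(check_timing(1_000_000, 2_000_000))  # absolute minimum values
print(check_timing(1_000_000, 8_000_000))  # 1 s heartbeat, 8 s timeout
```

Both the 3 s / 6 s pair and the 1 s / 2 s pair come out as "marginal", which is the point made above: the ratio matters more than the total timeout.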

Re: NODE_TIMEOUT

Stephen Doud — Thu, 02 Dec 2004 09:26:14 GMT

Suggest:
NODE_TIMEOUT = 8 seconds
HEARTBEAT_INTERVAL = 1 second
(sends up to 8 heartbeat packets before NODE_TIMEOUT expires)

Consider:
Create redundant heartbeat paths:
Review the cluster configuration file and look for STATIONARY_IP. If that entry belongs to an Ethernet NIC, change it to HEARTBEAT_IP.
Then, with the cluster down, perform
# cmapplyconf -C

-StephenD.
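
To find candidates for the STATIONARY_IP review suggested above, something like the following Python sketch could list them from the text of a cluster configuration file. It assumes each STATIONARY_IP or HEARTBEAT_IP line follows the NETWORK_INTERFACE line it belongs to; the sample node name, interface names and addresses are made up for illustration, and the script only reports entries, it does not decide whether a given NIC should carry heartbeat.

```python
def find_stationary_ips(config_text):
    """Return (interface, ip) pairs for every STATIONARY_IP entry found."""
    results = []
    current_interface = None
    for raw_line in config_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        keyword = parts[0].upper()
        if keyword == "NETWORK_INTERFACE" and len(parts) > 1:
            current_interface = parts[1]
        elif keyword == "STATIONARY_IP" and len(parts) > 1:
            results.append((current_interface, parts[1]))
    return results


# Hypothetical excerpt; real files will differ.
SAMPLE = """\
NODE_NAME nodeA
  NETWORK_INTERFACE lan0
    HEARTBEAT_IP 192.0.2.10
  NETWORK_INTERFACE lan1
    STATIONARY_IP 198.51.100.10   # candidate for a second heartbeat path
"""

for interface, ip in find_stationary_ips(SAMPLE):
    print(f"{interface}: STATIONARY_IP {ip}")
```

With the sample text this reports only lan1, the one interface not already carrying heartbeat.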