
SG problem?

 
Boonchu Ngampairoijpibu_1
Occasional Contributor


I get the following batch of messages in syslog every 30 minutes.

Jun 8 08:12:49 cmcld[2549]: Attempting to adjust cluster membership
Jun 8 08:12:52 cmcld[2549]: Obtaining Cluster Lock
Jun 8 08:12:53 cmcld[2549]: Turning off safety time protection since the cluster
Jun 8 08:12:53 cmcld[2549]: now consists of a single node. If ServiceGuard
Jun 8 08:12:53 cmcld[2549]: fails, this node will not automatically halt


This is my configuration. I have one primary LAN, one standby LAN, and a third LAN that is a crossover cable between the two nodes. The primary LAN also carries the heartbeat. I have not configured anything on the crossover LAN.

flags: 12 (single cluster lock)
heartbeat interval: 1.00 (seconds)
node timeout: 2.00 (seconds)
heartbeat connection timeout: 4.00 (seconds)
auto start timeout: 600.00 (seconds)
network polling interval: 2.00 (seconds)

Someone told me to raise the node timeout from 2 seconds to 5 seconds or more. I agree that this would eliminate the syslog messages. However, I would like to configure the crossover LAN to carry a second heartbeat. Is that possible? If so, and I keep the node timeout at 2 seconds and add the crossover LAN as a second heartbeat, would that eliminate the problem?

SG experts, please let me know.

Boonchu Ngampairoijpibul
James R. Ferguson
Acclaimed Contributor

Re: SG problem?

Hi:

The general guideline is to keep NODE_TIMEOUT at, or slightly above, 5 seconds. This gives the 'cmcld' daemon reasonable assurance of getting the processor cycles it needs.

...JRF...
Carsten Krege
Honored Contributor

Re: SG problem?

The widely held view is that NODE_TIMEOUT should be between 5 and 8 seconds. This is a general statement that is independent of your SG configuration. We recommend this because we feel this setting effectively avoids cluster reformations triggered by short hiccups of the system (system hangs that starve SG's main process, cmcld, of CPU) while still guaranteeing a short failover time.

We do not recommend increasing NODE_TIMEOUT beyond 8 seconds, even if the cluster still runs into reformations. In that case it is definitely necessary to identify the root cause of the problem.
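To make the 5 to 8 second guideline easy to check, here is a minimal Python sketch. It assumes the timing values are available as text in the same "name: value (seconds)" form as the listing in the question. The function names and the two range constants are my own for illustration and are not part of ServiceGuard.

```python
import re

# Guideline range taken from this thread, in seconds.
# These are illustrative constants, not ServiceGuard settings.
RECOMMENDED_MIN = 5.0
RECOMMENDED_MAX = 8.0

def parse_timing(text):
    """Return {parameter name: seconds} for lines like 'node timeout: 2.00 (seconds)'."""
    pattern = re.compile(r"^\s*([a-z ]+):\s*([0-9.]+)\s*\(seconds\)\s*$")
    values = {}
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            values[match.group(1).strip()] = float(match.group(2))
    return values

def check_node_timeout(values):
    """Compare the parsed node timeout with the guideline range."""
    timeout = values["node timeout"]
    if timeout < RECOMMENDED_MIN:
        return "too low"
    if timeout > RECOMMENDED_MAX:
        return "too high"
    return "ok"

# The listing from the question; the 'flags' line has no seconds value and is skipped.
sample = """\
flags: 12 (single cluster lock)
heartbeat interval: 1.00 (seconds)
node timeout: 2.00 (seconds)
heartbeat connection timeout: 4.00 (seconds)
auto start timeout: 600.00 (seconds)
network polling interval: 2.00 (seconds)
"""

values = parse_timing(sample)
print(check_node_timeout(values))  # prints "too low"
```

A "too high" result is not a reason to lower the value blindly. It is a hint to look for the root cause instead of raising the timeout further.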

From the messages you posted, we cannot deduce the cause of the cluster reformation. Basically, we deal with two kinds of problem:

1) network problems: cmcld doesn't receive heartbeats and therefore cannot update the safety timer
2) system hangs and related: cmcld doesn't get CPU time to update the safety timer

Adding a private heartbeat network (Yes, crossover cables ARE supported!) helps only with problems in category 1.

If cmcld is not getting CPU time (category 2), the new heartbeat network will not help.
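Since the messages alone do not reveal the category, it can help to list exactly when each reformation started and compare those times with network events or load peaks (cron jobs, backups and so on). The following is only a rough Python sketch. It assumes the syslog line format shown in the question, optionally with a hostname before "cmcld". The function names and the `year` parameter are my own, because syslog timestamps carry no year. The third sample line is invented to show the gap calculation.

```python
import re
from datetime import datetime

MARKER = "Attempting to adjust cluster membership"

# Month, day, time, optional hostname, then the cmcld tag and the message text.
LINE = re.compile(
    r"^(\w{3})\s+(\d+)\s+(\d{2}:\d{2}:\d{2})\s+(?:\S+\s+)?cmcld\[\d+\]:\s+(.*)$"
)

def reformation_times(lines, year=2000):
    """Return the timestamps of lines that mark the start of a reformation.

    `year` is an assumption supplied by the caller; syslog does not record it.
    """
    times = []
    for line in lines:
        match = LINE.match(line.strip())
        if match and MARKER in match.group(4):
            stamp = f"{year} {match.group(1)} {match.group(2)} {match.group(3)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times

def gaps_in_minutes(times):
    """Return the minutes between consecutive reformation starts."""
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]

sample = [
    "Jun 8 08:12:49 cmcld[2549]: Attempting to adjust cluster membership",
    "Jun 8 08:12:52 cmcld[2549]: Obtaining Cluster Lock",
    # Invented line, 30 minutes later, for illustration only.
    "Jun 8 08:42:49 cmcld[2549]: Attempting to adjust cluster membership",
]

print(gaps_in_minutes(reformation_times(sample)))  # prints [30.0]
```

A very regular gap, like the 30 minutes reported here, points towards something scheduled on the system or the network rather than a random glitch, but that is a hint and not proof.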

My recommendation is to do both: add the crossover LAN and increase NODE_TIMEOUT.

Carsten
-------------------------------------------------------------------------------------------------
In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- HhGttG