
kapsoo kim
Occasional Contributor

MC/SG messages..

Hi all,
I got the following messages from my systems:

=========SYSTEM1=========
cmcld: Communication to node SYSTEM2 has been interrupted
cmcld: node SYSTEM2 may have died
cmcld: Attempting to form a new cluster
cmcld: timers delayed 295.18 seconds
cmcld: Warning: cmcld process was unable to run for the last 295 seconds
cmcld: Resumed updating safety time
cmcld: 2 nodes have formed a new cluster, sequence #2
cmcld: timers delayed 295.18 seconds
cmcld: The new active cluster membership is: SYSTEM2(id=1), SYSTEM1(id=2)

=========SYSTEM2=========
cmcld: Timed out node SYSTEM1. It may have failed.
cmcld: Attempting to adjust cluster membership
cmcld: Clearing Cluster Lock
cmcld: Resumed updating safety time
cmcld: 2 nodes have formed a new cluster, sequence #2
cmcld: The new active cluster membership is: SYSTEM2(id=1), SYSTEM1(id=2)


What happened to the systems?
Thank you.
John Poff
Honored Contributor
Solution

Re: MC/SG messages..

Hello,

My guess is that the cmcld daemon on SYSTEM1 got blocked and couldn't run for 295 seconds, which is a very long time! Possibly SYSTEM1 was extremely busy for a few minutes? The other system didn't attempt to take over the cluster, so it must have been seeing the heartbeat packets from the first system.

HP does recommend raising the node timeout for the cluster to 6 to 8 seconds; the default is (or used to be) 2 seconds. I'm not sure that would have made a difference in this case. I used to have lots of problems with a three-node cluster reforming many times each day, and raising the node timeout value solved that problem. I don't recall ever seeing the warning about cmcld being unable to run, though.

I'd check SYSTEM1 and try to figure out what it was doing during the time that the cmcld daemon was not responding.
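
To narrow down when the daemon stalled, it can help to pull the "timers delayed" messages out of the log and line them up against whatever else the system was doing. The following is only a rough sketch: the message format is taken from the lines posted above, while the function name `find_cmcld_stalls` and the 2-second threshold are hypothetical choices made for illustration.

```python
import re

# Matches lines such as "cmcld: timers delayed 295.18 seconds",
# the format shown in the messages posted above.
DELAY_RE = re.compile(r"cmcld: timers delayed (\d+(?:\.\d+)?) seconds")


def find_cmcld_stalls(lines, threshold_seconds=2.0):
    """Return (line_number, delay_seconds) for each delay above the threshold.

    The threshold is a hypothetical default, not a Serviceguard setting.
    """
    stalls = []
    for number, line in enumerate(lines, start=1):
        match = DELAY_RE.search(line)
        if match:
            delay = float(match.group(1))
            if delay > threshold_seconds:
                stalls.append((number, delay))
    return stalls


sample = [
    "cmcld: Attempting to form a new cluster",
    "cmcld: timers delayed 295.18 seconds",
    "cmcld: Warning: cmcld process was unable to run for the last 295 seconds",
]
stalls = find_cmcld_stalls(sample)
print(stalls)
```

On a real system you would feed it the lines of the syslog file (on HP-UX that is usually /var/adm/syslog/syslog.log, but check your own setup) and compare the reported line numbers with the timestamps around them.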

JP
melvyn burnard
Honored Contributor

Re: MC/SG messages..

You do not say which version of ServiceGuard you are using, but I would recommend making sure the latest patch for that version is installed on your systems.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Carsten Krege
Honored Contributor

Re: MC/SG messages..

You were definitely dealing with a system hang on SYSTEM1. You need to contact your local HP Response Center to check your patch level. It is likely that some important kernel patches are outdated or missing on your machine. It is also possible to TOC (Transfer of Control) the machine while it is hung and to analyze the resulting dump afterwards. In the majority of cases we are able to deduce the root cause from a hung system's crash dump.

From the timing I see in the syslog, it also appears that the NODE_TIMEOUT parameter is set much higher than the recommended value (5-8 seconds). When the patches are applied you should also change this parameter.
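
As a quick sanity check, the configured value can be compared against that range. This is a minimal sketch under stated assumptions: it assumes NODE_TIMEOUT appears in the cluster ASCII configuration file as a plain integer in microseconds (verify the unit for your ServiceGuard version), the sample excerpt is made up, and the helper names `node_timeout_seconds` and `within_recommended` are hypothetical.

```python
import re

# Recommended range in seconds, as quoted in this thread.
RECOMMENDED_SECONDS = (5.0, 8.0)


def node_timeout_seconds(config_text):
    """Return NODE_TIMEOUT in seconds, or None if the parameter is absent.

    Assumption: the value in the file is an integer number of microseconds.
    """
    for line in config_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        stripped = line.split("#", 1)[0].strip()
        match = re.match(r"NODE_TIMEOUT\s+(\d+)$", stripped)
        if match:
            return int(match.group(1)) / 1_000_000
    return None


def within_recommended(seconds, bounds=RECOMMENDED_SECONDS):
    """True if the timeout lies inside the recommended range (inclusive)."""
    low, high = bounds
    return low <= seconds <= high


# Hypothetical excerpt, not taken from the poster's cluster.
sample_config = """
# hypothetical excerpt
HEARTBEAT_INTERVAL 1000000
NODE_TIMEOUT 2000000
"""
timeout = node_timeout_seconds(sample_config)
print(timeout, within_recommended(timeout))
```

The sample uses the old 2-second default, so the check reports it as outside the range; a value far above 8 seconds, as suspected here, would be flagged the same way.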

Carsten
-------------------------------------------------------------------------------------------------
In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- HhGttG