MC/SG messages..

kapsoo kim · ‎06-19-2001

Hi all,
I got the following messages from my systems:

=========SYSTEM1=========
cmcld: Communication to node SYSTEM2 has been interrupted
cmcld: node SYSTEM2 may have died
cmcld: Attempting to form a new cluster
cmcld: timers delayed 295.18 seconds
cmcld: Warning: cmcld process was unable to run for the last 295 seconds
cmcld: Resumed updating safety time
cmcld: 2 nodes have formed a new cluster, sequence #2
cmcld: timers delayed 295.18 seconds
cmcld: The new active cluster membership is: SYSTEM2(id=1), SYSTEM1(id=2)

=========SYSTEM2=========
cmcld: Timed out node SYSTEM1. It may have failed.
cmcld: Attempting to adjust cluster membership
cmcld: Clearing Cluster Lock
cmcld: Resumed updating safety time
cmcld: 2 nodes have formed a new cluster, sequence #2
cmcld: The new active cluster membership is: SYSTEM2(id=1), SYSTEM1(id=2)

What happened to the systems?
Thank you.

John Poff · ‎06-19-2001

Hello,

My guess is that the cmcld daemon on SYSTEM1 got blocked and couldn't run for 295 seconds, which is a very long time! Possibly SYSTEM1 was extremely busy for a few minutes? The other system didn't attempt to take over the cluster, so it must have been seeing the heartbeat packets from the first system.

HP does recommend raising the node timeout for the cluster to 6 to 8 seconds; the default is (or used to be) 2 seconds. I'm not sure that would have made a difference in this case. I used to have lots of problems with a three-node cluster reforming many times each day, but raising the node timeout value solved that problem. I don't recall seeing the error about cmcld being unable to run, though.

I'd check SYSTEM1 and try to figure out what it was doing during the time that the cmcld daemon was not responding.
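As a starting point for that check, here is a minimal sketch that scans log lines for the cmcld stall reports quoted above, so you can line them up against whatever else SYSTEM1 was doing. The patterns are copied from the messages in this thread; the function name `find_cmcld_stalls` and the 2-second threshold are just illustrative assumptions.

```python
import re

# Patterns based on the cmcld messages quoted in this thread.
DELAY_RE = re.compile(r"cmcld: timers delayed (\d+(?:\.\d+)?) seconds")
STALL_RE = re.compile(r"cmcld: .*unable to run for the last (\d+) seconds")


def find_cmcld_stalls(lines, threshold=2.0):
    """Return (line_number, seconds) for each cmcld stall report >= threshold.

    The threshold is a hypothetical default, not a ServiceGuard value.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = DELAY_RE.search(line) or STALL_RE.search(line)
        if match and float(match.group(1)) >= threshold:
            hits.append((number, float(match.group(1))))
    return hits


sample = [
    "cmcld: Communication to node SYSTEM2 has been interrupted",
    "cmcld: timers delayed 295.18 seconds",
    "cmcld: Warning: cmcld process was unable to run for the last 295 seconds",
    "cmcld: Resumed updating safety time",
]

stalls = find_cmcld_stalls(sample)
for number, seconds in stalls:
    print(f"line {number}: cmcld stalled for about {seconds:.2f} s")
```

To run it against a real log, pass an open file object instead of `sample`. On HP-UX the syslog is usually /var/adm/syslog/syslog.log, but check where yours is written.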

JP

melvyn burnard · ‎06-19-2001

You do not say which version of ServiceGuard you are using, but I would recommend making sure the latest patch for that version is installed on your systems.

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Carsten Krege · ‎06-19-2001

You definitely had a system hang on SYSTEM1. You need to contact your local HP Response Center to check your patch level. It is likely that some important kernel patches are outdated or missing on your machine. It is also possible to TOC (Transfer of Control) the machine while it is hung and analyze the resulting dump afterwards. In the majority of cases we are able to deduce the root cause from a hung system's crash dump.

From the timing I see in the syslog, it also appears that the NODE_TIMEOUT parameter is set much higher than the recommended value (5-8 seconds). Once the patches are applied, you should change this setting as well.
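For the timeout change, a small sketch of the arithmetic may help. It assumes that the cluster ASCII configuration file expresses NODE_TIMEOUT in microseconds (please verify this against the documentation for your ServiceGuard version), and it uses the 5-8 second range mentioned in this thread. The helper names are hypothetical.

```python
# Assumption: NODE_TIMEOUT in the cluster ASCII file is given in microseconds.
USEC_PER_SECOND = 1_000_000


def node_timeout_line(seconds):
    """Build a NODE_TIMEOUT line for the given timeout in seconds."""
    return f"NODE_TIMEOUT {int(round(seconds * USEC_PER_SECOND))}"


def in_recommended_range(seconds, low=5, high=8):
    """True if the timeout falls in the 5-8 s range suggested in this thread."""
    return low <= seconds <= high


for candidate in (2, 6, 8):
    note = "ok" if in_recommended_range(candidate) else "outside 5-8 s"
    print(f"{node_timeout_line(candidate)}  # {candidate} s, {note}")
```

The edited file would then normally go through cmcheckconf and cmapplyconf; see the man pages for the exact options on your release.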

Carsten

-------------------------------------------------------------------------------------------------
In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- HhGttG
