mcs fails on night time

 
bhavin asokan
Honored Contributor

Hi all,

On my servers, MC/ServiceGuard (MCS) has failed at night for the last two days. The syslog entries are below. Please tell me what to do.

first server
---------------
Jul 15 03:25:28 ram cmcld: timers delayed 3.24 seconds
Jul 15 03:25:28 ram cmcld: Warning: cmcld process was unable to run for the last 3 seconds
Jul 20 00:27:51 ram cmcld: timers delayed 2.74 seconds
Jul 21 02:03:23 ram cmcld: Timed out node laxman. It may have failed.
Jul 21 02:03:23 ram cmcld: Attempting to adjust cluster membership
Jul 21 02:03:32 ram cmcld: Obtaining First Dual Cluster Lock
Jul 21 02:03:33 ram cmcld: Obtaining Second Dual Cluster Lock
Jul 21 02:03:33 ram cmcld: Communication attempt to node laxman did not succeed
Jul 21 02:03:34 ram cmcld: Turning off safety time protection since the cluster
Jul 21 02:03:34 ram cmcld: may now consist of a single node. If ServiceGuard
Jul 21 02:03:34 ram cmcld: fails, this node will not automatically halt
Jul 21 02:03:49 ram cmcld: Enabling safety time protection
Jul 21 02:03:49 ram cmcld: Attempting to adjust cluster membership
Jul 21 02:03:49 ram cmcld: Clearing First Dual Cluster Lock
Jul 21 02:03:50 ram cmcld: Clearing Second Dual Cluster Lock
Jul 21 02:03:51 ram cmcld: Resumed updating safety time
Jul 21 02:03:51 ram cmcld: 2 nodes have formed a new cluster, sequence #3
Jul 21 02:03:51 ram cmcld: The new active cluster membership is: ram(id=1), laxman(id=2)
Jul 22 03:29:52 ram cmcld: Warning: cmcld process was unable to run for the last 16 seconds,
Jul 22 03:29:52 ram cmcld: which is longer than the node timeout (8 seconds)
Jul 22 03:29:52 ram cmcld: Timed out node laxman. It may have failed.
Jul 22 03:29:52 ram cmcld: Attempting to adjust cluster membership
Jul 22 03:30:01 ram cmcld: Obtaining First Dual Cluster Lock
Jul 22 03:30:02 ram cmcld: First Cluster lock was denied. Lock was obtained by another node.
Jul 22 03:30:02 ram cmcld: Attempting to form a new cluster
Jul 22 03:30:03 ram cmcld: Resumed updating safety time
Jul 22 03:30:04 ram cmcld: 2 nodes have formed a new cluster, sequence #5
Jul 22 03:30:04 ram cmcld: The new active cluster membership is: laxman(id=2), ram(id=1)
Jul 22 03:30:02 ram cmcld: Attempting to adjust cluster membership


second server
---------------

Jul 21 02:04:58 laxman cmcld: Communication to node ram has been interrupted
Jul 21 02:04:58 laxman cmcld: Node ram may have died
Jul 21 02:04:58 laxman cmcld: Attempting to form a new cluster
Jul 21 02:05:07 laxman cmcld: Obtaining First Dual Cluster Lock
Jul 21 02:05:08 laxman cmcld: First Cluster lock was denied. Lock was obtained by another node.
Jul 21 02:05:24 laxman cmcld: Heartbeat connection attempt to node ram timed out
Jul 21 02:05:24 laxman cmcld: Attempting to adjust cluster membership
Jul 21 02:05:25 laxman cmcld: Resumed updating safety time
Jul 21 02:05:26 laxman cmcld: 2 nodes have formed a new cluster, sequence #3
Jul 21 02:05:26 laxman cmcld: The new active cluster membership is: ram(id=1), laxman(id=2)
Jul 21 02:05:24 laxman cmcld: Attempting to form a new cluster
Jul 22 03:31:21 laxman cmcld: Timed out node ram. It may have failed.
Jul 22 03:31:21 laxman cmcld: Attempting to form a new cluster
Jul 22 03:31:30 laxman cmcld: Obtaining First Dual Cluster Lock
Jul 22 03:31:31 laxman cmcld: Obtaining Second Dual Cluster Lock
Jul 22 03:31:32 laxman cmcld: Turning off safety time protection since the cluster
Jul 22 03:31:32 laxman cmcld: may now consist of a single node. If ServiceGuard
Jul 22 03:31:32 laxman cmcld: fails, this node will not automatically halt
Jul 22 03:31:39 laxman cmcld: Enabling safety time protection
Jul 22 03:31:39 laxman cmcld: Attempting to adjust cluster membership
Jul 22 03:31:39 laxman cmcld: Clearing First Dual Cluster Lock
Jul 22 03:31:40 laxman cmcld: Clearing Second Dual Cluster Lock
Jul 22 03:31:41 laxman cmcld: Resumed updating safety time
Jul 22 03:31:41 laxman cmcld: 2 nodes have formed a new cluster, sequence #5
Jul 22 03:31:41 laxman cmcld: The new active cluster membership is: laxman(id=2), ram(id=1)

Please tell me how I can trace the problem.

Regards,


5 REPLIES
RAC_1
Honored Contributor

Re: mcs fails on night time

I strongly suspect that you have problems with heartbeat communication between the nodes. It seems that in your case both nodes
(ram and laxman; where is sita, then?)
are not able to talk to each other, and ram is trying to get the cluster lock disk and failing.

Finally laxman gets the lock disk and the cluster is formed.

Check whether you have any problems with the heartbeat.
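
One way to start is to pull only the heartbeat-related cmcld messages out of syslog on each node and compare the timestamps side by side. A minimal sketch, assuming the log lives at /var/adm/syslog/syslog.log (adjust if yours differs). The function name cmcld_heartbeat_events is made up for this example, and the patterns cover only the message wording quoted in this thread:

```shell
# Print cmcld lines that point at lost heartbeat or node timeouts.
# $1: path to a syslog file (assumed: /var/adm/syslog/syslog.log on HP-UX).
# The patterns are taken from the messages quoted in this thread only.
cmcld_heartbeat_events() {
    grep 'cmcld:' "$1" |
        grep -E 'Timed out node|Communication (to|attempt to) node|Heartbeat connection attempt'
}
```

Run it as "cmcld_heartbeat_events /var/adm/syslog/syslog.log" on both ram and laxman. If the events line up with a fixed time each night, look at what else runs then (backups, network maintenance).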

Anil
There is no substitute to HARDWORK
Stephen Doud
Honored Contributor
Solution

Re: mcs fails on night time

This issue has been patched:
11.09 - PHSS_23511
11.12 - PHSS_23373
10.12 - PHSS_22870

What version of SG does your system run?

Use this to determine it:
# what /usr/lbin/cmcld | grep Date

-sd-
bhavin asokan
Honored Contributor

Re: mcs fails on night time

I am running A.11.14.

Regards
Steve Lewis
Honored Contributor

Re: mcs fails on night time

It may also be caused by non-redundant communication between nodes: with only one heartbeat interface, you effectively have a single point of failure. When that LAN goes down, so does your cluster.

I once saw a problem where the network provider kept taking lines down in the middle of the night and not telling anyone about it.

If the heartbeat runs across only one LAN interface, add it to a second interface in your cluster ASCII file, then re-check and apply the cluster configuration.

If you have only one LAN interface, then that is a SPOF and it is pointless running ServiceGuard anyway, unless you use a serial crossover heartbeat.
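
A quick way to see whether each node has more than one heartbeat interface is to count the HEARTBEAT_IP entries per NODE_NAME in the cluster ASCII file. This is only a sketch: count_heartbeat_ips is a hypothetical helper name, and it assumes the usual layout where HEARTBEAT_IP lines follow the NODE_NAME line they belong to:

```shell
# Print "<node> <number of HEARTBEAT_IP entries>" for each node
# found in a cluster ASCII file.
# $1: path to the cluster ASCII file (for example cluster.ascii).
count_heartbeat_ips() {
    awk '
        # A NODE_NAME line starts a new node section.
        $1 == "NODE_NAME" { node = $2; order[++n] = node; count[node] = 0 }
        # Count heartbeat addresses that belong to the current node.
        $1 == "HEARTBEAT_IP" && node != "" { count[node]++ }
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    ' "$1"
}
```

Running "count_heartbeat_ips cluster.ascii" prints one line per node. Any node showing a count of 1 has its heartbeat on a single interface.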
Stephen Doud
Honored Contributor

Re: mcs fails on night time

PHSS_29561 was the first patch for A.11.14 that addressed the "timers delayed" issue:

14. The cmcld daemon may log the message "timers delayed
x.x seconds" due to kernel latency issues, or a network
partition may separate nodes in the cluster. A
ServiceGuard cluster of more than 2 nodes with a
cluster lock, after experiencing such a hang or
partition, may result in the formation of 2 clusters.
This is a corner case where the hang or partition
happens while a node is joining a previously formed 2-
node cluster. The joining node forms a cluster with
the original coordinator node, while the
non-coordinator node forms a cluster by itself.

The current version of the patch is PHSS_31015.
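
To see how often cmcld was stalled and for how long, the "timers delayed" and "unable to run" messages can be listed with their durations and compared against the node timeout. A rough sketch: cmcld_stalls is a hypothetical helper name, the patterns match only the message wording quoted in this thread, and the limit is whatever node timeout you pass in (the Jul 22 syslog entry reports 8 seconds):

```shell
# List cmcld stall messages with their duration and flag the ones
# that are longer than the node timeout.
# $1: syslog file, $2: node timeout in seconds.
# Output columns: month day time host seconds ok|EXCEEDS
cmcld_stalls() {
    awk -v limit="$2" '
        /cmcld: (timers delayed|Warning: cmcld process was unable to run)/ {
            # The duration is the field just before the word "seconds".
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^seconds/) {
                    flag = ($(i - 1) + 0 > limit + 0) ? "EXCEEDS" : "ok"
                    print $1, $2, $3, $4, $(i - 1), flag
                    break
                }
            }
        }
    ' "$1"
}
```

For example, "cmcld_stalls /var/adm/syslog/syslog.log 8" (the path is an assumption about where syslog lives). Stalls flagged EXCEEDS are long enough for the other node to time this one out even if the network is healthy.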

The problem might also be due to a default NODE_TIMEOUT value.
Use this command to determine what the NODE_TIMEOUT value is set to:
# cmviewconf | grep "node timeout"

If it is 2 seconds, adjust it to 6-8 seconds!
You can do that by editing the cluster configuration ASCII file, normally saved by the admin on one of the servers in /etc/cmcluster. If you can't find it, you can reconstitute it from the cluster binary by using this command:

# cmgetconf cluster.ascii

Once it's recreated, check its validity against the current cluster environment:

# cmcheckconf -C cluster.ascii

If this succeeds, edit the NODE_TIMEOUT value up to 6-8 seconds, then halt the cluster when you have the opportunity (cmhaltcl -f) and update the cluster binary:
# cmapplyconf -f -C cluster.ascii
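
If you prefer not to hand-edit the file, the NODE_TIMEOUT line can be rewritten with a small helper. This is a sketch under assumptions, not a tested procedure: set_node_timeout is a made-up name, and it assumes NODE_TIMEOUT in the ASCII file is expressed in microseconds (2000000 for 2 seconds). Check the comment above the parameter in your own file to confirm the unit before applying anything.

```shell
# Write a copy of a cluster ASCII file to stdout with NODE_TIMEOUT
# set to the given number of whole seconds.
# Assumption: the file stores NODE_TIMEOUT in microseconds.
# $1: cluster ASCII file, $2: timeout in seconds.
set_node_timeout() {
    usec=$(( $2 * 1000000 ))
    awk -v usec="$usec" '
        # Replace only the active parameter line; commented lines start with "#".
        $1 == "NODE_TIMEOUT" { print "NODE_TIMEOUT\t\t" usec; next }
        { print }
    ' "$1"
}
```

For example, "set_node_timeout cluster.ascii 8 > cluster.ascii.new", then compare the two files with diff and run the cmcheckconf and cmapplyconf steps above against the new file. Writing to a separate file keeps the original untouched in case the edit is wrong.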

-StephenD.