
Serviceguard and cmcld problem

 
Colin D. Carruthers
Occasional Advisor


Hello All,

I haven't posted on here before, so please bear with me if I miss something important.
On an A500 running 11.11 with SG A.11.14 and PHSS_27246, we received the following messages in syslog.log, followed by a crash or TOC and a successful failover to serv8. I'm puzzled by the initial crash/TOC on serv7.

Oct 13 19:45:45 serv7 cmcld: Warning: cmcld process was unable to run for the last 5 seconds
Oct 13 19:46:11 serv7 cmcld: Warning: cmcld process was unable to run for the last 22 seconds,
Oct 13 19:46:11 serv7 cmcld: which is longer than the node timeout (10 seconds)
Oct 13 19:46:11 serv7 cmcld: Communication to node serv8 has been interrupted
Oct 13 19:46:11 serv7 cmcld: Node serv8 may have died
Oct 13 19:46:11 serv7 cmcld: Attempting to form a new cluster
Oct 13 19:46:17 serv7 cmcld: Attempting to adjust cluster membership
Oct 13 19:46:22 serv7 cmcld: Warning: cmcld process was unable to run for the last 4 seconds
Oct 13 19:46:13 serv7 cmcld: Communication to node serv8 has been interrupted
Oct 13 19:46:22 serv7 cmcld: Resumed updating safety time
Oct 13 19:46:13 serv7 cmcld: Attempting to form a new cluster
Oct 13 19:46:22 serv7 cmcld: 2 nodes have formed a new cluster, sequence #15
Oct 13 19:46:22 serv7 cmcld: The new active cluster membership is: serv8(id=2), serv7(id=1)
Oct 13 19:46:39 serv7 cmcld: Warning: cmcld process was unable to run for the last 3 seconds
Oct 13 19:51:09 serv7 automountd[858]: caenfs1:/export/admin/misc/scripts server not responding: RPC: Timed out
Oct 13 19:49:26 serv7 cmcld: Warning: cmcld process was unable to run for the last 3 seconds
Oct 13 19:46:32 serv7 cmcld: Warning: cmcld process was unable to run for the last 4 seconds
Oct 13 19:51:40 serv7 above message repeats 2 times
Oct 13 19:52:35 serv7 cmcld: Warning: cmcld process was unable to run for the last 3 seconds
Oct 13 20:02:57 serv7 cmcld: Warning: cmcld process was unable to run for the last 25 seconds,
Oct 13 20:02:57 serv7 cmcld: which is longer than the node timeout (10 seconds)
Oct 13 20:02:57 serv7 cmcld: WARNING: In the last hour, the ServiceGuard daemon
Oct 13 20:02:57 serv7 cmcld: experienced 3 short OS hangs of 5 or more seconds.
Oct 13 20:02:57 serv7 cmcld: Multiple short hangs or a longer single hang could
Oct 13 20:02:58 serv7 cmcld: lead to a system TOC.
Oct 13 20:02:58 serv7 cmcld: Communication to node serv8 has been interrupted
Oct 13 20:02:58 serv7 cmcld: Node serv8 may have died
Oct 13 20:02:58 serv7 cmcld: Attempting to form a new cluster
Oct 13 20:03:02 serv7 cmcld: Attempting to adjust cluster membership
Oct 13 20:03:04 serv7 cmcld: Resumed updating safety time
Oct 13 20:03:04 serv7 cmcld: 2 nodes have formed a new cluster, sequence #17
Oct 13 20:03:04 serv7 cmcld: The new active cluster membership is: serv8(id=2), serv7(id=1)
Oct 13 20:02:59 serv7 cmcld: Communication to node serv8 has been interrupted
Oct 13 20:03:09 serv7 cmcld: Warning: cmcld process was unable to run for the last 3 seconds
Oct 13 20:06:57 serv7 cmcld: Warning: cmcld process was unable to run for the last 6 seconds
Oct 13 20:07:17 serv7 cmcld: Warning: cmcld process was unable to run for the last 17 seconds,
Oct 13 20:02:59 serv7 cmcld: Attempting to form a new cluster
Oct 13 20:07:17 serv7 cmcld: which is longer than the node timeout (10 seconds)
Oct 13 20:07:17 serv7 cmcld: Communication to node serv8 has been interrupted
Oct 13 20:07:17 serv7 cmcld: Node serv8 may have died
Oct 13 20:07:17 serv7 cmcld: Attempting to form a new cluster
Oct 13 20:07:27 serv7 cmcld: Warning: cmcld process was unable to run for the last 8 seconds
Oct 13 20:07:27 serv7 cmcld: WARNING: In the last hour, the ServiceGuard daemon
Oct 13 20:07:27 serv7 cmcld: experienced 3 short OS hangs of 5 or more seconds.
Oct 13 20:07:27 serv7 cmcld: Multiple short hangs or a longer single hang could
Oct 13 20:07:27 serv7 cmcld: lead to a system TOC.
Oct 13 20:07:28 serv7 cmcld: Resumed updating safety time
Oct 13 20:07:30 serv7 cmcld: 2 nodes have formed a new cluster, sequence #18
Oct 13 20:07:30 serv7 cmcld: The new active cluster membership is: serv8(id=2), serv7(id=1)
Oct 13 20:07:41 serv7 xntpd[9676]: Previous time adjustment incomplete; residual -0.000002 sec
Oct 13 20:07:57 serv7 xntpd[9676]: Previous time adjustment incomplete; residual -0.000005 sec
Oct 13 20:10:00 serv7 cmcld: Warning: cmcld process was unable to run for the last 3 seconds
Oct 13 20:11:02 serv7 cmcld: Warning: cmcld process was unable to run for the last 14 seconds,
Oct 13 20:11:02 serv7 cmcld: which is longer than the node timeout (10 seconds)
Oct 13 20:11:02 serv7 cmcld: Communication to node serv8 has been interrupted
Oct 13 20:11:02 serv7 cmcld: Node serv8 may have died
Oct 13 20:11:02 serv7 cmcld: Attempting to form a new cluster
Oct 13 20:11:04 serv7 cmcld: Resumed updating safety time
Oct 13 20:11:08 serv7 cmcld: 2 nodes have formed a new cluster, sequence #19
Oct 13 20:11:08 serv7 cmcld: The new active cluster membership is: serv8(id=2), serv7(id=1)
Oct 13 20:11:44 serv7 cmcld: Warning: cmcld process was unable to run for the last 32 seconds,
Oct 13 20:11:44 serv7 cmcld: which is longer than the node timeout (10 seconds)
Oct 13 20:11:44 serv7 cmcld: Communication to node serv8 has been interrupted
Oct 13 20:11:44 serv7 cmcld: Node serv8 may have died
And that is the last entry in syslog.log!
Any suggestions gratefully received.

4 REPLIES
Mark Grant
Honored Contributor

Re: Serviceguard and cmcld problem

Looks to me like you might be experiencing some trouble on your heartbeat. It would be worth investigating whether there is actually a fault or, if you use your primary LAN for heartbeat, whether it is overloaded.
Never precede any demonstration with anything more predictive than "watch this"
melvyn burnard
Honored Contributor
Solution

Re: Serviceguard and cmcld problem

You appear to be experiencing what are commonly known as mini-hangs in the system, resulting in the cmcld process not being able to run within the required timeframe.
You would need to look at what was going on around this time, and the best method would be to log a call with your HP Response Centre and get your patching levels checked, as well as the dump analyzed.
One other thing that may influence this is whether you have a single CPU or dual CPUs, and the amount of memory and/or buffer cache in use.
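
To see how often cmcld was held off and for how long, the warnings can be pulled out of syslog.log with a few lines of script. The following is only a rough sketch, not an HP tool: the function name `find_hangs` and the `node_timeout` default of 10 seconds are assumptions taken from the log excerpt in the original post, and the pattern matches only the warning text shown there.

```python
import re

# Matches the cmcld warning text as it appears in the syslog excerpt above.
HANG_RE = re.compile(r"cmcld process was unable to run for the last (\d+) seconds")


def find_hangs(lines, node_timeout=10):
    """Return (line, seconds, exceeded) for every cmcld hang warning.

    node_timeout is an assumed value, in seconds, taken from the
    "node timeout (10 seconds)" message in the posted log.
    """
    hangs = []
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            seconds = int(match.group(1))
            hangs.append((line.strip(), seconds, seconds > node_timeout))
    return hangs


# A few lines copied from the log in the original post.
sample = [
    "Oct 13 19:45:45 serv7 cmcld: Warning: cmcld process was unable to run for the last 5 seconds",
    "Oct 13 19:46:11 serv7 cmcld: Warning: cmcld process was unable to run for the last 22 seconds,",
    "Oct 13 19:46:11 serv7 cmcld: Communication to node serv8 has been interrupted",
]

for line, seconds, exceeded in find_hangs(sample):
    status = "longer than node timeout" if exceeded else "short hang"
    # The first 15 characters of a syslog line hold the timestamp.
    print(line[:15], seconds, status)
```

To run it against a real file, pass an open file handle instead of `sample`, for example `find_hangs(open("syslog.log"))`; the path is up to you.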
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Kent Ostby
Honored Contributor

Re: Serviceguard and cmcld problem

Colin -- Thanks for the post.

There are several possibilities for this type of problem.

In a lot of cases like this, you need to contact HP to get a special troubleshooting program called "timer9", which can detect "mini-hangs" on a system (short hangs that wouldn't necessarily be noticed by users but which hold off cmcld long enough to cause trouble).

There have been some cases where vhand needed to be patched, others where there was a machine on the network generating a storm of network requests.

Short of asking you to patch vhand, ARPA, LAN, Streams, SCSI, and LVM, I'd suggest that you open a support call with HP.

Since you have a TOC dump, it will allow the engineers to look at what was happening on the system.

They will also probably want the OLDsyslog.log file on the machine that died, syslog.log on the machine that lived, and the /tmp/scancl.out file, which is generated by running "cmscancl" on either node.

Hope this helps,

Best regards,

Kent Ostby
"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"
Colin D. Carruthers
Occasional Advisor

Re: Serviceguard and cmcld problem

Thanks for the replies, everyone; they were very helpful. We will raise it as a call with the Response Centre. You all suggested a problem with performance or the heartbeat. That server does have a backup heartbeat LAN, so we do think there is a performance issue. Thank you again for your replies.