
cmhaltpkg hangs the system

 
SOLVED
moonchild
Regular Advisor

Two-node cluster running HP-UX 11.11.
We were trying to fail over a package to node2, but the command hung, so we had to reboot the system.
After the reboot, failing over the package works fine.

#cmhaltpkg -v pkg1

we noticed the following in syslog:
Feb 16 14:30:07 node1 cmcld: Request from node node2 to halt package pkg1 on node node1.
Feb 16 14:30:07 node1 cmcld: Executing '/etc/cmcluster/pkg1/pkg1.cntl stop' for package pkg1, as service PKG*54273.
Feb 16 14:30:07 node1 cmsrvassistd[20952]: Unable to communicate with ServiceGuard main daemon (cmcld): Can't assign requested address
Feb 16 14:30:07 node1 cmcld: Service PKG*54273 terminated due to an exit(118).
Feb 16 14:30:07 node1 cmcld: Halted package pkg1 on node node1.
Feb 16 14:30:07 node1 cmcld: Package pkg1 halt script exited abnormally.
Feb 16 14:30:07 node1 cmcld: Examine the file /etc/cmcluster/pkg1/pkg1.cntl.log for more details.
Feb 16 14:30:07 node1 cmcld: Switching disabled on package pkg1.
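
For anyone scanning a longer syslog for the same pattern, here is a minimal sketch that picks out the Serviceguard daemon lines that look like trouble. The keyword list and the function name `suspicious_entries` are assumptions made for illustration, not anything Serviceguard provides; the sample lines are copied from the excerpt above.

```python
import re

# Sample lines copied from the syslog excerpt in this post.
SYSLOG = """\
Feb 16 14:30:07 node1 cmcld: Request from node node2 to halt package pkg1 on node node1.
Feb 16 14:30:07 node1 cmsrvassistd[20952]: Unable to communicate with ServiceGuard main daemon (cmcld): Can't assign requested address
Feb 16 14:30:07 node1 cmcld: Service PKG*54273 terminated due to an exit(118).
Feb 16 14:30:07 node1 cmcld: Package pkg1 halt script exited abnormally.
"""

# Fields: timestamp, host, daemon name, optional [pid], message text.
LINE_RE = re.compile(r"^(\w{3}\s+\d+ [\d:]{8}) (\S+) (\w+)(?:\[(\d+)\])?: (.*)$")

# Hypothetical keyword list (an assumption); extend it as needed.
SUSPICIOUS = (
    "Unable to communicate",
    "terminated due to an exit",
    "exited abnormally",
)


def suspicious_entries(text):
    """Return (daemon, message) pairs for cm* daemon lines matching a keyword."""
    hits = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not in the expected syslog shape
        _timestamp, _host, daemon, _pid, message = match.groups()
        if daemon.startswith("cm") and any(k in message for k in SUSPICIOUS):
            hits.append((daemon, message))
    return hits


if __name__ == "__main__":
    for daemon, message in suspicious_entries(SYSLOG):
        print(daemon, "->", message)
```

In this excerpt the first hit is the cmsrvassistd line, which is the earliest sign that something was wrong with cmcld before the halt script failed.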

After rebooting, the system is up and running fine, and cmhaltpkg fails the package over successfully.

Any ideas why the hang happened?

thank you in advance
10 REPLIES
Prashanth Waugh
Esteemed Contributor
Solution

Re: cmhaltpkg hangs the system

Hi,
The cmcld daemon sets a safety timer in the kernel, which is used to detect kernel hangs. If cmcld does not reset this timer periodically, the kernel forces a system TOC (Transfer of Control), which means a CPU reset. This can happen because cmcld could not communicate with the majority of the cluster's members, or because cmcld exited unexpectedly, aborted, or was unable to run for a significant amount of time and so could not update the kernel timer, which indicates a kernel hang. Before a TOC caused by expiration of the safety timer, messages are written to the syslog file and the kernel's message buffer.
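
To make the mechanism concrete, here is a toy simulation of a watchdog-style safety timer. It is only a sketch of the general idea: the class `SafetyTimer`, the function `simulate`, and the tick counts are all made up for illustration and are not Serviceguard's or the HP-UX kernel's actual code.

```python
class SafetyTimer:
    """Toy model of a kernel safety timer (illustrative only)."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.expired = False

    def reset(self):
        # What a healthy daemon does periodically: push the deadline out again.
        if not self.expired:
            self.remaining = self.timeout_ticks

    def tick(self):
        # Called once per simulated clock interrupt.
        if self.expired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True  # a real kernel would force a TOC at this point


def simulate(timeout_ticks, daemon_runs_at, total_ticks):
    """Return the tick at which the timer expires, or None if it never does.

    daemon_runs_at is the set of ticks on which the daemon gets to run
    and reset the timer.
    """
    timer = SafetyTimer(timeout_ticks)
    for t in range(1, total_ticks + 1):
        if t in daemon_runs_at:
            timer.reset()
        timer.tick()
        if timer.expired:
            return t
    return None


if __name__ == "__main__":
    # Daemon runs on every tick: the timer never expires.
    print(simulate(3, set(range(1, 11)), 10))
    # Daemon stops running after tick 2: the timer expires a few ticks later.
    print(simulate(3, {1, 2}, 10))
```

The point of the model is that the timer does not care why the daemon stopped resetting it. A daemon that exited, one that was starved of CPU, and a hung kernel all look the same to it.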
Regards
Prashant
For success, attitude is equally as important as ability
Prashanth Waugh
Esteemed Contributor

Re: cmhaltpkg hangs the system

Hi,

Please paste the output of the file
/etc/cmcluster/pkg1/pkg1.cntl.log for more details.

Regards
Prashant
For success, attitude is equally as important as ability
Suraj K Sankari
Honored Contributor

Re: cmhaltpkg hangs the system

Hi,
See the 3rd line
>>Feb 16 14:30:07 node1 cmsrvassistd[20952]: Unable to communicate with ServiceGuard main daemon (cmcld): Can't assign requested address

Please check whether your cmcld daemon is running.

Are you able to run cmviewcl?

Suraj
moonchild
Regular Advisor

Re: cmhaltpkg hangs the system

Attached is the cntl file.

We also had a crash file created.
moonchild
Regular Advisor

Re: cmhaltpkg hangs the system

Suraj,

Yes, cmcld is running, and yes, we can run cmviewcl.

thanks
Wim Rombauts
Honored Contributor

Re: cmhaltpkg hangs the system

What version of Serviceguard are you running? Have you checked whether you have the latest Serviceguard patch installed, or whether there is a patch that describes the error messages you are seeing?

It's just that this sounds like a Serviceguard internal issue.
smatador
Honored Contributor

Re: cmhaltpkg hangs the system

Hi,
I agree with Wim about the SG patches.
There should be a cmcld core file in /var/adm/cmcluster. It is advisable to install a current Serviceguard patch.
moonchild
Regular Advisor

Re: cmhaltpkg hangs the system

SG rev 11.15

#cmgetconf |grep E_T /etc/cmcluster/*

/etc/cmcluster/edlrp.tar:# Enter the package type for this package. PACKAGE_TYPE indicates
/etc/cmcluster/edlrp.tar:# NOTE: Packages which have a PACKAGE_TYPE of SYSTEM_MULTI_NODE are
/etc/cmcluster/edlrp.tar:# Examples : PACKAGE_TYPE FAILOVER (default)
/etc/cmcluster/edlrp.tar:# PACKAGE_TYPE SYSTEM_MULTI_NODE
/etc/cmcluster/edlrp.tar:PACKAGE_TYPE FAILOVER
/etc/cmcluster/gifcl.conf:# ServiceGuard cluster parameters, including NODE_TIMEOUT and
/etc/cmcluster/gifcl.conf:# The NODE_TIMEOUT parameter defaults to 2000000 (2 seconds).
/etc/cmcluster/gifcl.conf:# The maximum value recommended for NODE_TIMEOUT is 30000000
/etc/cmcluster/gifcl.conf:NODE_TIMEOUT 8000000
/etc/cmcluster/gifcl.conf.new:# ServiceGuard cluster parameters, including NODE_TIMEOUT and
/etc/cmcluster/gifcl.conf.new:# The NODE_TIMEOUT parameter defaults to 2000000 (2 seconds).
/etc/cmcluster/gifcl.conf.new:# The maximum value recommended for NODE_TIMEOUT is 30000000
/etc/cmcluster/gifcl.conf.new:NODE_TIMEOUT 8000000
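
Since the config comments give the default as 2000000 (2 seconds), the value is in microseconds, so 8000000 works out to 8 seconds. A small sketch of reading that value from the text of a cluster config file and converting it follows. The function name `node_timeout_seconds` is made up for illustration, and it expects the file contents themselves, not grep output with the filename prefix.

```python
def node_timeout_seconds(config_text):
    """Return NODE_TIMEOUT in seconds from cluster config text, or None.

    Assumes the value is in microseconds, as the config file comments
    state ("defaults to 2000000 (2 seconds)").
    """
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = stripped.split()
        if len(parts) == 2 and parts[0] == "NODE_TIMEOUT":
            return int(parts[1]) / 1_000_000
    return None


if __name__ == "__main__":
    sample = (
        "# The NODE_TIMEOUT parameter defaults to 2000000 (2 seconds).\n"
        "NODE_TIMEOUT 8000000\n"
    )
    print(node_timeout_seconds(sample))
```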

and the q4 stack trace shows:
stack trace for event 0
crash event was a TOC
wait_for_lock+0x144
sl_retry+0x1c
safety_time_check+0xfc
per_spu_hardclock+0xc4
clock_int+0x94
mp_ext_interrupt+0x3ec
ivti_patch_to_nop3+0x0
pset_idle_loop+0x120
idle+0x738
swidle_exit+0x0

thanks in advance
John Bigg
Esteemed Contributor

Re: cmhaltpkg hangs the system

One possibility, if you have a single-CPU system, is that you hit defect 5 described in the 11.16 patch PHSS_35862:

5. Defect: JAGag28374 SR: 8606473752
Serviceguard on uniprocessor systems can lead to
cmcld consuming 100% of cpu resulting in a hang or system
TOC. This does not apply to multi-processor systems.

If you are on a single-CPU system, you will need to upgrade: there is no 11.15 patch, because that release is no longer supported.