Operating System - HP-UX

Jeff_Traigle
Honored Contributor

Serviceguard membership error

Hopefully one of the Serviceguard gurus on here can shed some light on this. We had a network outage that caused complete loss of connectivity between the two nodes of a cluster. Things got into a rather funky state, which I'm still trying to piece together from syslog entries, but it appears that the passive node, host2 (the one that had its network connectivity yanked out from under it), gained control of the lock disk during cluster reformation. Although it supposedly released the lock when it couldn't start the package because of the network problem, host1 never regained control automatically. In fact, early in the episode host1 hung for 20 minutes (no entries in syslog) and then TOC'ed. A couple of hours later, host2 was reconnected to the network, at which point the following error occurred on host1 and it TOC'ed again.

daemon-err 2009-08-09 03:19:16 cmcld: Serviceguard fatal error in membership, sbd/sbd.c 556
daemon-err 2009-08-09 03:19:16 cmcld: Remote members: host2
daemon-err 2009-08-09 03:19:16 cmcld: Local members: host1
daemon-err 2009-08-09 03:19:16 cmcld: Could not enable safety time.
daemon-info 2009-08-09 03:19:16 cmcld: Aborting: sbd/sbd.c 672 (FATAL MEMBERSHIP ERROR DETECTED)
--
Jeff Traigle
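
For anyone piecing together a similar episode from syslog, the excerpt above can be summarized with a small script. This is only a sketch: the line layout (level, date, time, then "cmcld:") is assumed from the five lines quoted in the post, and summarize_membership_error is a hypothetical helper, not part of Serviceguard.

```python
import re

# Layout assumed from the excerpt in the post:
# "<facility-level> <YYYY-MM-DD> <HH:MM:SS> cmcld: <message>"
LINE_RE = re.compile(
    r"^(?P<level>\S+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+cmcld:\s+(?P<msg>.*)$"
)


def summarize_membership_error(lines):
    """Collect local/remote member lists and the abort message (hypothetical helper)."""
    summary = {"local": [], "remote": [], "abort": None, "timestamp": None}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a cmcld line in the assumed layout
        msg = match.group("msg")
        if msg.startswith("Remote members:"):
            summary["remote"] = msg.split(":", 1)[1].split()
        elif msg.startswith("Local members:"):
            summary["local"] = msg.split(":", 1)[1].split()
        elif msg.startswith("Aborting:"):
            summary["abort"] = msg
            summary["timestamp"] = f"{match.group('date')} {match.group('time')}"
    return summary


# The lines quoted in the post, used here as sample input.
SAMPLE = [
    "daemon-err 2009-08-09 03:19:16 cmcld: Serviceguard fatal error in membership, sbd/sbd.c 556",
    "daemon-err 2009-08-09 03:19:16 cmcld: Remote members: host2",
    "daemon-err 2009-08-09 03:19:16 cmcld: Local members: host1",
    "daemon-err 2009-08-09 03:19:16 cmcld: Could not enable safety time.",
    "daemon-info 2009-08-09 03:19:16 cmcld: Aborting: sbd/sbd.c 672 (FATAL MEMBERSHIP ERROR DETECTED)",
]

summary = summarize_membership_error(SAMPLE)
```

Running the same helper over the syslog of both nodes and comparing the member lists and timestamps makes it easier to see which node considered itself the surviving member at each point.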
3 REPLIES
Stephan.
Honored Contributor

Re: Serviceguard membership error

Taken from "Managing Serviceguard Sixteenth Edition"

http://docs.hp.com/en/B3936-90140/ch03s01.html#d0e2638

>> The cmcld daemon sets a safety timer in the kernel which is used to detect kernel hangs. If this timer is not reset periodically by cmcld, the kernel will cause a system TOC (Transfer of Control) or INIT, which is an immediate system reset without a graceful shutdown. (This manual normally refers to this event simply as a system reset.) This could occur because cmcld could not communicate with the majority of the cluster's members, or because cmcld exited unexpectedly, aborted, or was unable to run for a significant amount of time and was unable to update the kernel timer, indicating a kernel hang.
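
The mechanism described in that passage is a watchdog: a deadline that must keep being pushed forward, or the system resets. The following is a generic simulation of that idea, not Serviceguard's actual implementation; SafetyTimer, simulate, and the timeout values are hypothetical names and numbers chosen for illustration.

```python
class SafetyTimer:
    """Generic watchdog model (hypothetical; not the real kernel timer)."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s

    def reset(self, now):
        # Each successful reset pushes the deadline forward by the timeout.
        self.deadline = now + self.timeout_s

    def expired(self, now):
        return now >= self.deadline


def simulate(reset_times, timeout_s, end_time):
    """Return the time at which the timer would fire, or None if it never does.

    reset_times models the moments at which the daemon manages to reset the
    timer; a gap longer than timeout_s models a hang or an aborted daemon.
    """
    timer = SafetyTimer(timeout_s)
    for t in sorted(reset_times):
        if t > end_time:
            break
        if timer.expired(t):
            # The deadline passed before this reset could happen.
            return timer.deadline
        timer.reset(t)
    return timer.deadline if timer.expired(end_time) else None


healthy = simulate([1, 2, 3, 4], timeout_s=2.0, end_time=5)   # resets keep up
stalled = simulate([1, 2], timeout_s=2.0, end_time=10)        # resets stop after t=2
gap = simulate([1, 5], timeout_s=2.0, end_time=10)            # one long gap
```

In this model the timer fires whenever the resets stop or fall too far apart, regardless of why. That matches the manual's point that an aborted cmcld and a hung kernel both end in a system reset.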
melvyn burnard
Honored Contributor

Re: Serviceguard membership error

It would help if we knew which SG version and which SG patch are installed on both nodes:
what /usr/lbin/cmcld
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Jeff_Traigle
Honored Contributor

Re: Serviceguard membership error

It would indeed. Sorry about that.

HP-UX 11.11
Serviceguard A.11.16

#-> what /usr/lbin/cmcld
/usr/lbin/cmcld:
HP92453-02A.11.00 HP-UX SYMBOLIC DEBUGGER (END.O ILP32) $Revision: 75.02 $
Build date: Mon Nov 12 11:52:56 PST 2007
Build id: ibld_sg_a1116patch_1111_product
Build platform: hpux
Cluster Monitor Product $Revision: 82.2 $
Cluster Monitor Product Only $Revision: 82.2 $
Daemon
A.11.16.00 Date: 11/12/07 Patch: PHSS_36898
--
Jeff Traigle
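
Since the question was about both nodes, the version and patch line can be pulled out of the `what` output and compared across hosts. This is a sketch only: the "version Date: ... Patch: ..." layout is assumed from the single output shown above, and parse_cmcld_what is a hypothetical helper.

```python
import re

# Layout assumed from the output in the post:
# "A.11.16.00 Date: 11/12/07 Patch: PHSS_36898"
VERSION_RE = re.compile(
    r"^\s*(?P<version>[A-Z]\.\d+\.\d+\.\d+)\s+Date:\s+(?P<date>\S+)\s+Patch:\s+(?P<patch>\S+)"
)


def parse_cmcld_what(output):
    """Return (version, patch) from `what /usr/lbin/cmcld` output, or None (hypothetical helper)."""
    for line in output.splitlines():
        match = VERSION_RE.match(line)
        if match:
            return match.group("version"), match.group("patch")
    return None


# Tail of the output posted for host1, used as sample input.
HOST1_OUTPUT = """\
Cluster Monitor Product $Revision: 82.2 $
Cluster Monitor Product Only $Revision: 82.2 $
Daemon
A.11.16.00 Date: 11/12/07 Patch: PHSS_36898
"""

host1_info = parse_cmcld_what(HOST1_OUTPUT)
```

Capturing the same output on host2 and comparing the two tuples would show whether the nodes are running the same Serviceguard patch level.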