
Server performs crash dump

Rashid Hamid
Regular Advisor

Hi All

I have MC/ServiceGuard (MCSG) running on 2 nodes, hp1 (rp7420) and hp2 (7400). The problem occurred when I pulled out the primary LAN and standby LAN cables on hp1: all packages failed over to hp2 without any problem, BUT hp1 performed a crash dump.

Thanks
I'm Parit Madirono/Parit Betak Boyz
4 REPLIES
Patrick Wallek
Honored Contributor

Re: Server performs crash dump

Yes. That is perfectly normal.

In the event of a failure like that the machine will TOC (transfer of control) which creates a crash dump. The intent, I believe, is to have the crash dump to allow you to look into the root cause of the issue.

Also, since there was a problem, the machine that does not get control will TOC to make sure that all resources necessary for the packages are available to the other node.

This is discussed in detail in the MC/SG manuals, available here:

http://docs.hp.com/en/oshpux11iv2.html#Serviceguard

Patrick Wallek
Honored Contributor

Re: Server performs crash dump

For more information have a read through the "Responses to Failures" section of Chapter 3 - "Understanding Serviceguard Software Components" of the "Managing Serviceguard" manual. It specifically talks about conditions that can initiate a TOC.

In your case, this quote applies: "A TOC is done if a cluster node cannot communicate with the majority of cluster members for the predetermined time,..." Pulling the LAN cables means the nodes could no longer communicate with each other.

The "Responses to Failures" section is here:

http://docs.hp.com/en/B3936-90100/ch03s07.html

The whole manual is available from the link I gave above.
Rashid Hamid
Regular Advisor

Re: Server performs crash dump

Thanks Patrick for the explanation.
I have another 2-node MCSG cluster where I pulled out the primary and standby network cables, and there was no TOC at all.

I'm Parit Madirono/Parit Betak Boyz
Stephen Doud
Honored Contributor

Re: Server performs crash dump

Serviceguard uses the concept of heartbeat messages between servers to verify that each member node is active.
If your cluster is configured to send heartbeat on only one LAN and you break that LAN, then Serviceguard has to reform the cluster and identify which nodes in the cluster continue to operate, and which must be rebooted to preserve data integrity.

If other clusters can survive such a test, then they must have multiple heartbeat networks.
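To illustrate the difference between the two clusters, here is a minimal sketch in Python. It is not Serviceguard code; the function name and the LAN names (lan0, lan2) are made up for the example. It only models the idea that heartbeat is considered lost when every network configured to carry it is down.

```python
def heartbeat_lost(heartbeat_lans):
    """Return True when no heartbeat network is still working.

    heartbeat_lans maps a (hypothetical) LAN name to True if that LAN
    can still carry heartbeat messages, False if it is broken.
    """
    return not any(heartbeat_lans.values())


# Cluster with heartbeat on a single LAN: pulling its cables loses heartbeat.
single_hb = {"lan0": False}

# Cluster with a second heartbeat network: heartbeat survives on lan2.
dual_hb = {"lan0": False, "lan2": True}

print(heartbeat_lost(single_hb))
print(heartbeat_lost(dual_hb))
```

In the first case the cluster has to reform and one node may TOC; in the second case heartbeat continues on the remaining network, so no reformation is needed.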

cmviewconf will show which networks are configured for heartbeat.

In the case where all heartbeat paths are broken, Serviceguard must use a rule to decide whether a server must crash or continue operation in the cluster.
In a scenario where HB fails between two equal-sized groups of nodes (e.g. 1-1, 2-2, 3-3), Serviceguard requires the use of a cluster lock disk or Quorum Server to arbitrate which half of the cluster is allowed to continue, and consequently, which half must crash.
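
The rule can be sketched roughly as follows. This is only an illustration in Python, not how Serviceguard is implemented; the names reform_decision, Outcome and holds_tiebreaker are invented for the example, and holds_tiebreaker stands in for "this group obtained the cluster lock disk or was chosen by the Quorum Server".

```python
from enum import Enum


class Outcome(Enum):
    CONTINUE = "continue running in the reformed cluster"
    HALT = "halt (TOC)"


def reform_decision(group_size, previous_size, holds_tiebreaker=False):
    """Decide what a group of nodes that can still talk to each other does.

    group_size       -- nodes in this group after heartbeat was lost
    previous_size    -- nodes in the cluster before the failure
    holds_tiebreaker -- True if this group won the lock disk / Quorum Server
    """
    if not 0 < group_size <= previous_size:
        raise ValueError("group_size must be between 1 and previous_size")
    if 2 * group_size > previous_size:
        # Clear majority: no arbitration needed.
        return Outcome.CONTINUE
    if 2 * group_size == previous_size and holds_tiebreaker:
        # Exactly half: only the group that wins arbitration survives.
        return Outcome.CONTINUE
    return Outcome.HALT


# 2-node cluster split 1-1: the node that wins the lock continues, the other TOCs.
print(reform_decision(1, 2, holds_tiebreaker=True))
print(reform_decision(1, 2, holds_tiebreaker=False))
```

For the 2-node case in this thread, each node is exactly half of the cluster, so the outcome depends entirely on which node wins the lock disk or Quorum Server.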