Operating System - HP-UX

loss of network with MCSG

 
Thevenet_1
Occasional Advisor

Hello

I ran the following test on one of our two MCSG nodes:
I disconnected all network links (including the heartbeat) from node A. I was surprised to see that node A acquired the lock and that node B (the only node still working properly) performed a TOC.

Could you please confirm that this is expected behaviour?

Thanks
4 REPLIES
Jeff Schussele
Honored Contributor

Re: loss of network with MCSG

Where was the package running at the time?
If it was on node A the TOC of node B would make sense.
If it wasn't on node A then the lock race was obviously won by node A resulting in the TOC of node B.
You have to remember that the nodes are not checking themselves but are checking for the existence of their fellow node members. So when node B could no longer "see" node A it decided to TOC itself so that there would be no "split-brain" possibility.

If you had only pulled the public network and not the heartbeat the pkg would have failed over to node B. But by pulling the heartbeat as well you forced a lock contention which node A won.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Stephen Doud
Honored Contributor

Re: loss of network with MCSG

A server has no way of knowing where an external network break is located, short of running a time-domain reflectometry (TDR) test.
Serviceguard can only determine that it cannot communicate with the other server.

When both nodes experience the same breakdown in heartbeat connection, they both seek the cluster arbitration device, and whichever one gets to it first is authorized to reform a 1-node cluster... and the last-arriving node is forced to TOC (dump/reboot).
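
The race described above can be sketched in a few lines of Python. This is only an illustrative model, not Serviceguard code: the `LockDisk` class, the `arbitrate` function and the outcome strings are hypothetical names made up for the example, and the only rule modelled is "first node to reach the arbitration device wins, every later node TOCs".

```python
class LockDisk:
    """Hypothetical stand-in for the cluster arbitration device."""

    def __init__(self):
        self.owner = None  # no node holds the lock yet

    def try_claim(self, node):
        # The first claimant takes the lock; later claimants are refused.
        if self.owner is None:
            self.owner = node
            return True
        return False


def arbitrate(arrival_order):
    """Map each node to its outcome, given the order in which the
    nodes reach the lock after losing all heartbeat connections."""
    lock = LockDisk()
    outcome = {}
    for node in arrival_order:
        if lock.try_claim(node):
            outcome[node] = "reforms 1-node cluster"
        else:
            outcome[node] = "TOC"
    return outcome


# Node A happens to reach the lock first, even though A is the node
# whose cables were pulled; the lock cannot tell the difference.
result = arbitrate(["A", "B"])
```

Note that nothing in the model looks at which node is healthy or where the package runs: only the arrival order matters, which is why the isolated node can win.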
Kent Ostby
Honored Contributor

Re: loss of network with MCSG

That is quite possible.

Keep in mind that all that is going on is a race to the lock disk to decide who will stay up and who will go down.

You may want to check out this document, which discusses the differences between using a lock disk and using a quorum server:

http://www2.itrc.hp.com/service/cki/docDisplay.do?docLocale=en_US&docId=UMCSGKBRC00012642

ITRC DOC ID: UMCSGKBRC00012642
"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"
melvyn burnard
Honored Contributor

Re: loss of network with MCSG

This is entirely possible: when ALL of the networks fail, each node assumes the other is dead and therefore goes for the lock disc. The first one to get it locks the other node out, regardless of where the package is currently running.
Remember, you induced an MPOF (multiple points of failure), which Serviceguard is not generally designed to cater for.
There is one solution, and that is to use a serial heartbeat, but this can be troublesome in its own right.
Also, having a Quorum Server would help as node A would not have been able to get to the lock.
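
To illustrate why a Quorum Server changes the outcome of this particular test, here is a rough Python sketch. It is a simplified model under one stated assumption: a node can only ask the quorum server for the tie-break if it still has a working network path to it. The function name `quorum_arbitrate`, the `reachable` set and the outcome strings are hypothetical and exist only for this example.

```python
def quorum_arbitrate(arrival_order, reachable):
    """Map each node to its outcome when arbitration is done by a
    network-attached quorum server instead of a lock disk.

    arrival_order: nodes in the order they attempt arbitration
    reachable:     set of nodes that can still reach the quorum server
                   (assumption: a request needs a live network link)
    """
    granted_to = None
    outcome = {}
    for node in arrival_order:
        if node in reachable and granted_to is None:
            granted_to = node          # first reachable node wins
            outcome[node] = "stays up"
        else:
            outcome[node] = "TOC"      # unreachable, or arrived too late
    return outcome


# Node A has every cable pulled, so only node B can reach the server.
# A tries first but gets no answer; B is granted the tie-break.
result = quorum_arbitrate(["A", "B"], reachable={"B"})
```

In this model the isolated node can never win, whatever the timing, because its request never arrives.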
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!