
Strange behavior in a cluster

 
SOLVED

Strange behavior in a cluster

Hi,

I have a two-node cluster running HP-UX 11.11 and MC/SG 11.16. The cluster shows a strange behavior that I don't know is normal; each node has 2 LAN cards assigned to the heartbeat.
Node A is running 3 Oracle packages, Node B is running 2 packages. When I disconnect the primary and secondary heartbeat LANs from Node B, the cluster goes into a reforming state, but after that Node B goes down and its 2 packages fail over to the other side, I mean Node A.
When I try the same test the other way (Node A running its 3 packages, Node B running its 2 packages) and disconnect the primary and secondary LAN cards from Node A, the cluster again goes into a reforming state, but after that Node B goes down and starts creating a dump.

I don't know why Node A does not go down and fail over its 3 packages to Node B.

Both tests end the same way: Node A remains up and Node B goes down.

Is that OK?


Thank you for your quick response and your help.

Lissete C.
melvyn burnard
Honored Contributor

Re: Strange behavior in a cluster

This is a two-node cluster, correct? Are you using a cluster lock disc or a quorum server?
Check the OLDsyslog.log of node B after the second test, and see whether there is anything in there.
I suspect you are using a cluster lock disc, and when you removed the heartbeats, node A grabbed the cluster lock, forcing node B to TOC.
Do you only have the two LANs between the nodes? If there are more, put your heartbeats over all of them.
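As a rough sketch of both checks (the cluster name "mycluster" and the file paths below are placeholders, not from your setup): look at node B's old syslog from the last test, and confirm which LANs currently carry a heartbeat in the cluster ASCII file:

  # On node B: look for cmcld heartbeat-loss, reformation and TOC messages
  more /var/adm/syslog/OLDsyslog.log

  # Dump the running cluster configuration to an ASCII file
  cmgetconf -c mycluster /tmp/mycluster.ascii

  # Every LAN that should carry heartbeat needs a HEARTBEAT_IP entry under each node
  egrep 'NODE_NAME|NETWORK_INTERFACE|HEARTBEAT_IP|STATIONARY_IP' /tmp/mycluster.ascii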
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Steve Lewis
Honored Contributor

Re: Strange behavior in a cluster

Firstly you must remember that MC/SG is designed to guard against SINGLE points of failure. Straight away your test is not valid because you have created more than one failure at the same time. MC/SG is not designed to cope with that.

Secondly, when you unplug both network cards, the servers have to use cluster-lock methods to determine who should own the cluster. If they lose all connection to each other, then both nodes will assume that the other one might have failed, not that you unplugged the cables from their side of the switch.
- unplug both from B: B assumes A has failed, A also assumes B has failed.
- unplug both from A: A assumes B has failed, B also assumes A has failed.

The only difference between your two tests is that the other server still has link/carrier connectivity to the switch (unless you are using crossover cables, but you didn't say that).

I am not surprised that you get different results, but I would half expect in this scenario that both servers might TOC and re-race for the cluster lock when they reboot.

Steven E. Protter
Exalted Contributor

Re: Strange behavior in a cluster

Shalom,

No, this does not sound normal. I think you should:

Run tail -f /var/adm/syslog/syslog.log on both systems (do not connect through the floating IP addresses) and re-run your tests.

I'm assuming the objective here is for all 5 packages to run on one node when for any reason one node is out of the cluster.
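As a small sketch (node names are placeholders), from a console or stationary-IP session on each node:

  # nodeA, session 1
  tail -f /var/adm/syslog/syslog.log

  # nodeB, session 2
  tail -f /var/adm/syslog/syslog.log

  # Then pull the cables and watch for cmcld messages about heartbeat loss,
  # cluster reformation and the cluster lock on both sides.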

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
A. Clay Stephenson
Acclaimed Contributor

Re: Strange behavior in a cluster

Your testing is flawed. You are breaking two things at once, and MC/SG is designed to handle single points of failure. In any event, when you disconnect cables like this so that heartbeat is lost, whoever acquires the cluster lock wins and the other node TOCs to prevent split-brain syndrome. It's not predictable which node will crash because one can't predict which node will acquire the lock first.

In any event, you should have yanked one network cable (which would simply trigger a LAN failover rather than a node switch), killed one network switch, yanked one SCSI cable, killed one disk array, yanked the power cord on one host, ... all of these are SPOFs, but yanking more than one network cable is an MPOF. The fundamental problem is that although heartbeat was lost, the connection to the lock disk was still attached to both nodes -- so a crapshoot results.
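For illustration only (the cluster name and device file below are placeholders): you can see which disk is the lock, and confirm that both nodes still reach it even with the heartbeat LANs unplugged:

  # Find the configured lock disk in the cluster ASCII file
  cmgetconf -c mycluster /tmp/cl.ascii
  grep CLUSTER_LOCK /tmp/cl.ascii          # FIRST_CLUSTER_LOCK_VG / FIRST_CLUSTER_LOCK_PV

  # On each node, check that the lock physical volume is still reachable
  diskinfo /dev/rdsk/c4t0d0                # placeholder device file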

If it ain't broke, I can fix that.

Re: Strange behavior in a cluster

Thanks guys for your help,

The cluster has a cluster lock disk, and the heartbeat LANs are not crossovers.

So there is no way to control possession of the cluster lock disk in order to make the test?

Regards,

Lissete C.
Steven E. Protter
Exalted Contributor

Re: Strange behavior in a cluster

You can change the cluster lock disk as follows (a command sketch follows the steps):

1) Edit the cluster ASCII configuration file and pick a different disk that is present on both nodes.
2) cmcheckconf
3) cmapplyconf
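A rough command sketch of those steps (cluster name, file name and disk paths are placeholders; depending on your Serviceguard version you may need to halt the cluster before the lock disk can be changed, so check the manual first):

  # 1) Get the current cluster ASCII file and edit the lock entries
  cmgetconf -c mycluster /etc/cmcluster/mycluster.ascii
  vi /etc/cmcluster/mycluster.ascii
  #    FIRST_CLUSTER_LOCK_VG  /dev/vglock
  #    ...and under each NODE_NAME:
  #    FIRST_CLUSTER_LOCK_PV  /dev/dsk/c4t0d0

  # 2) Verify the edited configuration
  cmcheckconf -C /etc/cmcluster/mycluster.ascii

  # 3) Distribute/apply it to both nodes
  cmapplyconf -C /etc/cmcluster/mycluster.ascii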

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
A. Clay Stephenson
Acclaimed Contributor

Re: Strange behavior in a cluster

Not as described. The correct test is to literally yank the power cord(s) on one node and the packages should all move to the surviving node. If you are a bit afraid of killing a UNIX box like this then you haven't built your cluster robustly enough. Your yanking of multiple network cables should actually work as expected sometimes; it all depends on which node wins the lock acquisition race.
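As a sketch of verifying that test (nothing here is specific to your packages):

  # On the node that should survive, before and after pulling the power
  # on the other node; all five packages should end up running here
  cmviewcl -v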
If it ain't broke, I can fix that.
melvyn burnard
Honored Contributor

Re: Strange behavior in a cluster

Yes: add a further subnet between the two nodes, configured to carry the heartbeat as well. Then do your test again, but do not pull this subnet along with the other two.

The cluster should then stay up, and the packages will halt on node A and move to node B (provided they are monitoring the subnet).
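For the subnet-monitoring part, a sketch (the subnet values and file paths are placeholders): each legacy package ASCII file should list the subnets it monitors, and the change is re-checked and re-applied like any other package edit:

  # In e.g. /etc/cmcluster/pkg1/pkg1.conf:
  #   SUBNET   192.168.1.0     # existing monitored subnet
  #   SUBNET   10.10.10.0      # the additional heartbeat subnet, also monitored

  cmcheckconf -P /etc/cmcluster/pkg1/pkg1.conf
  cmapplyconf -P /etc/cmcluster/pkg1/pkg1.conf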
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Patrick Wallek
Honored Contributor

Re: Strange behavior in a cluster

No, you cannot control which node takes the cluster lock disk.

Steven E. Protter
Exalted Contributor
Solution

Re: Strange behavior in a cluster

I failed to understand the question.

The cluster lock disk is a disk that both systems race to control in the event that heartbeat is lost.

The system that loses the race does a TOC (Transfer of Control) to prevent split-brain syndrome from corrupting your shared data.

You can try to manipulate the outcome but really it is a race between electrons and nothing you can safely do will change the outcome.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Re: Strange behavior in a cluster

OK guys,

The thing is not to confuse the Serviceguard principles.

Thanks a lot for your help
Thomas J. Harrold
Trusted Contributor

Re: Strange behavior in a cluster

As Melvyn mentions above, a good design will allow the heartbeat over multiple networks (public and private) to eliminate the loss of a network as a single point of failure.

If you configure your cluster this way, pulling BOTH private heartbeat cables will have absolutely NO effect on your running cluster.
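One way to sanity-check which subnets the cluster could carry heartbeat over (node names and the output file are placeholders) is to let Serviceguard probe the connectivity itself:

  # Writes a cluster ASCII template listing every subnet both nodes share;
  # any of them can then be given HEARTBEAT_IP entries
  cmquerycl -v -n nodeA -n nodeB -C /tmp/probe.ascii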

-tjh
I learn something new everyday. (usually because I break something new everyday)