Operating System - HP-UX

Two node SG cluster with one network

 
swaggart
Advisor

Two node SG cluster with one network

Hi,

I have a two node cluster running HP-UX 11.31 and SG 11.18.
They're connected with only one VLAN, which of course also carries the heartbeats.

I'm trying to test failover by unplugging the LAN cable on the node running the package, expecting the package to go down there and start on the second node.
The only thing that happens is that the second node reboots.

Can anyone help me with this config?

Regards
8 REPLIES
Steven E. Protter
Exalted Contributor
Solution

Re: Two node SG cluster with one network

Shalom,

You have not properly configured SG.

SG needs a separate network for the heartbeat, or it cannot respond normally to network problems.

A hub between two non-primary NIC cards is enough.
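
To illustrate (node names, interface names, and addresses here are made up, not taken from your cluster), the cluster ASCII file would then carry a HEARTBEAT_IP on both networks:

# Sketch of /etc/cmcluster/cluster.ascii fragments -- adapt to your setup
NODE_NAME node1
  NETWORK_INTERFACE lan0
    HEARTBEAT_IP 192.168.1.11    # data LAN, also carries heartbeat
  NETWORK_INTERFACE lan1
    HEARTBEAT_IP 10.0.0.11       # dedicated heartbeat LAN via the hub

NODE_NAME node2
  NETWORK_INTERFACE lan0
    HEARTBEAT_IP 192.168.1.12
  NETWORK_INTERFACE lan1
    HEARTBEAT_IP 10.0.0.12

cmquerycl -v -C /etc/cmcluster/cluster.ascii -n node1 -n node2 will generate a template listing the interfaces it discovers; check it with cmcheckconf -C and apply it with cmapplyconf -C.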

The second node rebooting is called a TOC (Transfer Of Control). This is a normal response to loss of heartbeat.

The two nodes race for control of the lock device; the second node loses this race and is rebooted to avoid data corruption.

Take a look at the logs and you will see the response is normal. Your configuration is not robust and is unreliable by design.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
melvyn burnard
Honored Contributor

Re: Two node SG cluster with one network

If you only have one network connection, you have an unsupported configuration.
What you are seeing indicates to me that you are using a cluster lock disk rather than a Quorum Server, in which case this is the normal scenario given only one heartbeat network connection.
The server that gets the cluster lock will stay up, and the other node will be forced to TOC.
I also suspect that the cluster lock disk is in a VG that the package uses; since the node running the package already has that VG activated, it has faster access to the lock.

Consider using more than one network for a standby or additional heartbeat, or use a Quorum Server rather than a cluster lock disk.
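
As a sketch (device paths and hostnames below are examples only), the two quorum options appear in the cluster ASCII file as:

# Option 1: cluster lock disk
FIRST_CLUSTER_LOCK_VG /dev/vglock
# ...and under each NODE_NAME entry:
FIRST_CLUSTER_LOCK_PV /dev/dsk/c2t1d0

# Option 2: Quorum Server on a third system
QS_HOST qshost
QS_POLLING_INTERVAL 300000000    # microseconds (default is 5 minutes)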

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Jeeshan
Honored Contributor

Re: Two node SG cluster with one network

Did you configure LAN failover?

I guess this is happening because there is no LAN failover configuration.
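
For what it's worth, a standby LAN for local failover is just an interface listed in the cluster ASCII file with no IP address on it (names here are examples):

NODE_NAME node1
  NETWORK_INTERFACE lan0
    HEARTBEAT_IP 192.168.1.11
  NETWORK_INTERFACE lan1    # standby: no IP entry; must be on the same bridged net

With a standby configured, a failed primary NIC causes a local IP failover to lan1 instead of a cluster re-formation.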
a warrior never quits

Re: Two node SG cluster with one network

Yes I can help you with this config...

by telling you it's not a supported configuration:

http://docs.hp.com/en/B3936-90122/ch02s02.html

How on earth do you expect this sort of configuration to work? With only one network connection and a 2-node cluster, if the network connection is broken, the other node doesn't "know" the state of the first node. I'm assuming you're using a cluster lock disk, so in this case you'll get a race for the cluster lock as the only way to determine cluster membership. Unfortunately the node that *you* know is good loses the race (of course the cluster nodes have no way of knowing who is good, or at least no way of knowing who is *better*).

This sort of config can be made to work a little better if you use a quorum server on a third node somewhere instead of a cluster lock disk - that way, only a node with a surviving network connection can win the race for quorum.
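
If you do go the quorum server route, the third system only needs the QS daemon and an authorization file; roughly (hostnames are examples, paths as per the Quorum Server release notes):

# /etc/cmcluster/qs_authfile on the quorum server host -- nodes allowed to use it:
node1.mydomain.com
node2.mydomain.com

# /etc/inittab entry so the daemon respawns:
qs:345:respawn:/usr/lbin/qs >> /var/adm/qs/qs.log 2>&1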

HTH

Duncan

I am an HPE Employee
swaggart
Advisor

Re: Two node SG cluster with one network

Ok, I got it now.

Somehow I thought this would be a logical setup.
I will try to get a dedicated heartbeat LAN connected.
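
If I read the manuals right, the steps will be something like this (our node names; I believe the cluster has to be down for this on our version):

cmhaltcl -f                                   # halt the cluster
cmquerycl -v -C /etc/cmcluster/cluster.ascii -n node1 -n node2
# edit cluster.ascii: add HEARTBEAT_IP entries for the new LAN on each node
cmcheckconf -C /etc/cmcluster/cluster.ascii   # verify
cmapplyconf -C /etc/cmcluster/cluster.ascii   # apply
cmruncl -v                                    # start the cluster again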

Thank you, all.
VK2COT
Honored Contributor

Re: Two node SG cluster with one network

Hello,

You have already received stern warnings that your Serviceguard configuration is not designed according to best practices.

In fact, you are running an unsupported setup.

By the way, the HP ACSL lab has created a tool which can be used by HP Support and Consulting to optimize the Node Timeout and Heartbeat Interval values used in Serviceguard clusters.

The HELM (Heartbeat Exchange Latency Monitor) tool runs on HP-UX 11iv1 (11.11), 11iv2 (11.23), and 11iv3 (11.31), and measures latency for the cluster nodes (which might be caused by network delays or heavy system loads) over a user-defined period of time. When the HELM run is complete, the tool outputs the measured latencies and, based on these measurements, suggests optimized values for the NODE_TIMEOUT and HEARTBEAT_INTERVAL cluster configuration parameters, both for standard Serviceguard clusters and for clusters utilizing the Serviceguard Extension for Faster Failover product.
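
For reference, those two parameters are set in the cluster ASCII file in microseconds; the usual shipped defaults (which HELM may tell you to change) look like:

HEARTBEAT_INTERVAL 1000000    # 1 second (default)
NODE_TIMEOUT       2000000    # 2 seconds (default; often raised on busy systems)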

When I teach Serviceguard (coincidentally, I am teaching the HP H6487 course next week here in Australia), I always mention HELM too. It is a pity that not many people are aware of it.

Cheers,

VK2COT
VK2COT - Dusan Baljevic
sujit kumar singh
Honored Contributor

Re: Two node SG cluster with one network

hi

The syslog can be checked to confirm which node became the cluster coordinator during the cluster re-formation; in this case it is the node on which the package was running.

When the heartbeat cable is pulled from the primary node, a cluster re-formation occurs. The active node, on which the cluster coordinator had been running, becomes the master and coordinator, and the other node is forced to TOC because the heartbeat can no longer be received from it.

This is, I think, what should happen normally.

You can refer to the syslog on both nodes for this event.
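
For example (standard HP-UX syslog path; the cluster daemon is cmcld):

grep cmcld /var/adm/syslog/syslog.log | more   # re-formation and timeout messages
cmviewcl -v                                    # confirm cluster/package state afterwards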


Regards
Sujit
Emil Velez
Honored Contributor

Re: Two node SG cluster with one network

The reason the package did not shut down on the failed node is this: how would the other node know when the package had been shut down, so that it could safely start it?

So this is how it works. In a two-node cluster, if a node cannot stay in the cluster, it races for the lock disk. The node that gets the lock disk forms the cluster and continues; the node that does not get the lock disk cannot form a cluster. It cannot shut the package down cleanly because it has no way to tell the other node that the package has stopped before the other node starts it, so the only way the failed node can make sure it is no longer writing to the disk is to panic.

The node that formed the cluster knows it is the only survivor, but it cannot know whether the other node ever finished a stop script. It is the assumption that the other node panicked (if it was still alive) that allows the surviving node to simply start the package.

I hope this makes sense and helps