Operating System - HP-UX

How does a two-node cluster work?

 
zhaogui
Super Advisor


I was wondering: in a two-node cluster environment, if one node hangs, how can the other node force a failover and take over the cluster lock disk? I believe that when a node hangs it is still holding on to the cluster lock disk, so how can the other node take it over?
Thanks,
Kenny Chau
Trusted Contributor

Re: How does a two-node cluster work?

Hi,

Hope this document can answer your questions.

http://docs.hp.com/hpux/onlinedocs/B3936-90024/B3936-90024.html

Hope this helps.
Kenny.
James R. Ferguson
Acclaimed Contributor

Re: How does a two-node cluster work?

Hi:

If the normal communication heartbeat between the nodes in a cluster ceases, the 'cmcld' daemon on *both* hosts will attempt to obtain control of the cluster lock disk. In this "race", the first node to reach the lock disk marks it as its own. When the other node sees this update, it performs a TOC (Transfer of Control = reboot), leaving the node that won the race as the package owner.
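
To make the race concrete, here is a minimal Python sketch of the arbitration idea, modelling the lock disk as a hypothetical first-writer-wins record (the ClusterLockDisk class and node names are illustrative, not part of ServiceGuard):

import threading

class ClusterLockDisk:
    """Hypothetical stand-in for the lock area on the shared disk:
    the first node to write its name there owns the lock."""
    def __init__(self):
        self._atomic = threading.Lock()   # models atomic disk I/O
        self.owner = None

    def try_acquire(self, node):
        with self._atomic:
            if self.owner is None:
                self.owner = node         # mark the lock disk as ours
                return True
            return False                  # the other node got there first

def on_heartbeat_loss(node, lock):
    # Both cmcld daemons run this race when the heartbeat ceases.
    if lock.try_acquire(node):
        print(node, "won the race: re-forms the cluster, owns the packages")
    else:
        print(node, "lost the race to", lock.owner, "-> TOC (reboot)")

lock = ClusterLockDisk()
for node in ("node_a", "node_b"):
    threading.Thread(target=on_heartbeat_loss, args=(node, lock)).start()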

Regards!

...JRF...
Kenny Chau
Trusted Contributor

Re: How does a two-node cluster work?

Or this document may help.

http://docs.hp.com/hpux/onlinedocs/B7491-90001/00/00/96-con.html

Hope this helps.
Kenny.
Steven Sim Kok Leong
Honored Contributor

Re: How does a two-node cluster work?

Hi,

Btw, you can have two cluster lock disks using the SECOND_CLUSTER_LOCK_VG and SECOND_CLUSTER_LOCK_PV parameters in the cluster configuration file.
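
For illustration, here is a sketch of how both lock disks might appear in the ASCII cluster configuration file; the cluster name, volume groups, IP addresses, and device files below are hypothetical:

# Both lock volume groups must be reachable from both nodes; each node
# lists the device file through which it reaches each lock disk.
CLUSTER_NAME             two_node
FIRST_CLUSTER_LOCK_VG    /dev/vglock1
SECOND_CLUSTER_LOCK_VG   /dev/vglock2

NODE_NAME                node_a
  NETWORK_INTERFACE        lan0
    HEARTBEAT_IP           192.168.1.1
  FIRST_CLUSTER_LOCK_PV    /dev/dsk/c1t1d0
  SECOND_CLUSTER_LOCK_PV   /dev/dsk/c2t1d0

NODE_NAME                node_b
  NETWORK_INTERFACE        lan0
    HEARTBEAT_IP           192.168.1.2
  FIRST_CLUSTER_LOCK_PV    /dev/dsk/c1t1d0
  SECOND_CLUSTER_LOCK_PV   /dev/dsk/c2t1d0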

If you are designing a fault-tolerant cluster, then one failure mode you should be concerned with is split-brain syndrome.

Without an arbitrator, you would get a split brain if the following occurred simultaneously:
1) the heartbeat fails, and
2) the link from server A (the primary node) to cluster lock disk B fails, while the link from server B (the secondary node) to cluster lock disk A fails.
Each node can then reach only "its own" lock disk, so both nodes win a lock and both try to re-form the cluster.

Note that the split brain syndrome can cause data inconsistency. According to HP documents, planning different physical routes for both network and data connections or adequately protecting the physical routes greatly reduces the possibility of split brain syndrome. Also remember that the cluster lock disks should be separately powered, if possible.

If you want a fault-tolerant architecture which avoids the split brain syndrome, you will need at least one arbitrator node. Arbitrators provide functionality like that of the cluster lock disk, and act as tie-breakers for a cluster quorum in case all of the nodes in one data center go down at the same time.
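
A minimal sketch of the tie-breaking rule, assuming simple vote counting (real ServiceGuard quorum handling has additional rules):

def partition_survives(votes, total_votes, wins_tie_breaker=False):
    """A partition may re-form the cluster with more than 50% of the votes.
    On an exact 50% split, the tie-breaker (cluster lock or arbitrator)
    decides; the losing partition TOCs."""
    if 2 * votes > total_votes:
        return True
    if 2 * votes == total_votes:
        return wins_tie_breaker
    return False

# A two-node cluster splits 1/1: only the node that wins the cluster
# lock survives; without a tie-breaker neither side may safely continue.
assert partition_survives(1, 2, wins_tie_breaker=True)
assert not partition_survives(1, 2, wins_tie_breaker=False)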

Hope this helps. Regards.

Steven Sim Kok Leong
Sanjay_6
Honored Contributor

Re: How does a two-node cluster work?

Hi,

Try the MC/SG FAQ,

http://docs.hp.com/hpux/onlinedocs/ha/haFAQindex2.html

Hope this helps.

Regds
zhaogui
Super Advisor

Re: How does a two-node cluster work?

But if one server has already hung, how can it reboot/TOC by itself? In this case, will cmcld be able to TOC this node? If not, how can this node release all the shared disks it has been occupying?
Steven Sim Kok Leong
Honored Contributor

Re: How does a two-node cluster work?

Hi,

MC/ServiceGuard is a cluster solution, but no clustering solution in the world can take care of all failure cases and conditions. If the OS on the primary node somehow becomes corrupted and is in an unstable state in which the failover conditions are not yet met, your secondary node will not take over. Examples are a disk failure or data corruption.

As such, it is important to have a complete fault-tolerant architecture which includes hardware RAID arrays (e.g. a SAN solution).

Hope this helps. Regards.

Steven Sim Kok Leong

Re: How does a two-node cluster work?

ServiceGuard can in most situations cause hung nodes to TOC themselves. This is achieved using a timer within the kernel, referred to in the SG documentation as the kernel safety timer. The timer is constantly counting down to zero within the kernel, but every time cmcld gets some CPU cycles, it resets the timer. That way, if the system hangs so badly that cmcld never gets CPU cycles to reset it, the kernel counts the timer down to zero and causes the node to TOC.
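
A minimal sketch of that idea, assuming a hypothetical countdown that a stand-in daemon must keep rearming; the real timer lives inside the HP-UX kernel and ends in a TOC, not a normal process exit:

import threading, time

SAFETY_TIMEOUT = 5.0                      # seconds before the node would TOC
deadline = time.monotonic() + SAFETY_TIMEOUT

def cmcld_stand_in(iterations=3):
    """Each time the daemon gets CPU cycles it rearms the countdown.
    It stops after a few iterations, simulating a hang (or a killed cmcld)."""
    global deadline
    for _ in range(iterations):
        deadline = time.monotonic() + SAFETY_TIMEOUT
        time.sleep(1.0)

def kernel_safety_timer():
    """Counts down independently of the daemon; a hung system cannot stop it."""
    while time.monotonic() < deadline:
        time.sleep(0.1)
    raise SystemExit("safety timer expired: the node would TOC here")

threading.Thread(target=cmcld_stand_in, daemon=True).start()
kernel_safety_timer()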

You can simulate this by killing the cmcld daemon - as the daemon can no longer reset the safety timer, the node TOCs shortly afterwards.

HTH

Duncan

I am an HPE Employee