<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: After a reboot in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/after-a-reboot/m-p/4517232#M673834</link>
    <description>If 2 of 3 nodes go down unexpectedly, the 3rd node will not continue running Serviceguard, as dictated by the cluster reformation protocol.&lt;BR /&gt;See page 117 in the latest "Managing Serviceguard" manual: &lt;A href="http://docs.hp.com/en/B3936-90143/B3936-90143.pdf" target="_blank"&gt;http://docs.hp.com/en/B3936-90143/B3936-90143.pdf&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;The rule of thumb is:&lt;BR /&gt;&amp;gt;50%&lt;BR /&gt;If, after a sudden loss of the heartbeat (HB) connection to the other nodes, the remaining nodes form a majority subset of the original cluster, they automatically reform a cluster, continue operating the packages currently running on these nodes, and adopt the dead nodes' packages.&lt;BR /&gt;&lt;BR /&gt;=50%&lt;BR /&gt;If, after an HB outage, an even split occurs between active nodes, arbitration in the form of a quorum server or cluster lock disk must be sought to receive permission (or denial) to reform a cluster.  The first side to contact the arbitrator reforms the cluster.  The last side to reach the arbitration device must TOC/reboot to ensure data integrity.&lt;BR /&gt;&lt;BR /&gt;&amp;lt;50% (your case)&lt;BR /&gt;If, after an HB outage, the remaining node(s) find themselves in a minority subset of the original cluster, they must TOC/reboot to ensure data integrity.  The assumption here is that a majority of nodes survived the HB failure and will take control.  This is necessitated by a choice as to how to handle an uneven split between active nodes.&lt;BR /&gt;&lt;BR /&gt;--------&lt;BR /&gt;If, however, the scenario you describe is the result of graceful exits (cmhaltnode) by 2 nodes in sequence, leaving the remaining node as the sole operator, normal package failover protocol dictates whether package failover to the last node occurs.&lt;BR /&gt;&lt;BR /&gt;Remember that each package is configured with either a list of adoptive nodes or a * in the NODE_NAME parameter.
If a list, the adoptive node list may be a subset of all nodes, and the list dictates failover order in the event of a node departure.  &lt;BR /&gt;If a *, any node may be an adoptive node for the package.  &lt;BR /&gt;If FAILOVER_POLICY is configured to MIN_PACKAGE_NODE, the package is moved to the remaining node with the fewest packages on it.  &lt;BR /&gt;&lt;BR /&gt;Ultimately, if 2 nodes leave the cluster gracefully and AUTO_RUN and node switching are enabled for their packages, the packages will halt on the departing node(s) and be started on the last node.&lt;BR /&gt;&lt;BR /&gt;+NOTE: upper-case package parameters denote a package configured in legacy format.  Change to lower-case for the modular package format.</description>
    <pubDate>Tue, 20 Oct 2009 10:24:56 GMT</pubDate>
    <dc:creator>Stephen Doud</dc:creator>
    <dc:date>2009-10-20T10:24:56Z</dc:date>
    <item>
      <title>After a reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/after-a-reboot/m-p/4517231#M673833</link>
      <description>In a three-node cluster, two nodes go down, so all the packages would fail over to the third node.&lt;BR /&gt;&lt;BR /&gt;If the third node then reboots, would all the packages start on the third node? In my view, yes, they should.&lt;BR /&gt;&lt;BR /&gt;I hope quorum would not come into the picture here.</description>
      <pubDate>Tue, 20 Oct 2009 09:17:42 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/after-a-reboot/m-p/4517231#M673833</guid>
      <dc:creator>Buds</dc:creator>
      <dc:date>2009-10-20T09:17:42Z</dc:date>
    </item>
    <item>
      <title>Re: After a reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/after-a-reboot/m-p/4517232#M673834</link>
      <description>If 2 of 3 nodes go down unexpectedly, the 3rd node will not continue running Serviceguard, as dictated by the cluster reformation protocol.&lt;BR /&gt;See page 117 in the latest "Managing Serviceguard" manual: &lt;A href="http://docs.hp.com/en/B3936-90143/B3936-90143.pdf" target="_blank"&gt;http://docs.hp.com/en/B3936-90143/B3936-90143.pdf&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;The rule of thumb is:&lt;BR /&gt;&amp;gt;50%&lt;BR /&gt;If, after a sudden loss of the heartbeat (HB) connection to the other nodes, the remaining nodes form a majority subset of the original cluster, they automatically reform a cluster, continue operating the packages currently running on these nodes, and adopt the dead nodes' packages.&lt;BR /&gt;&lt;BR /&gt;=50%&lt;BR /&gt;If, after an HB outage, an even split occurs between active nodes, arbitration in the form of a quorum server or cluster lock disk must be sought to receive permission (or denial) to reform a cluster.  The first side to contact the arbitrator reforms the cluster.  The last side to reach the arbitration device must TOC/reboot to ensure data integrity.&lt;BR /&gt;&lt;BR /&gt;&amp;lt;50% (your case)&lt;BR /&gt;If, after an HB outage, the remaining node(s) find themselves in a minority subset of the original cluster, they must TOC/reboot to ensure data integrity.  The assumption here is that a majority of nodes survived the HB failure and will take control.  This is necessitated by a choice as to how to handle an uneven split between active nodes.&lt;BR /&gt;&lt;BR /&gt;--------&lt;BR /&gt;If, however, the scenario you describe is the result of graceful exits (cmhaltnode) by 2 nodes in sequence, leaving the remaining node as the sole operator, normal package failover protocol dictates whether package failover to the last node occurs.&lt;BR /&gt;&lt;BR /&gt;Remember that each package is configured with either a list of adoptive nodes or a * in the NODE_NAME parameter.
If a list, the adoptive node list may be a subset of all nodes, and the list dictates failover order in the event of a node departure.  &lt;BR /&gt;If a *, any node may be an adoptive node for the package.  &lt;BR /&gt;If FAILOVER_POLICY is configured to MIN_PACKAGE_NODE, the package is moved to the remaining node with the fewest packages on it.  &lt;BR /&gt;&lt;BR /&gt;Ultimately, if 2 nodes leave the cluster gracefully and AUTO_RUN and node switching are enabled for their packages, the packages will halt on the departing node(s) and be started on the last node.&lt;BR /&gt;&lt;BR /&gt;+NOTE: upper-case package parameters denote a package configured in legacy format.  Change to lower-case for the modular package format.</description>
      <pubDate>Tue, 20 Oct 2009 10:24:56 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/after-a-reboot/m-p/4517232#M673834</guid>
      <dc:creator>Stephen Doud</dc:creator>
      <dc:date>2009-10-20T10:24:56Z</dc:date>
    </item>
    <item>
      <title>Re: After a reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/after-a-reboot/m-p/4517233#M673835</link>
      <description>Assuming the policies are automatic failover=yes and the 3rd node is configured as an adoptive node, all packages would fail over to the third node.&lt;BR /&gt;Now, if the third node also reboots while the first two are still down, would all the packages start on the third node?</description>
      <pubDate>Wed, 21 Oct 2009 11:11:20 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/after-a-reboot/m-p/4517233#M673835</guid>
      <dc:creator>Buds</dc:creator>
      <dc:date>2009-10-21T11:11:20Z</dc:date>
    </item>
    <item>
      <title>Re: After a reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/after-a-reboot/m-p/4517234#M673836</link>
      <description>After rebooting, a server runs '/sbin/init.d/cmcluster start' if /etc/rc.config.d/cmcluster contains AUTOSTART_CMCLD=1.&lt;BR /&gt;The cmcluster script eventually performs cmrunnode.  That command has 2 modes of operation:&lt;BR /&gt;1) If partner nodes are reachable and running Serviceguard, join that cluster.&lt;BR /&gt;2) If partner nodes are reachable but not running a cluster, enter the cluster formation stage, waiting for -ALL- other nodes to "vote" to start the cluster.  This wait stage lasts for the AUTO_START_TIMEOUT period (default 10 minutes, as specified in the cluster configuration file) before the attempt terminates.&lt;BR /&gt;&lt;BR /&gt;In your scenario of 2 nodes down, the 3rd node running boot-time scripts will, upon running /sbin/rc3.d/S800cmcluster, attempt to form a cluster, wait 10 minutes, and then cease attempting to form a cluster with its peers.&lt;BR /&gt;&lt;BR /&gt;The reason is that this one node cannot know the status of the other cluster member nodes - they may be running, and the rebooting node may simply be unable to contact them due to a local heartbeat NIC failure.&lt;BR /&gt;</description>
      <pubDate>Thu, 22 Oct 2009 11:20:39 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/after-a-reboot/m-p/4517234#M673836</guid>
      <dc:creator>Stephen Doud</dc:creator>
      <dc:date>2009-10-22T11:20:39Z</dc:date>
    </item>
  </channel>
</rss>