Operating System - HP-UX

Naj
Valued Contributor

Failover -clustering overview

Hi expert,

Can someone give me a short overview of clustering? I am new to the topic. I have a sample cluster configured on one server. What I can see is that one of the packages is down, and I would like to know what steps to take to bring it back up and running.

Below is my understanding (please correct it if it is wrong):

1. As I understand it, each package sends a "live" signal by ping to the cluster (omniback; I assume this is the master).
2. If one of the nodes does not respond, it is marked as down and halted.
3. Another node takes over for the down node and runs as usual until the failed node comes back up.

Please correct me if the explanation above is wrong.

Simple output:

xst002a:/root/home/root (root) cmviewcl

CLUSTER        STATUS
omniback       up

  NODE         STATUS       STATE
  xst002a      up           running
  xst002b      up           running

UNOWNED_PACKAGES

  PACKAGE      STATUS       STATE        AUTO_RUN     NODE
  xst002       down         halted       disabled     unowned

More details:

xst002a:/root/home/root (root) cmviewcl -v

CLUSTER        STATUS
omniback       up

  NODE         STATUS       STATE
  xst002a      up           running

    Cluster_Lock_LVM:
    VOLUME_GROUP     PHYSICAL_VOLUME      STATUS
    /dev/vglock      /dev/dsk/c12t0d1     up

    Network_Parameters:
    INTERFACE    STATUS       PATH         NAME
    PRIMARY      up           0/4/2/0      lan1
    PRIMARY      up           0/6/1/0      lan3

  NODE         STATUS       STATE
  xst002b      up           running

    Cluster_Lock_LVM:
    VOLUME_GROUP     PHYSICAL_VOLUME      STATUS
    /dev/vglock      /dev/dsk/c10t0d1     up

    Network_Parameters:
    INTERFACE    STATUS       PATH         NAME
    PRIMARY      up           0/4/2/0      lan1
    PRIMARY      up           0/6/1/0      lan3

UNOWNED_PACKAGES

  PACKAGE      STATUS       STATE        AUTO_RUN     NODE
  xst002       down         halted       disabled     unowned

    Policy_Parameters:
    POLICY_NAME      CONFIGURED_VALUE
    Failover         configured_node
    Failback         manual

    Script_Parameters:
    ITEM         STATUS       NODE_NAME    NAME
    Subnet       up           xst002a      136.4.200.0
    Subnet       up           xst002a      136.4.3.0
    Subnet       up           xst002b      136.4.200.0
    Subnet       up           xst002b      136.4.3.0

    Node_Switching_Parameters:
    NODE_TYPE    STATUS       SWITCHING    NAME
    Primary      up           enabled      xst002a
    Alternate    up           enabled      xst002b

Thanks in advance

BR
Naj


____________________________________________
:: Really appreciate if you could assign some points.
:: Don't know how to assign point? Click the KUDOS! star!
Matti_Kurkela
Honored Contributor
Solution

Re: Failover -clustering overview

If you are going to work as a Serviceguard cluster administrator, you *really* should read the "Managing Serviceguard" book. It is downloadable as a PDF like all the other Serviceguard documents, here:

http://www.hp.com/go/hpux-serviceguard-docs

The "Managing Serviceguard" book begins with an introduction on cluster concepts and structure of Serviceguard. Later chapters include step-by-step instructions on designing and configuring a cluster, and also step-by-step instructions for basic cluster maintenance operations.

Each edition of the book is meant for a specific version of Serviceguard: run "cmversion" to identify your Serviceguard version, then pick the respective edition of the book.

Seriously, read the book. It has all the information you need, presented very clearly.

-----

1.) This cluster named "omniback" has two nodes: xst002a and xst002b. Both are up and running the Serviceguard cluster software at the moment.
The "live" signal is not exactly a ping; it is known as "cluster heartbeat". It is sent by the cluster software, not by the package.

Currently, the package xst002 (= the application that the cluster is supposed to be running) is not running on any cluster node.

2.) It's more complicated than that. If, for example, xst002a fails to get any response from xst002b, it might be because xst002b is dead... or it might be because the network is broken but xst002b is just fine.

In the latter case, xst002b will also notice it has not got any response from xst002a.

If both nodes simply assumed the other node had died, both would end up trying to run the package. This would mean two nodes accessing the same shared disks and presenting the same package IP address simultaneously. This would be a "split-brain" situation, and it would lead to data corruption and other problems. That must be avoided at all costs.

Therefore, if xst002a sees xst002b is not responding (or vice versa), it will first attempt to access the cluster lock.

In a normal situation, the cluster lock includes a record that says "both nodes xst002a and xst002b are members of the cluster". If connection to xst002b has been lost, xst002a will attempt to replace this record with "only xst002a is a member of the cluster". If xst002b is still alive (i.e. the loss of connection is caused by network failure), it attempts to access the cluster lock too... but if xst002a happened to update it first, then xst002b has "lost the cluster lock" and it "knows" that xst002a will be taking over the cluster package at any moment.

If the node that lost the cluster lock is still alive, it will intentionally crash itself, to make absolutely sure it will not write anything more to the shared disks.
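
The tie-breaking logic described above can be pictured with a toy simulation. This is only an illustrative sketch, not how Serviceguard is implemented: the `ClusterLock` class and the `try_claim` and `resolve_partition` names are hypothetical, and the real cluster lock lives on shared storage rather than in memory.

```python
class ClusterLock:
    """Toy stand-in for the cluster lock: the first claimant wins."""

    def __init__(self, members):
        self.members = set(members)   # normal state: all nodes are members
        self._claimed = False

    def try_claim(self, node):
        """Return True if `node` wins the lock, False if it arrived too late."""
        if self._claimed:
            return False              # another node already re-formed the cluster
        self._claimed = True
        self.members = {node}         # "only <node> is a member of the cluster"
        return True


def resolve_partition(lock, contenders):
    """Nodes that lost contact race for the lock, in order of arrival."""
    outcome = {}
    for node in contenders:
        if lock.try_claim(node):
            outcome[node] = "takes over packages"
        else:
            # The loser must stop so it cannot write to the shared disks.
            outcome[node] = "halts itself"
    return outcome


lock = ClusterLock(["xst002a", "xst002b"])
result = resolve_partition(lock, ["xst002a", "xst002b"])
print(result)
```

In this run xst002a reaches the lock first, so it keeps the cluster and xst002b halts itself. Swapping the order of the contenders swaps the roles, which is the point: exactly one side survives, whichever it is.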

3.) Nodes will only take over packages; a Serviceguard node will never take over the identity of another Serviceguard node.

-----

The xst002 package is currently down (halted). Since its Node_Switching_Parameters section in "cmviewcl -v" output indicates SWITCHING is enabled for both nodes, it is possible the package was halted intentionally, using the cmhaltpkg command.

To restart the package, you can run:

cmrunpkg xst002

It will start the package on the same node the command is entered on.

Alternatively, you can specify the node you wish to start the package on. To start the package on xst002a, you would use this command:

cmrunpkg -n xst002a xst002

After the cmrunpkg command, the package starts up, but its AUTO_RUN attribute will still be disabled (it was automatically disabled when the package was halted). This means the package will not fail over automatically. To "re-arm" the failover mechanism, you must enter another command after starting the package:

cmmodpkg -e xst002

For some reason, I've seen many people miss this step. Serviceguard will provide automatic failover (and thus High Availability) for the package *only* when the AUTO_RUN is enabled.
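
To make the second step harder to forget, the two commands can be kept together in a small wrapper. The following is a minimal sketch, assuming Python is available on the node; the function names `restart_commands` and `restart_package` are made up for this example, and only the cmrunpkg and cmmodpkg invocations shown above are used. By default it only prints the commands (dry run).

```python
import subprocess


def restart_commands(package, node=None):
    """Build the two commands: start the package, then re-enable AUTO_RUN."""
    run = ["cmrunpkg"]
    if node is not None:
        run += ["-n", node]                  # optional: start on a specific node
    run.append(package)
    enable = ["cmmodpkg", "-e", package]     # re-arm automatic failover
    return [run, enable]


def restart_package(package, node=None, dry_run=True):
    """Print the commands; execute them only when dry_run is False."""
    for cmd in restart_commands(package, node):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


# Dry run: prints the commands without touching the cluster.
restart_package("xst002", node="xst002a")
```

Calling `restart_package("xst002", dry_run=False)` as root on a cluster node would actually run both commands in order, stopping if cmrunpkg fails so that AUTO_RUN is not enabled for a package that did not start.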

MK
Naj
Valued Contributor

Re: Failover -clustering overview

Hi Matti Kurkela,

Thanks for the explanation. Based on what you wrote, I would say that omniback is a cluster which consists of packages such as xst002, xst002a and xst002b (correct me if I am wrong).

I need to understand some of the terms used in Serviceguard. They sometimes confuse me, so here are some questions:

1. Is cluster = package (stored on the same server)?
2. How many packages can run on one node?
3. What is the maximum number of nodes in a cluster?
4. Is the cluster heartbeat software that monitors both nodes and can be configured to detect failover?

Troubleshooting question:

1. In some cases we cannot bring a node back to normal. Is there any debug command or log file we can check?

Thanks

BR
Naj


g3jza
Esteemed Contributor

Re: Failover -clustering overview

Hi:

1. Is cluster = package (stored on the same server)?
No, a cluster is a collection of nodes on which the package or packages run.

2. How many packages can run on one node?
It depends on your version of Serviceguard and on the cluster parameter MAX_CONFIGURED_PACKAGES, so you should read the user guide for your version.

3. What is the maximum number of nodes in a cluster?
A cluster can contain up to 16 nodes, but the actual maximum depends on the specific storage and volume manager used.

4. Is the cluster heartbeat software that monitors both nodes and can be configured to detect failover?
The cluster heartbeat is a "signal" to the other members of the cluster that a particular node is up. If the heartbeat (HB) from a node is lost, that node is considered down (in the cluster) and a cluster re-formation occurs with the new number of active nodes. The cluster HB is not standalone software; it is part of the Serviceguard (SG) software bundle.
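
As a rough illustration of the heartbeat idea, here is a minimal sketch of timeout-based detection. The timestamps and the 5-second timeout are invented for the example and do not reflect Serviceguard's actual timing parameters; the function name `nodes_considered_down` is also hypothetical. As explained earlier in the thread, a missed heartbeat alone does not prove a node is dead, which is why the cluster lock is consulted before any takeover.

```python
def nodes_considered_down(last_heartbeat, now, timeout):
    """Return the nodes whose last heartbeat is older than `timeout` seconds."""
    return sorted(
        node for node, seen in last_heartbeat.items()
        if now - seen > timeout
    )


# Hypothetical times (in seconds) at which each node's heartbeat was last seen.
last_heartbeat = {"xst002a": 100.0, "xst002b": 91.0}

down = nodes_considered_down(last_heartbeat, now=101.0, timeout=5.0)
print(down)
```

Here xst002b was last heard from 10 seconds ago, which exceeds the assumed timeout, so it is the only node reported; xst002a was seen 1 second ago and is treated as alive.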

Troubleshooting question:

1. In some cases we cannot bring a node back to normal. Is there any debug command or log file we can check?

You can see the overall status of the cluster and its packages with: cmviewcl -v

If you have problems bringing the node back into the cluster, you should always check /var/adm/syslog/syslog.log on both nodes for more information.
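
If you want to check the status from a script, the package table in plain cmviewcl output can be parsed with a few lines of code. This is a minimal sketch that assumes the columns are separated by whitespace and that no field contains a space; the sample text is copied from the output posted earlier in this thread, and the helper names `parse_packages` and `needs_attention` are made up for the example.

```python
def parse_packages(cmviewcl_output):
    """Extract the rows of the PACKAGE table from plain `cmviewcl` output."""
    packages = []
    header = None
    for line in cmviewcl_output.splitlines():
        fields = line.split()
        if not fields:
            header = None                 # a blank line ends the current table
        elif fields[0] == "PACKAGE":
            header = fields               # PACKAGE STATUS STATE AUTO_RUN NODE
        elif header is not None and len(fields) == len(header):
            packages.append(dict(zip(header, fields)))
    return packages


def needs_attention(pkg):
    """Flag a package that is not up or whose AUTO_RUN is not enabled."""
    return pkg["STATUS"] != "up" or pkg["AUTO_RUN"] != "enabled"


sample = """\
CLUSTER STATUS
omniback up

NODE STATUS STATE
xst002a up running
xst002b up running

UNOWNED_PACKAGES

PACKAGE STATUS STATE AUTO_RUN NODE
xst002 down halted disabled unowned
"""

flagged = [p["PACKAGE"] for p in parse_packages(sample) if needs_attention(p)]
print(flagged)
```

With the sample above, xst002 is flagged because it is down and its AUTO_RUN is disabled. The AUTO_RUN check is included deliberately: a package can be up and still have failover switched off.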
Naj
Valued Contributor

Re: Failover -clustering overview

Hi expert,

I've checked /var/adm/syslog/syslog.log, but it does not give details about the issue and the entries are mixed up with other errors. Is it possible to log in to the cluster (in this case omniback is the cluster / main host; correct me if I am wrong) and check an error log that is more specific to the issue? I don't know where it is located or whether it is visible.
If you have steps to troubleshoot this, please state them here. I would much appreciate it.

xst002a:/root/home/root (root) cmviewcl

CLUSTER        STATUS
omniback       up

  NODE         STATUS       STATE
  xst002a      up           running
  xst002b      up           running

UNOWNED_PACKAGES

  PACKAGE      STATUS       STATE        AUTO_RUN     NODE
  xst002       down         halted       disabled     unowned

Thanks

BR
Naj

g3jza
Esteemed Contributor

Re: Failover -clustering overview

What SG version are you using?

Newer releases of SG write the package log to /var/adm/cmcluster/log/package_name.log. In older releases, the default location is /etc/cmcluster/package_name/package_name.cntl.log.

Try running the package with the cmrunpkg command, then check the log file.

Every step of starting or halting the package is logged to that file. Please post the output of that log file.
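
If you are not sure which layout your release uses, a small helper can list both default locations for a given package. This is only a sketch: it assumes that "package_name" in the paths above is replaced by the actual package name (xst002 here) and that the defaults have not been changed in the package configuration. The function names are made up for the example.

```python
import os


def candidate_package_logs(package):
    """Return the two default log locations mentioned above, newer layout first."""
    return [
        f"/var/adm/cmcluster/log/{package}.log",          # newer releases
        f"/etc/cmcluster/{package}/{package}.cntl.log",   # older releases (default)
    ]


def existing_logs(package, exists=os.path.exists):
    """Keep only the candidate files that are present on this node."""
    return [path for path in candidate_package_logs(package) if exists(path)]


for path in candidate_package_logs("xst002"):
    print(path)
```

Running `existing_logs("xst002")` on each node shows which of the two files is actually there. Check both nodes, since the log is written on the node where the package start or halt was attempted.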