topic Re: Node failure in Operating System - HP-UX

Node failure

Cristi BODNARIUC — Fri, 11 Jul 2003 09:42:11 GMT

Hi,

I installed a two-node cluster with SG A.11.14, one shared external SCSI drive (configured as the lock device), and only one LAN on each node (which is also the heartbeat LAN).

I am trying to test a failover, but things do not go as I expect.

I have 2 cases:

1) I pull the network cable out of node1
2) I power off node1

In both cases the second node reboots.
I expected it to take over all the resources previously on node1.

After the reboot, node2 cannot even form the cluster, complaining that it cannot get the OS version of node1. Shouldn't it go on running the cluster?

Do I have to do a special configuration?
What could be misconfigured?

Thanks,
Cristi

Re: Node failure

melvyn burnard — Fri, 11 Jul 2003 09:46:16 GMT

Do you not have a standby LAN for the heartbeat? If you do not, you may very well see the incorrect node TOC.
How are the nodes connected via LANs, and how is the SCSI connected? What are the disc and controller SCSI addresses? What do the syslogs and OLDsyslogs show on each node?
Read the manuals at http://docs.hp.com/hpux/ha for an idea of how to configure the cluster.

Re: Node failure

Bernhard Mueller — Fri, 11 Jul 2003 09:53:57 GMT

Hello,

That is why you should have a physically separate heartbeat LAN.

If you have no LAN communication between the nodes at all, they both race for the lock disk to decide which one has to TOC. The node that gets the lock carries on as a one-node cluster; the other one TOCs. So the chances are only 50% that the node you expected to TOC is the one that does.

This is called arbitration. There is a lot of information about it in the manuals at docs.hp.com.

Regards
Bernhard
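
The race for the lock disk described above can be pictured with a toy shell model. This is only an analogy, not how Serviceguard is implemented: a directory created with mkdir stands in for the lock disk, and the path in LOCK and the function name arbitrate are made up for illustration.

```shell
#!/bin/sh
# Toy model of lock-disk arbitration (an analogy only, not Serviceguard code).
# mkdir succeeds for exactly one caller, so it plays the role of the lock disk.
LOCK=${LOCK:-/tmp/cluster_lock.$$}   # hypothetical stand-in for the lock disk

arbitrate() {   # $1 = node name
    if mkdir "$LOCK" 2>/dev/null; then
        echo "$1: got the lock, carries on as a one-node cluster"
    else
        echo "$1: lost the race, would TOC"
    fi
}

# Both nodes lose the heartbeat at the same time and race for the lock.
arbitrate node1 &
arbitrate node2 &
wait
rmdir "$LOCK"   # clean up the stand-in lock
```

Which node wins is not predictable from run to run, which is the 50% mentioned above. In this model a node that cannot reach the lock at all can never win either.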

Re: Node failure

Cristi BODNARIUC — Fri, 11 Jul 2003 10:51:02 GMT

Hi,

I thought that in the first case (when taking the network cable out of node1) the problem could come from the fact that there is no dedicated heartbeat path.

But when switching the power off on node1 I think it is not a heartbeat problem anymore and the second node should not TOC.

Maybe I am still missing something. I will keep reading the manuals :)

The 2 nodes are connected to the company network (both are connected to a switch).
The external disk has 2 ends, one connected to node1 and the other to node2. It is powered separately.

After the reboot of node2 I can start the cluster with cmruncl -n node2 and the cluster runs well.

Re: Node failure

Karthik S S — Fri, 11 Jul 2003 11:11:35 GMT

Hi,

If you are running short of NICs, you had better configure the heartbeat over RS232. Refer to the SG documentation on how to set this up. A quick summary of the heartbeat requirements can also be found at:

http://www.netsysco.com/pdf/Manuals/Sg/HeartbeatReq.pdf

Regards,
karthik S S

Re: Node failure

melvyn burnard — Fri, 11 Jul 2003 11:25:33 GMT

Again, what do your syslog and OLDsyslog files tell you?
Is your cluster lock disc actually working? What type of disc is it?
And I would not recommend a serial heartbeat unless you really cannot afford at least one more LAN card.

Re: Node failure

Bernhard Mueller — Fri, 11 Jul 2003 11:48:45 GMT

Cristi,

I believe there could be a problem with the binary cmclconfig files, since your assumptions for case 2 are correct. So delete them and do another cmcheckconf / cmapplyconf.

One other thing to check is that your .rhosts or cmclnodelist includes BOTH nodes on BOTH nodes. That could be why node2 cannot form the cluster while cmruncl -n node2 works.

Regards,
Bernhard
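
For reference, the cmcheckconf / cmapplyconf cycle suggested above might look roughly like the sketch below. The ASCII file name /etc/cmcluster/cluster.ascii is an assumption (use whatever file the cluster was originally built from), and the file is assumed to already point at the current lock disk (the FIRST_CLUSTER_LOCK_PV entries). The run wrapper and the DRY_RUN switch are helpers invented for this example, so that by default the commands are only printed.

```shell
#!/bin/sh
# Sketch only: re-verify and re-apply the cluster configuration.
CONF=${CONF:-/etc/cmcluster/cluster.ascii}   # assumed ASCII config file name
DRY_RUN=${DRY_RUN:-1}                        # 1 = only print the commands

run() {   # hypothetical helper: print the command, or execute it
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reapply_cluster_config() {
    # Halt the cluster first; carry on if it was not running.
    run cmhaltcl -f -v || echo "cmhaltcl failed (cluster not running?), continuing"
    # Verify the ASCII file, then build and distribute the binary config.
    run cmcheckconf -v -C "$CONF" || return 1
    run cmapplyconf -v -C "$CONF" || return 1
    # Start the cluster again and show its state.
    run cmruncl -v || return 1
    run cmviewcl -v
}

reapply_cluster_config
```

Run it with DRY_RUN=0 on one node to actually execute the commands; cmapplyconf distributes the new binary file to the other node itself.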

Re: Node failure

Cristi BODNARIUC — Tue, 15 Jul 2003 15:54:01 GMT

Hi,

Thank you all for your help.

It seems that the problem was in the binary cluster config file, which I did not recompile/redistribute after I replaced the SCSI disk (with one that has a different SCSI ID).

Now if I shut down a node, the other takes over all the packages.

Thanks,
Cristi