
Re: node fencing

 
Kranti Mahmud
Honored Contributor

node fencing

Dear all,

We have a cluster environment on two different HP blades. We found that both nodes fenced each other and went down with some errors. Attached is the syslog of both nodes. Can anyone explain the exact reason why both nodes went down simultaneously?

Rgds-Kranti
Dont look BACK as U will miss something INFRONT!
Matti_Kurkela
Honored Contributor
Solution

Re: node fencing

From the logs you posted in this and your other thread, I see both your nodes lost contact with each other on Mar 25 11:10:52. In that case, each node will assume it is OK ("I run a program, therefore I am"), and the other node has failed. This is a split-brain situation, and it's a very bad thing in a cluster.

To resolve the split-brain situation, each node attempts to fence the other node (at 11:10:56), but the fence connection fails too.

Before the cluster has time to do anything else, at 11:10:57 the network connections between the nodes start working again.
But as both nodes have already decided "I'm the surviving node, and the other one has problems", the nodes are out of sync with each other. To remedy that, openais sends a "you're out of sync - please reboot" message to the other node:

> Mar 25 11:10:57 bgw-node1 openais[6402]: [MAIN ] Killing node 192.168.11.69 because it has rejoined the cluster with existing state

The problem is, the openais on the other node has made the same decision and done exactly the same on node 1:

> Mar 25 11:10:57 bgw-node1 openais[6402]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart

This causes node 1 to begin a controlled shutdown:
> Mar 25 11:10:58 bgw-node1 shutdown[8222]: shutting down for system halt

Meanwhile, node 2 had put in a fencing request for node 1 at 11:10:53. At 11:11:34, fenced on node 2 confirms that node 1 has been fenced. But because node 1 sent a kill request to node 2's cman at 11:10:57, node 2 is itself performing a shutdown and no longer has a running cman:

> Mar 25 11:11:34 bgw-node2 fenced[6515]: Unable to connect to to cman: Connection refused

On node 2, the clvmd is waiting for DLM locks to clear... but because node 1 has been fenced, and node 2 is shutting down and has already stopped the dlm_controld daemon, there may be some difficulties with that :-)
At 11:14:00, the kernel notices clvmd has been blocked in a (DLM-related) kernel function for more than 120 seconds and produces a call trace for it.

Mar 25 11:10:57 bgw-node2 dlm_controld[6521]: cluster is down, exiting
Mar 25 11:14:00 bgw-node2 kernel: INFO: task clvmd:6577 blocked for more than 120 seconds.


At about 11:15:55, node2 completes its shutdown and begins to reboot.



After node 1 has rebooted, at 11:14:56, node 1 again cannot detect node 2 (because node 2 is still in the process of shutting down), so after waiting for a while, it decides to fence node 2 at 11:15:45. This time, the fencing operation is successful, and the fence daemon on node 1 confirms successful fencing of node 2 at 11:16:26.

At this point, node 1 can be sure it's the only active node in the cluster.

At 11:19:54, node 2 has completed rebooting and rejoins the cluster.

On Mar 25 12:23:18, the network connection between the nodes is lost again. At 12:23:38, both nodes decide to fence each other.

At 12:23:40, the network connection comes back and the nodes again attempt to re-form the cluster, but it's too late: both nodes have again already decided they're the only surviving node. Again each node sends an openais kill request to the other node, and both nodes start a controlled shutdown. This time, neither node successfully fences the other before the cluster daemons are shut down by the openais kill requests.

Apparently both nodes then remain down until node2 is restarted on Mar 26 13:55:20. It forms a cluster alone at 13:55:26. Since the status of node1 is unknown, node2 fences it at 13:56:14. The fencing is confirmed successful at 13:56:28. Node2 starts the bgw service at 13:57:27.

Node1 starts up at 13:59:44. It joins the cluster at 13:59:51. At 14:01:01, node1's startup is complete and bgw service is automatically moved to node1.


I'd say the main problem with your cluster is that both the cluster communication ("heartbeat") connection and the fencing connection are failing intermittently, sometimes simultaneously. A simple fail-over cluster is typically guaranteed to protect against only one fault at a time: such a cluster can usually behave sensibly in many cases where there are two or more faults, but not in all possible cases.

(Guaranteed protection against all possible combinations of two simultaneous failures requires multiply-redundant connections and *very* careful cluster planning. It is usually too complicated and expensive unless the system is actually life-and-death critical, like the fly-by-wire systems of an airplane.)

You should take whatever steps are necessary to make sure the cluster network connections and fencing connections won't be interrupted at the same time.

If all your network connections go through a single switch, you may experience a short but total network blackout each time the network administrator resets the switch for configuration change or firmware update. In that case, the redundancy of your network connections needs to be improved. For example, connect each node to two separate (but linked) switches, using two network interfaces bonded together in active/backup mode.
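To illustrate, an active/backup bond on a RHEL 5 style system could be set up roughly like this (only a sketch; the interface names, addresses and option values are assumptions you'll need to adapt, and on older releases the bonding options go into /etc/modprobe.conf instead of BONDING_OPTS):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.11.68          # example address only
NETMASK=255.255.255.0
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1, cabled to the other switch)
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes

With mode=active-backup only one NIC carries traffic at a time, so the two switches only need to be linked to each other; no special link-aggregation support is required.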


A two-node RHEL cluster can be subject to fencing races, and can get stuck in reboot-and-fence cycles.

To solve that, you need a tie-breaker. One way to do that is to add a third node to your cluster. The third node only needs to be part of the cluster, to participate in the cluster quorum decisions: it does not have to actually run any cluster applications.

Another possible tie-breaker solution would be to configure a quorum disk daemon (qdiskd).
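Roughly, the qdiskd setup looks like this (a sketch only; the device path, label, heuristic address and vote counts are assumptions, so check the qdisk man pages and your own cluster.conf before using anything like it):

# create the quorum disk label on a small shared LUN (run once, on one node)
mkqdisk -c /dev/sdX -l myqdisk

# in /etc/cluster/cluster.conf, give the quorum disk one vote and a heuristic,
# and adjust the cman votes so one node plus the qdisk keeps the cluster quorate:
#   <cman expected_votes="3" two_node="0"/>
#   <quorumd interval="1" tko="10" votes="1" label="myqdisk">
#     <heuristic program="ping -c1 -w1 192.168.11.1" score="1" interval="2"/>
#   </quorumd>

# start the daemon on both nodes
service qdiskd start
chkconfig qdiskd on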

You have both device-mapper-multipath and EMC PowerPath installed on both nodes. This might cause the disk errors: when device-mapper-multipath takes control of a multipathed device, it tries to lock the corresponding /dev/sd* devices so the userspace programs cannot use them. I guess PowerPath might do something similar. Then smartd tries to examine all /dev/sd* devices... and fails.

Don't use both device-mapper-multipath and EMC PowerPath: pick one or the other. Also you might consider configuring smartd so that it monitors local disks only.
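For example, in /etc/smartd.conf you could comment out the DEVICESCAN line and list only the local disks explicitly (the device names below are placeholders; on an HP blade the local disks may appear as /dev/cciss/* devices, depending on the controller and smartmontools version):

# /etc/smartd.conf
#DEVICESCAN                       # disable the automatic scan of all devices
/dev/sda -a                       # monitor the first local disk only
/dev/cciss/c0d0 -a -d cciss,0     # example for an HP Smart Array controller

Then restart the daemon with "service smartd restart" so the new configuration takes effect.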

(If you have an EMC Clarion SAN, then monitoring it with smartd is probably useless: the Clarion will internally monitor its disks, and will produce more meaningful hardware health information of its own. Since the Clarion LUNs don't necessarily have a one-to-one relationship with actual Clarion disks, the SMART data on the LUNs is going to be "virtual" data generated by the Clarion anyway.)

MK
Alzhy
Honored Contributor

Re: node fencing

VERY very well written response Matti!
Kudos man!

BTW, how critical are fencing communications, since ILOs (and other management modules) undoubtedly cannot have a teamed/redundant network?

I am in the process of putting RHCS through its paces, as there's a possibility we could drop Poor Man's Clustering in lieu of RHCS for simple failovers. One issue I am facing is that a simple network restart on one of the nodes of my 2-node cluster forces a fence on both nodes. How can this be avoided?

Sorry for hi-jacking your thread Kranti.
Hakuna Matata.
Matti_Kurkela
Honored Contributor

Re: node fencing

@Alzhy:

>One issue I am facing is that a simple network restart on one of the nodes of my 2-node cluster forces a fence on both nodes. How can this be avoided?

If both nodes get fenced, this suggests you aren't using a quorum disk daemon nor any other tie-breaker in your cluster. You essentially have the same problem Kranti has: the problem looks exactly the same to both nodes (i.e. "the other node does not respond any more"), so both nodes will make the same decision to fence the other node. To break this symmetry, you need a quorum disk daemon or a third node as a tie-breaker.

In RHCS, fencing is the primary solution to the split-brain problem in the event cluster connections are lost, so the fencing connections are rather important. But it's not like "they must never fail"; the important thing is that they should not fail _at the same time_ as the connections that carry the cluster heartbeat.

If the cluster heartbeat connections are unlikely to fail completely, fencing becomes a rare event - like the cluster equivalent of activating the company's Disaster Recovery plan.

If you have more than one network switch available, connect each cluster node with two NICs to two switches and use bonding in Linux to link the physical NICs to a single aggregate interface.

If a node is an active member of a cluster, you don't mess with its network interfaces - the networking is a critical part of the cluster after all.

If you need to restart networking on a node, you'd better make the node leave the cluster in a controlled fashion first. Otherwise the remaining cluster will definitely try to fence the isolated node. (And in a two-node cluster with no tie-breaker, there is no way for the cluster to determine *which* node is the isolated one, so both nodes will try to fence each other.)

In a RedHat cluster, to make the node leave the cluster in a controlled fashion, stop the cluster services in the correct order:

service rgmanager stop
service gfs stop # if GFS is used
service clvmd stop # if CLVM is used
service cman stop

Stopping rgmanager should cause all the services still running on the node to automatically be moved to other nodes.
When you run "service cman stop", the /etc/init.d/cman script runs the appropriate commands to notify the other nodes that this node is leaving the cluster. This makes the other node(s) stop expecting any heartbeat messages from this node, so fencing won't be triggered if the node vanishes from the network afterwards.
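If you want to verify that the node has left cleanly, you can check the membership from the remaining node (just an illustration; the exact output depends on your cluster):

cman_tool nodes    # the departed node should show status "X" (no longer a member)
clustat            # rgmanager's view of the members and services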

Rejoining the cluster is as simple as restarting the cluster services in the reverse order:

service cman start
service clvmd start # if necessary
service gfs start # if necessary
service rgmanager start

(With HP Serviceguard, a single cmhaltnode/cmrunnode command does the same thing... you might want to write a little script to start & stop the cluster services with a single command.)
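A minimal wrapper script could look something like this (a sketch only; it assumes the same service names as above and does no error checking):

#!/bin/sh
# clusternode: leave or rejoin the RedHat cluster with a single command
case "$1" in
  leave)
    service rgmanager stop
    service gfs stop        # only if GFS is used
    service clvmd stop      # only if CLVM is used
    service cman stop
    ;;
  join)
    service cman start
    service clvmd start     # only if CLVM is used
    service gfs start       # only if GFS is used
    service rgmanager start
    ;;
  *)
    echo "Usage: $0 {leave|join}" >&2
    exit 1
    ;;
esac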

MK
Alzhy
Honored Contributor

Re: node fencing

Matti.. As usual - thanks!

So you suggest that in a 2-node RHCS cluster we either add a 3rd tie-breaker node (for show really, as it does not have to carry any service/package) OR a quorum disk.

Do you have a quick howto on how to set up a quorum disk, sir?

Thanks!

N.
Hakuna Matata.
Matti_Kurkela
Honored Contributor

Re: node fencing

You'll find a lot if you Google with keywords "redhat cluster quorum disk":

http://www.linuxdynasty.org/howto-setup-a-quorum-disk.html

http://magazine.redhat.com/2007/12/19/enhancing-cluster-quorum-with-qdisk/

In general, the CMAN FAQ has a lot of good advice too:

http://sources.redhat.com/cluster/wiki/FAQ/CMAN
http://sources.redhat.com/cluster/wiki/CategoryHowTo

The man pages for the qdisk component are quite good too:
man 5 qdisk
man 8 qdiskd
man mkqdisk

MK
Alzhy
Honored Contributor

Re: node fencing

I used the Linux Dynasty URL recipe to set up my RHCS cluster. I did not notice there was also a nicely (and simply) written brew for Quorum Disk... awesome -- problem solved.

Thanks Matti.
Hakuna Matata.