03-30-2011 08:39 PM
We have a cluster environment running on two different HP blades. Both nodes fenced each other and went down with some errors. Attached are the syslogs of both nodes. Can anyone explain exactly why both nodes went down simultaneously?
03-31-2011 04:03 AM Solution
To resolve the split-brain situation, each node attempts to fence the other node (at 11:10:56), but the fence connection fails too.
Before the cluster has time to do anything else, on 11:10:57 the network connections between the nodes start working again.
But as both nodes have already decided "I'm the surviving node, and the other one has problems", the nodes are out of sync with each other. To remedy that, openais sends a "you're out of sync - please reboot" message to the other node:
> Mar 25 11:10:57 bgw-node1 openais: [MAIN ] Killing node 192.168.11.69 because it has rejoined the cluster with existing state
The problem is, the openais on the other node has made the same decision and done exactly the same thing to node 1:
> Mar 25 11:10:57 bgw-node1 openais: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart
This causes node 1 to begin a controlled shutdown:
> Mar 25 11:10:58 bgw-node1 shutdown: shutting down for system halt
But meanwhile, node 2 put in a fencing request for node 1 at 11:10:53. At 11:11:34, fenced on node 2 confirms node 1 has been fenced. But because node 1 sent a kill request to node 2's cman at 11:10:57, there is no running cman any more on node 2, because node 2 is performing a shutdown:
> Mar 25 11:11:34 bgw-node2 fenced: Unable to connect to to cman: Connection refused
On node 2, the clvmd is waiting for DLM locks to clear... but because node 1 has been fenced, and node 2 is shutting down and has already stopped the dlm_controld daemon, there may be some difficulties with that :-)
At 11:14:00, the kernel notices clvmd has been blocked in a (DLM-related) kernel function for more than 120 seconds and produces a call trace for it.
> Mar 25 11:10:57 bgw-node2 dlm_controld: cluster is down, exiting
> Mar 25 11:14:00 bgw-node2 kernel: INFO: task clvmd:6577 blocked for more than 120 seconds.
At about 11:15:55, node2 completes its shutdown and begins to reboot.
After node 1 has rebooted, at 11:14:56, node 1 again cannot detect node 2 (because node 2 is still in the process of shutting down), so after waiting for a while, it decides to fence node 2 at 11:15:45. This time, the fencing operation is successful, and the fence daemon on node 1 confirms successful fencing of node 2 at 11:16:26.
At this point, node 1 can be sure it's the only active node in the cluster.
At 11:19:54, node 2 has completed rebooting and rejoins the cluster.
On Mar 25 12:23:18, the network connection between the nodes was lost again. At 12:23:38, both nodes decided to fence each other.
At 12:23:40, the network connection comes back and the nodes again attempt to re-form the cluster, but it's too late: both nodes have again already decided they're the only surviving node. Again each node sends an openais kill request to the other node, and both nodes start a controlled shutdown. This time, neither node successfully fences the other before the cluster daemons are shut down by the openais kill requests.
Apparently both nodes then remain down until node2 is restarted on Mar 26 13:55:20. It forms a cluster alone at 13:55:26. Since the status of node1 is unknown, node2 fences it at 13:56:14. The fencing is confirmed successful at 13:56:28. Node2 starts the bgw service at 13:57:27.
Node1 starts up at 13:59:44. It joins the cluster at 13:59:51. At 14:01:01, node1's startup is complete and bgw service is automatically moved to node1.
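A timeline like the one above is easier to build if the cluster-related lines from both syslogs are merged into a single time-ordered view. The following is only a sketch: the function name and the list of daemons are my own choices, and it assumes classic syslog timestamps ("Mar 25 11:10:57"), GNU sort, and logs that do not span a year boundary.

```shell
# Print cluster-related syslog lines from the given files, merged by time.
# Assumptions: classic syslog timestamps, GNU sort (-M sorts month names),
# and all lines belonging to the same year.
cluster_timeline() {
    grep -hE 'openais|fenced|dlm_controld|clvmd|shutdown' "$@" |
        LC_ALL=C sort -s -k1,1M -k2,2n -k3,3
}
```

For example, "cluster_timeline node1-messages node2-messages" (hypothetical file names for the two attached syslogs) prints the matching lines from both nodes in chronological order.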
I'd say the main problem with your cluster is that both the cluster communication ("heartbeat") connection and the fencing connection are failing intermittently, sometimes simultaneously. A simple fail-over cluster is typically guaranteed to protect against one fault at a time only: such a cluster usually can behave sensibly in many cases where there are two or more faults, but not all possible cases.
(A guaranteed protection against all possible combinations of two simultaneous failures requires multiply-redundant connections and *very* careful cluster planning. It is usually too complicated and expensive unless the system is actually life-and-death critical, like fly-by-wire systems of an airplane.)
You should take whatever steps are necessary to make sure the cluster network connections and fencing connections won't be interrupted at the same time.
If all your network connections go through a single switch, you may experience a short but total network blackout each time the network administrator resets the switch for configuration change or firmware update. In that case, the redundancy of your network connections needs to be improved. For example, connect each node to two separate (but linked) switches, using two network interfaces bonded together in active/backup mode.
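As a rough sketch of such an active/backup bond, assuming a RHEL 5-style network-scripts layout: the interface names, the IP address and the miimon value below are placeholders, not values taken from your systems, and the details vary between releases.

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0
# (on RHEL 5 you would typically also need "alias bond0 bonding"
#  in /etc/modprobe.conf)
DEVICE=bond0
# Placeholder address: replace with the node's real cluster address.
IPADDR=192.0.2.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
# active-backup: one NIC carries traffic, the other takes over on link failure.
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0
# (a matching ifcfg-eth1 with DEVICE=eth1 goes to the second switch)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

With each slave NIC cabled to a different switch, a reset of one switch should no longer interrupt the cluster heartbeat.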
A two-node RHEL cluster can be subject to fencing races, and can be stuck in reboot-and-fence cycles.
To solve that, you need a tie-breaker. One way to do that is to add a third node to your cluster. The third node only needs to be part of the cluster, to participate in the cluster quorum decisions: it does not have to actually run any cluster applications.
Another possible tie-breaker solution would be to configure a quorum disk daemon (qdiskd).
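The reason a tie-breaker helps comes down to vote counting: a partition is quorate only if it holds more than half of the expected votes. The sketch below is just that arithmetic, not anything cman actually runs; the function names are made up, and it assumes one vote per node plus one vote for the tie-breaker (third node or quorum disk).

```shell
# Votes needed for quorum: more than half of the expected votes.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if a partition holding $1 votes out of $2 expected votes is quorate.
is_quorate() {
    [ "$1" -ge "$(quorum_needed "$2")" ]
}

# Hypothetical layout: node1 = 1 vote, node2 = 1 vote, tie-breaker = 1 vote.
expected=3
for held in 1 2; do
    if is_quorate "$held" "$expected"; then
        echo "partition with $held of $expected votes: quorate, may fence"
    else
        echo "partition with $held of $expected votes: inquorate, must wait"
    fi
done
```

When the heartbeat is lost, only the side that still holds the tie-breaker vote has 2 of 3 votes, so only that side is allowed to fence. In a plain two-node cluster, each node is allowed to carry on alone, which is why both of them try to fence at once.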
You have both device-mapper-multipath and EMC PowerPath installed on both nodes. This might cause the disk errors: when device-mapper-multipath takes control of a multipathed device, it tries to lock the corresponding /dev/sd* devices so the userspace programs cannot use them. I guess PowerPath might do something similar. Then smartd tries to examine all /dev/sd* devices... and fails.
Don't use both device-mapper-multipath and EMC PowerPath: pick one or the other. Also you might consider configuring smartd so that it monitors local disks only.
(If you have an EMC CLARiiON SAN, then monitoring it with smartd is probably useless: the CLARiiON will internally monitor its disks, and will produce more meaningful hardware health information of its own. Since the CLARiiON LUNs don't necessarily have a one-to-one relationship with actual CLARiiON disks, the SMART data on the LUNs is going to be "virtual" data generated by the CLARiiON anyway.)
03-31-2011 10:18 AM
Re: node fencing
BTW, how critical are fencing communications, since iLOs (and other management modules) undoubtedly cannot have a teamed/redundant network?
I am in the process of putting RHCS through its paces, as there's a possibility we could drop Poor Man's Clustering in favor of RHCS for simple failovers. One issue I am facing is that a simple network restart on one of the nodes of my 2-node cluster forces a fence on both nodes. How can this be avoided?
Sorry for hijacking your thread, Kranti.
03-31-2011 11:12 PM
Re: node fencing
>One issue I am facing is a simple network restart on one of the nodes on my 2 node cluster forces a fence on both the nodes. How can this be avoided?
If both nodes get fenced, this suggests you aren't using a quorum disk daemon or any other tie-breaker in your cluster. You essentially have the same problem Kranti has: the problem looks exactly the same to both nodes (i.e. "the other node does not respond any more"), so both nodes will make the same decision to fence the other node. To break this symmetry, you need a quorum disk daemon or a third node as a tie-breaker.
In RHCS, fencing is the primary solution to the split-brain problem in the event cluster connections are lost, so the fencing connections are rather important. But it's not like "they must never fail"; the important thing is, they should not fail _at the same time_ as the connections that carry the cluster heartbeat.
If the cluster heartbeat connections are unlikely to fail completely, fencing becomes a rare event - like the cluster equivalent of activating the company's Disaster Recovery plan.
If you have more than one network switch available, connect each cluster node with two NICs to two switches and use bonding in Linux to link the physical NICs to a single aggregate interface.
If a node is an active member of a cluster, you don't mess with its network interfaces - the networking is a critical part of the cluster after all.
If you need to restart networking on a node, you'd better make the node leave the cluster in a controlled fashion first. Otherwise the remaining cluster will definitely try to fence the isolated node. (And in a two-node cluster with no tie-breaker, there is no way for the cluster to determine *which* node is the isolated one, so both nodes will try to fence each other.)
In a Red Hat cluster, to make the node leave the cluster in a controlled fashion, stop the cluster services in the correct order:
service rgmanager stop
service gfs stop # if GFS is used
service clvmd stop # if CLVM is used
service cman stop
Stopping rgmanager should cause all the services still running on the node to automatically be moved to other nodes.
When you run "service cman stop", the /etc/init.d/cman script runs the appropriate commands to notify the other nodes that this node is leaving the cluster. This makes the other node(s) stop expecting any heartbeat messages from this node, so fencing won't be triggered if the node vanishes from the network afterwards.
Rejoining the cluster is as simple as restarting the cluster services in the reverse order:
service cman start
service clvmd start # if necessary
service gfs start # if necessary
service rgmanager start
(With HP Serviceguard, a single cmhaltnode/cmrunnode command does the same thing... you might want to write a little script to start & stop the cluster services with a single command.)
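A minimal sketch of such a wrapper script follows. The function names and the DRY_RUN and STOP_ORDER variables are my own invention, not part of RHCS; the script only replays the service commands listed above, so drop gfs and/or clvmd from STOP_ORDER if you don't use them.

```shell
#!/bin/bash
# Sketch: leave or rejoin a Red Hat cluster with a single command.
# Set DRY_RUN=1 to print the commands instead of running them.

# Stop order as listed in the post; the start order is the reverse.
# Assumption: edit this list to match the services actually in use.
STOP_ORDER="${STOP_ORDER:-rgmanager gfs clvmd cman}"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop the cluster services in order; abort at the first failure so that
# cman is never stopped while a higher-level service is still running.
cluster_stop() {
    local svc
    for svc in $STOP_ORDER; do
        run service "$svc" stop || return 1
    done
}

# Start the cluster services in the reverse of the stop order.
cluster_start() {
    local svc reversed=""
    for svc in $STOP_ORDER; do
        reversed="$svc $reversed"
    done
    for svc in $reversed; do
        run service "$svc" start || return 1
    done
}

case "${1:-}" in
    stop)  cluster_stop ;;
    start) cluster_start ;;
    "")    : ;;  # no argument: only define the functions (allows sourcing)
    *)     echo "usage: $0 {start|stop}" >&2; exit 1 ;;
esac
```

Saved under a hypothetical name such as cluster-node.sh, "DRY_RUN=1 ./cluster-node.sh stop" shows what would be run without touching the cluster.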
04-01-2011 06:52 AM
Re: node fencing
So you suggest that in a 2-node RHCS cluster I should either add a 3rd tie-breaker node (for show really, as it does not have to carry any service/package) OR a quorum disk.
Do you have a quick howto on how to set up a quorum disk, sir?
04-02-2011 03:38 AM
Re: node fencing
In general, the CMAN FAQ has a lot of good advice too:
The man pages for the qdisk component are quite good too:
man 5 qdisk
man 8 qdiskd
04-04-2011 06:17 AM
Re: node fencing