
Fence Hang when detach power supply, Urgent

 
SOLVED
maxi_1
Occasional Contributor


Hi all!

I have a big problem.
I've installed Red Hat Enterprise Linux 5.2 as a cluster on two HP DL380 servers, and I've decided to use iLO fencing to fence the servers.

This setup works very well if I unplug the Ethernet cable, or manually relocate services (clusvcadm) from one server to the other, and so on.

But if I unplug both power cables from one server (which goes down :-)), the surviving server starts trying to fence the server that is down, and does not understand that the other one is already down!

So the fencing fails (which is normal, since the other server has no power), and the surviving server waits and waits... and the services that were on the failed server are not relocated, because the cluster is waiting for the fencing to succeed.

In the log on the surviving server I get:

mhs2 fenced[4137]: fence "mhs1.local" failed
mhs2 fenced[4137]: fencing node "mhs1.local"

and other lines, which I've attached as "error.txt".

Only when I restart the failed server does the cluster start working correctly again.

Please help me!

Bye
mb from Italy



Brem Belguebli
Regular Advisor

Re: Fence Hang when detach power supply, Urgent

Hi,

I don't know much about RH cluster, but reading your post I'd suggest you open a bug report with Red Hat, as the behaviour you describe looks like a bug.

Is there any "out of band" heartbeat in RH cluster that you may be missing, and whose absence may cause the fencing to hang?
macosta
Trusted Contributor

Re: Fence Hang when detach power supply, Urgent

If you remove all server power (unplugging both power supplies, assuming you have redundant power), you are confronting the cluster with a multiple-failure scenario.

If the server chassis has no power, the iLO also has no power. The remaining cluster nodes have no way of knowing that the server is powered off, since communication has been severed. The cluster only knows that it cannot reach the node via the network.

If you shut the node down cleanly, the server should leave the cluster in a clean state and not need to be fenced.

I am not as familiar with RH Cluster Suite, but Polyserve works the same way.
maxi_1
Occasional Contributor

Re: Fence Hang when detach power supply, Urgent

But are you sure that this is normal behaviour for a cluster?

So if one server goes down in a two-node cluster, the other one can't take over the services of the failed server?

That is no good for us...

I attach my cluster.conf.
Brem Belguebli
Regular Advisor

Re: Fence Hang when detach power supply, Urgent

Another thing you should have in a two-node cluster is a tie-breaker.

I think RHCS uses a lock LUN, which acts as a quorum device in case one of the nodes fails.
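
To illustrate why a tie-breaker matters with two nodes, here is a minimal sketch of the vote arithmetic, assuming simple majority voting. The function names and the one-vote tie-breaker are hypothetical and do not come from any cluster package. Note that quorum only decides whether the survivor may keep running; it does not replace fencing.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Simple majority: more than half of all expected votes."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True if the votes still visible reach the majority threshold."""
    return votes_present >= quorum_threshold(expected_votes)


# Two nodes with one vote each and no tie-breaker:
# a lone survivor holds 1 of 2 votes, which is not a majority.
print(is_quorate(votes_present=1, expected_votes=2))

# Same two nodes plus a hypothetical one-vote tie-breaker that the
# survivor can still reach: 2 of 3 votes is a majority.
print(is_quorate(votes_present=2, expected_votes=3))
```

With an even number of equal votes, losing one node always leaves the survivor exactly at half, which is why a third vote is the usual way to break the tie.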


macosta
Trusted Contributor
Solution

Re: Fence Hang when detach power supply, Urgent

maxi,

Yes, that is what I expect. If you're removing multiple power sources, you're simulating a larger failure than the cluster can handle. Most clusters protect against a single point of failure. You're introducing at least two points of failure (a double power failure or a double supply failure).

You're powering off the server as well as the iLO, which the cluster needs in order to fence the node.

If you want to simulate something more likely, such as a kernel panic bringing the node down, I'd suggest actually crashing the kernel, for example with the Alt-SysRq trigger. The node will then be properly fenced, and activity should continue if the cluster is configured correctly.
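
If no keyboard is attached, the same crash can be triggered through the kernel's SysRq interface under /proc. The sketch below is a hypothetical helper (crash_kernel is not part of any package) and assumes it is run as root on the node to be crashed. It defaults to a dry run, because the real call takes the node down immediately.

```python
from pathlib import Path

SYSRQ_ENABLE = Path("/proc/sys/kernel/sysrq")
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def crash_kernel(confirm: bool = False) -> str:
    """Crash the running kernel via SysRq 'c'.

    Without confirm=True this only describes what it would do.
    """
    if not confirm:
        return (f"dry run: would write '1' to {SYSRQ_ENABLE} "
                f"and 'c' to {SYSRQ_TRIGGER}")
    SYSRQ_ENABLE.write_text("1")   # enable all SysRq functions
    SYSRQ_TRIGGER.write_text("c")  # the kernel crashes here
    return "unreachable"           # never returned on a real crash


# Dry run by default; call crash_kernel(confirm=True) as root on a
# TEST node only.
print(crash_kernel())
```

Because the node dies while its iLO keeps power, this tests the case the fence device is designed for.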
Emil Velez
Honored Contributor

Re: Fence Hang when detach power supply, Urgent


I agree with the previous posters.

Let's look at this. Polyserve fences a node if a network outage leaves the other nodes in the cluster unable to communicate with it. Until the silent node has been fenced, the rest of the cluster will not generate I/O to the shared cluster file system, because the other nodes cannot know whether the unresponsive node is really down or is still writing to the disk, which would corrupt data.

So if a node stops talking to the cluster, we fence it to make sure it is down and not accessing the disks.

When you killed the power from the cabinet, you stopped the node from communicating with the rest of the cluster, and you also prevented the rest of the cluster from fencing the node.

Server fencing is meant to shut down the node when the OS fails but the iLO still works.
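
The rule described above (no takeover until a fence is confirmed) can be sketched as a toy model. Everything here is hypothetical: recover_services and the two stand-in fence functions are illustrations, not Polyserve or RHCS code. The model gives up after a fixed number of attempts only so that it terminates; the log in the original post suggests the real daemon keeps retrying.

```python
from typing import Callable


def recover_services(fence: Callable[[], bool], max_attempts: int = 3) -> str:
    """Toy model: services relocate only after a fence is confirmed."""
    for attempt in range(1, max_attempts + 1):
        if fence():
            return f"fenced on attempt {attempt}; services relocated"
    return f"fence failed {max_attempts} times; services stay blocked"


# Hypothetical stand-ins for a fence agent talking to the failed
# node's iLO.
def ilo_reachable() -> bool:
    return True   # OS hung, but the iLO has power and confirms power-off


def ilo_unpowered() -> bool:
    return False  # all power removed, so the iLO cannot confirm anything


print(recover_services(ilo_reachable))
print(recover_services(ilo_unpowered))
```

In the second case the survivor cannot tell a dead node from an isolated one that is still writing, so it blocks instead of risking the shared disks.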

There is a command you can execute

mx server markdown

on one of the nodes to say the node was fenced but this should only be done if the node really is down for another reason.

You should make sure the cabinet has multiple sources of power and does not lose all power.

Steven E. Protter
Exalted Contributor

Re: Fence Hang when detach power supply, Urgent

Shalom,

I think this may be a flaw in the fence software.

You may wish to test this with the fence package from the recently released RHEL 5.3. Other RPMs may also be required.

SEP