Operating System - HP-UX

SG panics when half cluster is down.

 
CAS_2
Valued Contributor

SG panics when half cluster is down.

Hi

I have an SG cluster with 4 nodes. Last Monday I installed patches on two nodes (1 and 2) of that cluster. Beforehand, I had switched the packages over to the other nodes (3 and 4), but I didn't run cmhaltnode on nodes 1 and 2.

When nodes 1 and 2 rebooted as part of the patching, nodes 3 and 4 panicked at the same time. The panic message was:

Reboot after panic: SafetyTimer expired, isr.ior = 0'10340005.0'f83e01d8

So the HA services went down and didn't recover until 15 or 20 minutes later, because of the crash dump and the subsequent startup.

Since a panic was reported, I opened a call with HP; I believed this was a patch issue.
But HP answered that this is the LOGICAL behaviour of SG clusters, intended to prevent a split-brain.

I am indignant about this.
Do you think that TOCing the surviving nodes of a cluster is logical behaviour?

I agree that, in a split-brain, an SG cluster should stop services, BUT NOT REBOOT the nodes.

I have managed SG clusters with 2 nodes. In all those cases, when a node crashed due to a CPU or memory failure, the surviving node never panicked and rebooted.

Re: SG panics when half cluster is down.

As your cluster is a 4-node cluster, you may not have a cluster lock disk or a quorum server configured - is this the case?

Assuming that's the case, then the behaviour of your cluster is to be expected.

Your cluster couldn't achieve quorum (exactly half - 50% - of the nodes were still up), so it didn't know and couldn't know what the state of the other two nodes was. (It would have known if you had run cmhaltnode, because those nodes would have left the cluster gracefully.)

So in this situation Serviceguard MUST protect your data first, and worry about availability second.

Your idea of stopping services sounds great, but how long does that take, and doesn't stopping an app usually involve writing data to disk? In these scenarios it's just not possible to know that stopping services won't actually corrupt our data (remember, we don't know what the other two nodes are doing) - so the only safe course of action is to stop everything DEAD, and Serviceguard does that by TOC'ing the box.

Of course if you DO have some form of cluster lock (lock disk or quorum server) - then we need to look again at what happened.

If you had rebooted just one node at a time, you wouldn't have had this problem, as quorum (>50% of the nodes) would have been maintained.
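
Roughly, the arithmetic looks like this (just my shorthand for the quorum rule, not an official formula):

  4 nodes up, 1 leaves ungracefully  -> 3 of 4 remain (75%, more than half) -> cluster reforms
  4 nodes up, 2 leave ungracefully   -> 2 of 4 remain (exactly 50%)         -> a cluster lock or quorum server must break the tie; without one the survivors TOC (your case)
  cmhaltnode on nodes 1 and 2 first  -> they leave gracefully               -> the cluster simply reforms as a 2-node cluster, no tie to break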

HTH

Duncan

I am an HPE Employee
melvyn burnard
Honored Contributor

Re: SG panics when half cluster is down.

Well, if you did not halt the cluster services on the two nodes and you installed patches that require a reboot, then unless you have a cluster lock disk or a quorum server configured, this is the expected behaviour.
The reboot did a kill -9 on the cmcld process, which is seen as a failure, and this happened on two out of four nodes, leaving you with 2 nodes, or 50%. Serviceguard REQUIRES more than 50% quorum in a failure scenario, or exactly 50% with access to a cluster lock mechanism.

The correct procedure would be to cmhaltnode the two nodes to be patched; this would allow the other two nodes to reform as a two-node cluster under non-failure conditions.
This would also have prevented the nodes being patched from doing a TOC and dump.
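
Roughly, the per-node sequence would look something like this (node names are placeholders, and you should check the exact options against your man pages):

  cmhaltnode -f node1    # halt cluster services on node1; -f halts/fails over its packages first
  ...install the patches and let the node reboot...
  cmrunnode node1        # rejoin the node to the running cluster once it is back up
  cmviewcl -v            # confirm that nodes and packages are where you expect

Then repeat the same for node2.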
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Bernhard Mueller
Honored Contributor

Re: SG panics when half cluster is down.

Hi,

I side with HP, and recommend you do some reading about "arbitration". It is all available on the HP docs website.

Rebooting two nodes at the same time was not a *single* failure (like one node in a two-node cluster). Still, you could have prevented this by running cmhaltnode before patching them. Then the cluster would have known that the nodes *are* down. With that knowledge you could even have stopped the third node as well, and the remaining node would have known the others are down. Afterwards they could have re-entered the cluster.

Your cluster recognized that it did not have *more* than half of the nodes, and since it did not know what was going on, the nodes *needed* to go down. If the other half of the nodes tried to bring up the packages, you have to be sure they can: if you lose access to the disks you cannot deactivate a VG, and then the other node would not be able to activate it exclusively...

Regards,
Bernhard

CAS_2
Valued Contributor

Re: SG panics when half cluster is down.

Duncan wrote:

Your idea of stopping services sounds great, but how long does that take, and doesn't stopping an app usually involve writing data to disk? In these scenarios it's just not possible to know that stopping services won't actually corrupt our data (remember, we don't know what the other two nodes are doing).

In my case, I switched all the packages to nodes 3 and 4 and disabled switching before installing the patches.
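
The commands were roughly the usual ones (package and node names here are only placeholders, not my real configuration):

  cmhaltpkg pkgA             # halt the package on its current node (this also leaves switching disabled)
  cmrunpkg -n node3 pkgA     # start it manually on node3
  cmviewcl -v                # verify the package is up on node3 and switching is still disabled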
Matti_Kurkela
Honored Contributor

Re: SG panics when half cluster is down.

When exactly one half of the cluster nodes vanishes from the view of the other half, there can basically be two reasons for that:
a) the nodes have failed
b) the nodes might be OK, but the network between them has failed.

But how can the cluster determine which of these two situations it is in? There must be some sort of tie-breaker; otherwise the cluster has NO WAY of knowing. In SG this tie-breaker is either a lock disk or a quorum server.

In situation a) it is obvious that the remaining nodes should continue processing and claim the failed nodes' packages for themselves, if possible.

But situation b) is more dangerous. From the point of view of nodes 1 and 2, nodes 3 and 4 have failed and their services should be moved to nodes 1 and 2.

However, from the point of view of nodes 3 and 4, nodes 1 and 2 have failed... which means that BOTH groups of nodes will attempt to claim the other group's packages' IP addresses and disks. Since there is no connectivity between the groups of nodes, they will succeed, potentially causing data corruption in EVERY package that can be moved between the groups.

So, to answer your question: when there is a possibility that another node is mounting the same disks "this" node is using, and there is no way to communicate with that node, then yes, the only way to avoid data corruption is to stop immediately. A TOC is the fastest way to do that.
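
If you later decide to add a tie-breaker, the quorum-server route is roughly this (the host name is made up; check the exact parameters against the ASCII template your Serviceguard version generates):

  cmquerycl -C /etc/cmcluster/cluster.ascii -q qs-host -n node1 -n node2 -n node3 -n node4
  # the generated cluster.ascii should then contain QS_HOST, QS_POLLING_INTERVAL
  # and QS_TIMEOUT_EXTENSION entries describing the quorum server
  cmcheckconf -C /etc/cmcluster/cluster.ascii    # verify the new configuration
  cmapplyconf -C /etc/cmcluster/cluster.ascii    # apply it to the cluster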

War story:
We recently reconfigured all our SG clusters to use a quorum server, to make our major storage system upgrade more painless - no cluster restarts for lock disk reconfiguration.

However, after that the UPS system of one of our two major server rooms malfunctioned. There was a small fire, and the firemen had to power down the UPS system... which caused a total power outage to one server room.
No problem, the services should fail over to the servers in the other room, right...?

But as it happened, the quorum server was in the server room that had the power outage... so all our production SG nodes did a TOC, since each cluster lost exactly half of its nodes AND the quorum server was unreachable at the same time.

When the power was back, it took about eight hours to restore the production systems (during a Saturday evening) and more time on Monday to restore the test/development/noncritical systems.

We also had a configuration error where a single FibreChannel disk was simultaneously used by two unrelated systems. One of the systems had a lot of unused space in its VG, so the problem did not show up until both systems actually started writing to the disk. That took about two weeks from the initial configuration error.

Sorting out the resulting mess (making copies of the affected disk, verifying the correctness of the data, finding and restoring the corrupted data from backups and/or regenerating it from raw data archived elsewhere) took about 24 hours. Of course, the problem was noticed on a Friday afternoon...

MK
CAS_2
Valued Contributor

Re: SG panics when half cluster is down.

I feel misunderstood ;-(
Rita C Workman
Honored Contributor

Re: SG panics when half cluster is down.

...how so? By SG, or by what everyone is trying to explain to you?

...they are trying to tell you that you created the situation and forced SG (the way it was done) to take the action it did.

So... do you have a lock disk? Or a Quorum Server running? If you had either one of these working properly, you wouldn't have lost your cluster.
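
A quick way to check is to dump the cluster configuration and look for the lock or quorum entries (command syntax from memory, so verify against the man pages):

  cmgetconf -c yourcluster /tmp/cluster.ascii
  egrep 'QS_HOST|CLUSTER_LOCK' /tmp/cluster.ascii    # quorum server or lock disk entries, if any
  cmviewcl -v                                        # current cluster, node and package status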

...or... as they have said, if you had simply run cmhaltnode on each server before you began patching it, your cluster would not have failed, because the cluster would have known that those boxes were now outside the cluster. Then, when you were done patching and rebooting, just run cmrunnode.

Or is there something else ???

Hang in there; SG can be a bit overwhelming at times, but you'll get it!

Rgrds,
Rita
A. Clay Stephenson
Acclaimed Contributor

Re: SG panics when half cluster is down.

Well, you can be as angry as you like, but MC/SG did just what it was supposed to do and what it is documented to do. The fault is not HP's but yours, because you didn't properly halt the nodes.

You can't compare a crashed node to your situation, and you can't compare a two-node cluster either, because a two-node cluster requires a lock disk or a quorum server.

This was nothing more and nothing less than pilot error; learn from this and move on.
If it ain't broke, I can fix that.
Stephen Doud
Honored Contributor

Re: SG panics when half cluster is down.

I'm sorry that you had this unfortunate experience.

The logic behind Serviceguard is explained in the "Managing Serviceguard" manual, which is available to anyone on the internet at this location:
http://docs.hp.com/en/ha.html#Serviceguard

The behavior of Serviceguard when half of the nodes leave unexpectedly is documented in the section titled:
"Cluster Quorum to Prevent Split-Brain Syndrome"
--- which is at this link:
http://docs.hp.com/en/B3936-90065/ch03s02.html#d0e1810