Operating System - OpenVMS

Re: Cluster suspended while one member had a defect

 
Kirsten Knüttel
Frequent Advisor

Cluster suspended while one member had a defect

Hello,

I had a problem for which I can't find the solution, so I hope one of you can help me.

I have a cluster with 4 members (each member has 1 vote, expected votes=3, no quorum disk). Last week the CPU fan of one cluster member failed, so that machine shut down.
In my opinion, the rest of the cluster should have kept running normally. But it seemed as if the other cluster members were suspended. They couldn't be reached, and even on the console you couldn't do anything. They resumed working (without a reboot) when the broken cluster member was back.
So, what could be the cause of this?

Regards,

Kirsten
Karl Rohwedder
Honored Contributor
Solution

Re: Cluster suspended while one member had a defect

A four-member cluster with one vote for each member should have EXPECTED_VOTES set to 4, which yields a quorum of 3. So in your case the remaining nodes should be able to continue if one node fails.
How is storage organized in your cluster? Do the remaining nodes have access to vital disks (e.g. the system disk), or are some disks served by the failing node?
If quorum was lost, there should be messages on the consoles or in the OPERATOR.LOG.

regards Kalle
labadie_1
Honored Contributor

Re: Cluster suspended while one member had a defect

Hello

Do you have a quorum disk?
Can you post the votes of all the members?

It seems the number of votes was below the quorum, which may explain why the cluster hung.

It is a pity that you do not have AMDS or Availability Manager, as it tells you when quorum has been lost and lets you force a new quorum value, so that the cluster works again.
Kirsten Knüttel
Frequent Advisor

Re: Cluster suspended while one member had a defect

Hi,

The system disk is reachable from all machines in the cluster, so this couldn't be the problem.
It is a bit of a mystery to me. The moment one cluster member broke, all the other machines were suspended. There is no entry in the operator log giving the reason. They worked again when the broken cluster member was back. I can't understand this.

Short summary: 4 nodes in a cluster, no quorum disk, expected votes 3, each machine 1 vote.

Regards,

Kirsten
Karl Rohwedder
Honored Contributor

Re: Cluster suspended while one member had a defect

Kirsten,

At least some lost-connection... messages should be in the OPERATOR.LOG.
Can you give some more background on your configuration, e.g. storage, interconnects?

regards Kalle
Mrityunjoy Kundu
Frequent Advisor

Re: Cluster suspended while one member had a defect

For a cluster, the formula for the quorum value is
= (expected_votes/2 + 1) rounded down.
So the calculated quorum in your scenario is = 3/2 + 1 = 2.5, rounded down, i.e. 2.
So as long as at least two nodes are alive, your cluster should stay up. I think you should check the SYSGEN parameters (VOTES, EXPECTED_VOTES, QDSKVOTES) and also the MODPARAMS.DAT file.
Karl Rohwedder
Honored Contributor

Re: Cluster suspended while one member had a defect

EXPECTED_VOTES is used during initial boot to determine the QUORUM needed to allow cluster functionality. If this value is not correct, especially if it is too low, you risk cluster fragmentation. If, during normal operation, the number of votes present exceeds EXPECTED_VOTES, the quorum is raised accordingly. So in this case, when 4 nodes are contributing 4 votes, the QUORUM rises to 3.
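
To illustrate the raising behaviour described above, here is a minimal Python sketch. It is not OpenVMS code: the function names are made up for illustration, and it assumes quorum is derived from a vote count as (votes + 2) / 2 with the fraction dropped, and that the quorum in effect is the larger of the value derived from EXPECTED_VOTES and the value derived from the votes actually present.

```python
def quorum_from(votes: int) -> int:
    """Quorum implied by a vote count: (votes + 2) / 2, truncated (assumed formula)."""
    return (votes + 2) // 2


def running_quorum(expected_votes: int, member_votes: list[int]) -> int:
    """Hypothetical helper: quorum is never lower than the present votes imply."""
    return max(quorum_from(expected_votes), quorum_from(sum(member_votes)))


# The cluster in this thread: EXPECTED_VOTES=3, four members with one vote each.
print(running_quorum(3, [1, 1, 1, 1]))  # prints 3
```

With all four nodes up the sketch gives 3, which matches the value stated above; with only two one-vote nodes present and EXPECTED_VOTES=3 it would give 2.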

regards Kalle
Volker Halle
Honored Contributor

Re: Cluster suspended while one member had a defect

Kirsten,

your votes configuration is correct. For 4 nodes, the majority (i.e. QUORUM) is 3, so the cluster should continue if only one node is lost.

It may be too late to find out why the cluster apparently hung. Do you capture your console data with some console manager application? If not, there should be at least some messages in OPERATOR.LOG, written once the 4th system came back.

If this happens again, and if it really has something to do with lost quorum, you could try the IPC interrupt on the console to recalculate quorum.

Volker.
Volker Halle
Honored Contributor

Re: Cluster suspended while one member had a defect

mrityunjoy kundu,

your formula is wrong.

The correct formula for calculating quorum is:

quorum = (expected_votes+2)/2

In this case, (4+2)/2 gives 3, which is the correct quorum value for 4 votes.
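
As a quick check of the arithmetic, the formula can be written as a one-line Python function. This is only a sketch; the function name is hypothetical, and it assumes the division is integer division (any fraction is dropped).

```python
def quorum(expected_votes: int) -> int:
    """quorum = (expected_votes + 2) / 2, using integer division (assumption)."""
    return (expected_votes + 2) // 2


# Compare the value configured in this thread (3) with the recommended one (4).
for ev in (3, 4):
    print(f"EXPECTED_VOTES={ev} -> quorum={quorum(ev)}")
```

For EXPECTED_VOTES=4 this gives 3; for EXPECTED_VOTES=3 it gives 2.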

Volker.
Dean McGorrill
Valued Contributor

Re: Cluster suspended while one member had a defect

hi Kirsten,
Check and see whether the votes, expected votes, etc. are what you think they are. When it's running, do:

$ show cluster/continuous
add vote
add quorum
add cluster

that will show what the running cluster has.
see if that makes sense. Dean
Hoff
Honored Contributor

Re: Cluster suspended while one member had a defect

Quorum here should be 3, but the running value here could be as low as 2. Which would be bad.

Expected_Votes -- per the original posting -- is set incorrectly. If connectivity is not available (due to a console configuration error or due to a partial communications disconnection), then the Expected_Votes set to 3 will result in Quorum being calculated as 2, which could then allow two disjoint partitions to operate in parallel, and with the data corruption that typically then ensues.

If you wish to preserve the integrity of your disk data, Expected_Votes should be set to 4, and not to 3.
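
The partitioning risk can be shown with a small Python sketch. This is an illustrative model, not OpenVMS behaviour in detail: the helper names are hypothetical, and it assumes each partition cannot see the other and therefore derives quorum only from its own EXPECTED_VOTES setting, as (expected_votes + 2) / 2 truncated.

```python
def quorum(expected_votes: int) -> int:
    """Quorum from EXPECTED_VOTES: (expected_votes + 2) / 2, truncated (assumed)."""
    return (expected_votes + 2) // 2


def partitions_with_quorum(expected_votes: int, partitions: list[list[int]]) -> list[list[int]]:
    """Hypothetical helper: return the partitions whose votes meet the quorum."""
    q = quorum(expected_votes)
    return [votes for votes in partitions if sum(votes) >= q]


# Four one-vote nodes split 2 + 2 by a communications failure.
split = [[1, 1], [1, 1]]
print(len(partitions_with_quorum(3, split)))  # 2: both halves run independently
print(len(partitions_with_quorum(4, split)))  # 0: neither half reaches quorum
```

With EXPECTED_VOTES=3 both halves believe they have quorum, which is the disjoint-partition case described above. With EXPECTED_VOTES=4 neither half can proceed, and a 3 + 1 split leaves only the three-node side running.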

http://64.223.189.234/node/153

Personally, I view the existing quorum mechanism implemented with system parameters as a design mistake. Far too often, somebody either sets the values incorrectly, or sets the values "creatively", that is, deliberately and erroneously misconfigures them.

The central rationale for the cluster quorum scheme is to prevent your data from getting stomped on. It's not something you want to mis-set, lest you allow your data to get stomped on. And by "stomped on", I here mean "massively corrupted; how current is your BACKUP?", or such.

Stephen Hoffman
HoffmanLabs LLC