
Re: 2 node cluster + quorum node - network breaks, what happens?

 
MarkOfAus
Valued Contributor

Jan,

"What amount of data are we talking?
Disk space is pretty cheap these days, so the setup where you have your DATA triplicated (shadowed) over 3 sites should be affordable, if node C has the slot capacity."

Unfortunately, node C has no such capacity; the constraint remains that no extra computing power (read: money) will be thrown at this "system", regardless of its importance, it seems.

"A compromise might be to have your system control files (SYSUAF, RIGHTSLIST, QUEfiles, etc) and the programs stuff, and whatever more, shadowed only between A and B, and have only the important application DATA (or even only the subject-to-change part of that) on all 3 sites."

Hmm, this sounds like a real option. I hadn't thought of that.
MarkOfAus
Valued Contributor

Hoff,

"Which cluster member node drops out depends on which link drops out."

And herein was the question, i.e., specifically the link between A and B, the two nodes performing all the work, not node C, which is doing nothing but being a quorum voter.


"Not on which node is a "quorum node". For the purposes of the quorum scheme itself, there's no specific concept of a "quorum node", it's just another voting node. The concept is helpful in some cases, but it appears to be confusing things here."

Not in the least. If a three node cluster is supposed to be better than a 2 node one, then it is because it is affording the ability of the cluster to stay "up" when one node goes down or one link goes down, surely?

With A & B linked together, if the link goes down, either A or B continues; no data loss. If node B, for example, goes down, the whole cluster hangs; no data loss. If introducing C was to help keep the cluster up, the trade-off seems to be possible data loss. I guess that is the crux of the argument. So the question is how to mitigate that situation.

In a nutshell, are the only options "keep it running and possibly lose data" or "let it hang and lose no data"?


"If Node A can (for instance) contact Node B but not C, and if B and C are connected, then A will halt processing until the timers fire and then drop out. If B and C have enough votes, they'll continue processing after the cluster transition."

This is presuming C has more votes than A and B?

Yes, that's the easy scenario; mine was A connected to C AND B connected to C. Given that, which combination goes on to become the "cluster"?
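
To make the vote arithmetic behind this question concrete, here is a minimal Python sketch. It assumes the documented OpenVMS rule that quorum is (EXPECTED_VOTES + 2) / 2 with the fraction discarded, and that EXPECTED_VOTES equals the sum of all votes. The function names and the one-vote-per-node assignment are illustrative assumptions, not any OpenVMS interface.

```python
def quorum(expected_votes):
    """Quorum under the assumed rule (EXPECTED_VOTES + 2) / 2, fraction discarded."""
    return (expected_votes + 2) // 2


def has_quorum(members, votes):
    """True if the listed members together hold at least quorum.

    `votes` maps node name -> VOTES; EXPECTED_VOTES is assumed to be
    the sum of all configured votes.
    """
    expected = sum(votes.values())
    return sum(votes[m] for m in members) >= quorum(expected)


# Two-node cluster, one vote each: losing either node loses quorum (cluster hangs).
two_node = {"A": 1, "B": 1}
print(has_quorum(["A"], two_node))

# Three-node cluster, one vote each: any two nodes that can see each other keep quorum.
three_node = {"A": 1, "B": 1, "C": 1}
print(has_quorum(["A", "C"], three_node))
print(has_quorum(["B", "C"], three_node))
print(has_quorum(["A"], three_node))
```

Under these assumptions both A+C and B+C hold quorum after an A-B link break, which is why the arithmetic alone does not say which pair continues.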

Martin Hughes
Regular Advisor

Mark,

In the event of a link failure between nodes A and B, but not C, connection manager will halt processing and then either A or B will crash. This prevents the cluster being partitioned.

Are your disks mirrored between sites using host-based shadowing?
For the fashion of Minas Tirith was such that it was built on seven levels, each delved into a hill, and about each was set a wall, and in each wall was a gate. (J.R.R. Tolkien). Quote stolen from VAX/VMS IDSM 5.2
MarkOfAus
Valued Contributor

Martin,

"In the event of a link failure between nodes A and B, but not C, connection manager will halt processing and then either A or B will crash. This prevents the cluster being partitioned."

Ah, yes, but which one? And in the time between it deciding, is there data loss potential?

"Are your disks mirrored between sites using host-based shadowing?"

Yes, except system disks.
Martin Hughes
Regular Advisor

>>And in the time between it deciding, is there data loss potential.<<

Ok, so nodes A and B have local storage at each site, mirrored through host-based shadowing. Each server presents its local disks to the remote server via MSCP.

When the link between A and B breaks, node A can no longer see the disks presented by node B, and node B can no longer see the disks presented by node A. You cannot complete a write operation since you cannot write to both disks. I'd expect that the first write operation attempted to any volume would result in that volume being placed in "Mount Verify" status, effectively suspending all I/O.

Meanwhile, connection manager detects the problem and either node A or node B crashes. For argument's sake, let's say node A crashes. Node A's local disks are removed from the shadowsets, and node B continues processing with valid data on its local disks.

At least that is how I understand it. I can't remember how often connection manager polls and thus how long it will take connection manager to detect the broken link and halt processing. But I don't think it matters, since your I/O will already be suspended, and you are not at risk of data corruption.
MarkOfAus
Valued Contributor

Hi all,

First, thank you very much to all those who contributed to this lively conversation.

In summary, I have deduced the following in this theoretical 3-node cluster:

1. Any form of automatic quorum adjustment (letting the cluster decide via votes) leads to what is famously termed "creeping doom failure". Data loss is possible, even probable, in this situation.

2. When/If the link fails between the primary and secondary nodes, the only way to regain quorum for the cluster is to manually intervene, thus preventing possible data loss. Not my choice, but inevitable it seems.

3. An HBVS setup is also a way to ensure no data is lost when/if the link between the primary node and the secondary node is dropped. This is provided the "manual intervention" model is used.

Once again, thank you for your time and effort in clarifying these issues for me.

Thanks
Mark
Martin Hughes
Regular Advisor

Mark,

I think you are confusing some points still.

1. As I understand it, creeping-doom refers to the scenario where (as an example) you have a fire at site A, which knocks out your link between node A and B, and shortly after the fire takes out node A. The point being that you cannot 100% determine which node will remain up in the event of a link failure between nodes A and B (other than to say that C will remain up).

2/3. Not sure exactly what you mean by manual intervention. The problem in this scenario is that you actually have quorum still, but the cluster is partitioned due to the link failure. Connection Manager will automatically crash either node A or B to resolve the situation.
MarkOfAus
Valued Contributor

Martin,

"I think you are confusing some points still.

1. As I understand it, creeping-doom refers to the scenario where (as an example) you have a fire at site A, which knocks out your link between node A and B, and shortly after the fire takes out node A. The point being that you cannot 100% determine which node will remain up in the event of a link failure between nodes A and B (other than to say that C will remain up)."

I think you're sort of correct, except for one thing. In your example, the connection manager may actually decide to kill off B. Then the fire takes hold and takes out A. Your changes (writes) to A in the interim may be lost, and in fact certainly will be if the fire engulfs your server.
Doom! :-)


"2/3. Not sure exactly what you mean by manual intervention. The problem in this
"

If you must manually intervene, then the cluster cannot continue to write data to node A or node B; instead, the entire cluster hangs. No data is lost if the fire takes hold of node A.


Alas, I would love for there to be a solution that involves no hands-on, but there does not seem to be.
Jan van den Ende
Honored Contributor

Mark,

>>>
"creeping doom failure". Data loss is possible, even probable, in this situation.
>>>
No, that certainly is NOT probable, but alas, it is also not impossible.
It requires the loss of a site (in itself already not too common a phenomenon), but NOT "in a single stroke". The course of events must specifically be such that the troubled site (say, A) LOSES connectivity to B, but NOT to C, and the node stays up long enough to make "significant" changes to the "important" data. After that, node A and/or its connectivity also go "out". Note that the initial loss of node A is no problem, and instant loss of site A is no problem.
Initial loss of A-C connectivity will mean a CLUEXIT of node A (B & C continuing, everything OK) or a CLUEXIT of C (shadowing between A and B maintains data integrity; later loss of node A does not result in data loss).
And again, 3-site shadowing of the vital data would prevent loss of vital data.

All in all: the creeping doom scenario is highly unlikely, BUT: NOT impossible. Again, it boils down to the damage you risk, and the "insurance fee" you are willing to pay to prevent (or at least mitigate) that damage.

hth

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Hoff
Honored Contributor

:"In the event of a link failure between nodes A and B, but not C, connection manager will halt processing and then either A or B will crash. This prevents the cluster being partitioned."
:
:Ah, yes, but which one.

With A1-B1-C1 voting, that's likely going to be officially indeterminate AFAIK/AFAICR.

With A3-B2-C2, it should be B that exits.

:And in the time between it deciding, is there data loss potential.

No.
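
The two vote assignments above can be compared with a small Python sketch. It is only vote arithmetic under stated assumptions: quorum is taken as (EXPECTED_VOTES + 2) / 2 with the fraction discarded, EXPECTED_VOTES is assumed to be the sum of all votes, and a candidate sub-cluster is assumed to be any group of nodes in which every pair still has a working link. The helper names are hypothetical, and the connection manager's real selection logic is not modelled.

```python
from itertools import combinations


def quorum(expected_votes):
    # Assumed rule: (EXPECTED_VOTES + 2) / 2, fraction discarded.
    return (expected_votes + 2) // 2


def connected_groups(nodes, links):
    """Return the maximal groups of nodes in which every pair has a working link."""
    linked = {frozenset(link) for link in links}
    groups = []
    # Try larger groups first so that smaller subsets can be skipped.
    for size in range(len(nodes), 0, -1):
        for group in combinations(sorted(nodes), size):
            fully_linked = all(frozenset(pair) in linked
                               for pair in combinations(group, 2))
            candidate = set(group)
            if fully_linked and not any(candidate < g for g in groups):
                groups.append(candidate)
    return groups


def candidate_votes(votes, links):
    """Map each fully connected group (as a sorted tuple) to its vote total."""
    return {tuple(sorted(g)): sum(votes[n] for n in g)
            for g in connected_groups(votes.keys(), links)}


surviving_links = [("A", "C"), ("B", "C")]  # the A-B link is down

for votes in ({"A": 1, "B": 1, "C": 1}, {"A": 3, "B": 2, "C": 2}):
    needed = quorum(sum(votes.values()))
    print(votes, "quorum:", needed, candidate_votes(votes, surviving_links))
```

With one vote each, A+C and B+C tie at 2 votes against a quorum of 2, which fits "officially indeterminate". With A3-B2-C2, A+C totals 5 and B+C totals 4 against a quorum of 4, so both still reach quorum but the A+C side is the heavier one, consistent with the expectation that B exits.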
Thomas Ritter
Respected Contributor

So what happens if the cluster is shut down and A and B cannot see each other, but A sees C and B sees C? Is this a bit like a quorum disk with parallel paths?
Thomas Ritter
Respected Contributor

Forgot to add... Both A and B are booted. Under this configuration, could A and B not form their own cluster if either A or B can see C?
MarkOfAus
Valued Contributor

Thanks all for your assistance, it was much appreciated.