Operating System - OpenVMS

Migrating from Quorum Disk to Quorum Node

 
Jim Geier_1
Regular Advisor

Migrating from Quorum Disk to Quorum Node

We have a three-node AlphaServer ES45 OpenVMS cluster running OpenVMS Alpha 7.3-2. Because we want the cluster to survive with only one system, we have a quorum disk with 2 votes, and EXPECTED_VOTES is set to 5. We are going to replace the quorum disk with a quorum node: a DS10 booting from its own internal disk, running OpenVMS Alpha 8.2.
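For reference, OpenVMS computes quorum as (EXPECTED_VOTES + 2) / 2, truncated. A quick check of the arithmetic for both layouts, assuming each ES45 contributes 1 vote:

  Current:  3 ES45s x 1 vote + quorum disk x 2 votes = 5 total; quorum = (5 + 2) / 2 = 3.
            One surviving ES45 (1) + quorum disk (2) = 3 >= 3, so the cluster survives.
  Planned:  3 ES45s x 1 vote + DS10 x 2 votes = 5 total; quorum = (5 + 2) / 2 = 3.
            One surviving ES45 (1) + DS10 (2) = 3 >= 3, so the cluster survives.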

The steps I was planning on taking for the transition from quorum disk to quorum node are the following:

1. Change the MODPARAMS.DAT files for the three ES45 systems to specify DISK_QUORUM = "" and QDSKVOTES = 0, leaving EXPECTED_VOTES at 5. Run AUTOGEN to set the parameters. (A sketch of steps 1-3 follows the list.)
2. Copy CLUSTER_AUTHORIZE.DAT from the ES45 system disk to SYS$COMMON:[SYSEXE] on the DS10.
3. Change EXPECTED_VOTES on the DS10 to 5 and VOTES to 2.
4. Shut down all systems.
5. Boot the DS10, expect it to hang waiting for one additional vote.
6. Boot one of the ES45 systems, and there should be a working cluster.
7. Boot the second and third ES45.
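A minimal DCL sketch of steps 1-3, assuming each ES45 keeps VOTES = 1; the node name in the COPY command is hypothetical:

  $ ! Steps 1 and 3: on each ES45, add these lines to SYS$SYSTEM:MODPARAMS.DAT ...
  $ !     DISK_QUORUM = ""        ! no quorum disk
  $ !     QDSKVOTES = 0
  $ !     EXPECTED_VOTES = 5
  $ ! (on the DS10, use EXPECTED_VOTES = 5 and VOTES = 2 instead)
  $ ! ... then write the new parameters:
  $ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK
  $
  $ ! Step 2: on the DS10, fetch the cluster authorization file
  $ COPY ES45A::SYS$COMMON:[SYSEXE]CLUSTER_AUTHORIZE.DAT -
        SYS$COMMON:[SYSEXE]CLUSTER_AUTHORIZE.DAT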

Am I missing anything in this plan?
What problems or issues might I expect to encounter?
Steven Schweda
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node

How are these systems connected?

> [...] we want the cluster to survive with
> only one system [...]

So, in the new scheme, where's the quorum if
the DS10 and one ES45 go down?

What makes the DS10 better (more reliable?)
than the quorum disk?

I don't see why you'd do it, but your
procedure looks plausible.
Andy Bustamante
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node


"Using an internal disk on the DS-10" does this mean a single scsi disk? For a quorum node, you may want to consider some form of redundancy on the system disk disk, either with shadowing or with hardware based raid.

Your plan looks good. Another, less immediate, option is to have Availability Manager/AMDS running against the clustered nodes. You can force quorum to be recalculated on the fly. The downside, of course, is that this requires manual intervention; the cluster won't continue on its own. If you have outages on multiple nodes, it can be a useful tool.
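A rough sketch of that recovery path, assuming Availability Manager is already monitoring the cluster (menu labels from memory, so verify against the AM documentation):

  1. Open the display for one of the surviving cluster members.
  2. Apply the quorum adjustment fix (it appears among the node fixes as
     something like "Adjust Quorum").
  3. The node recomputes quorum from the votes currently present, and the
     cluster resumes without a reboot.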

Andy


If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Uwe Zessin
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node

> So, in the new scheme, where's the quorum if the DS10 and one ES45 go down?

Its gone. Works by design ;-)
Learn about AV/AMDS/IPC to recover the cluster

> What makes the DS10 better (more reliable?) than the quorum disk?

A quorum disk requires multiple watchers that are directly connected to it. Last time I fiddled with it, it did not work nicely on a shared parallel SCSI bus.
Too many I/Os to the disk can cause failures, too.
.
Jan van den Ende
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node

Jim,

think over Steven's remark twice (or maybe 3 or 4 times).
I can only think of ONE reason: your "production" systems are running some app(s) that are so flaky that THEY regularly crash your systems (which in itself would be reason for a redesign, but I also know situations where that deeply desired redesign is not an option).

Other than that, an odd number of nodes with equal votes is the most stable config (a nice mathematics exercise to prove that!).

A quorum node is really only a real gain if you have two active nodes (it has SOME advantage over a quorum disk), or, most specifically, if you have 2 active SITES (provided the quorum node is at a third site).
If your active sites have more than one node, there are good reasons to have equal total votes PER SITE, spread evenly within each site. (Just use high enough values per node to reach that condition.)
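A worked example of that per-site balancing, for a hypothetical two-site cluster with 2 nodes per site plus a quorum node at a third site, 1 vote each:

  Site A (2 votes) + Site B (2 votes) + quorum node (1 vote) = 5 total;
  quorum = (5 + 2) / 2 = 3.
  Lose all of Site A:   Site B (2) + quorum node (1) = 3 >= 3, cluster survives.
  Lose the quorum site: Site A (2) + Site B (2) = 4 >= 3, cluster survives.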

Proost.

Have one on me (maybe in May in Nashua?)

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Steven Schweda
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node

> Its gone. Works by design ;-)

That's "it's", of course. I'm just trying
to see how the new scheme satisfies the
stated requirement.

> Learn about AV/AMDS/IPC to recover the cluster

I've done this. ("D/I 14 C" is stuck in my
head for some reason.) As above, I fail to
see how the new scheme satisfies the stated
requirement.

If manual intervention is allowed, who needs
more than the "the three ES45 systems"
(without even the quorum disk)?
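For reference, the sequence being recalled here is the VAX-era console route into the IPC (Interrupt Priority C) handler, which can recalculate quorum without a reboot. This is from memory, so verify against the documentation for your hardware (Alpha consoles differ):

  Ctrl/P            ! halt to the console prompt
  >>> D/I 14 C      ! deposit C into internal register 14, requesting the IPL C interrupt
  >>> C             ! continue; the IPC> prompt appears
  IPC> Q            ! recalculate quorum from the nodes currently present
  IPC> Ctrl/Z       ! leave IPC and resume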
Jim Geier_1
Regular Advisor

Re: Migrating from Quorum Disk to Quorum Node

The systems each have three Gigabit Ethernet adapters (on different PCI buses). Two of those are connected to two Cisco 6509 switches. The third is connected to a Cisco 3550-12T that is not connected to the rest of the network (essentially acting as a Star Coupler). No SCACP channel prioritization is being done; VMS distributes the SCS traffic across the channels as it sees fit. SCS traffic is seen on all three channels, although more on the third channel, which is dedicated to SCS.
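For anyone who wants to see how that SCS load is spread, SCACP will show the per-channel picture (command names as I recall them; output format varies by version):

  $ MCR SCACP
  SCACP> SHOW CHANNEL      ! per-channel state and traffic counters
  SCACP> SHOW VC           ! virtual circuits and which channels carry them
  SCACP> EXIT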

Quorum disks are fine as long as they do not fail. I know it is not supposed to happen in theory, but in my experience (and I have worked with clusters since the field test of VAX/VMS V4 in 1984) the typical scenario when a quorum disk fails is that the cluster hangs and cannot be recovered without a complete cluster reboot. Not always, but far too often to dismiss; it has to be treated as a likely, even expected, outcome.

I don't expect that the DS10 will have a better MTBF than the quorum disk, but replacing the quorum disk with the DS10 will yield better performance, and I suspect a better recovery scenario when the DS10 fails than what has been experienced when a quorum disk fails.

Regarding the question about what happens when the DS10 AND an ES45 fail? In our current configuration, what happens when the quorum disk AND an ES45 fail at the same time? I don't see a real difference between those two scenarios.
Uwe Zessin
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node

> replacing the quorum disk with the DS10 will yield better performance

Agreed. When I played with this, cluster state transitions went _much_ faster - even with a reduced quorum disk polling interval.
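For context, the polling interval mentioned here is the QDSKINTERVAL system parameter (seconds between quorum disk polls); lowering it speeds up failure detection at the cost of extra I/O to the disk. In MODPARAMS.DAT it would look like:

  QDSKINTERVAL = 2    ! hypothetical value; poll the quorum disk every 2 seconds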
.
Jan van den Ende
Honored Contributor
Solution

Re: Migrating from Quorum Disk to Quorum Node

Jim,


> Because we want the cluster to survive ..


... if that implies you want your cluster to also survive this operation, you just have to change the order of your actions somewhat:
1.
2.
3.
Extra action: dismount the quorum disk clusterwide (see the sketch below).
5. (without the hang!)
4, 6, 7 hybrid: reboot the ES45s - one at a time.
Result: the cluster keeps running, and the quorum disk is replaced by the DS10.
At no time is there any danger of a split cluster.
Only between the extra action and step 5 are you running on the verge of quorum (an unexpected node leaving would hang the cluster, cancelled by the DS10 joining).
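A minimal DCL sketch of that extra action, with a hypothetical device name; the parameterless SET CLUSTER/EXPECTED_VOTES is optional belt-and-braces that makes the running cluster recompute quorum from the votes actually present:

  $ DISMOUNT/CLUSTER $1$DGA5:      ! hypothetical quorum disk device
  $ SET CLUSTER/EXPECTED_VOTES     ! recompute quorum from current votes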

--- just the thought experiment by someone who HAS replaced hardware while keeping the cluster available --

Proost.

Have one on me (maybe in May in Nashua?)

jpe


Don't rust yours pelled jacker to fine doll missed aches.
Jim Geier_1
Regular Advisor

Re: Migrating from Quorum Disk to Quorum Node

Regarding the application, we are running GE/IDX applications using InterSystems Caché database software. The applications are very stable and perform well. Caché is VERY stable and performs extremely well compared to DSM.

We typically have scheduled outages for various things every 2-3 months, and unscheduled individual node outages are running about 1 per quarter since we moved the systems to a new data center a year ago. Prior to that, we had system hardware failures about 2 or 3 times per year. Most failures are memory problems, but 4 or 5 per year does not really make a strong trend.