Operating System - OpenVMS

Migrating from Quorum Disk to Quorum Node

 
SOLVED
Jim Geier_1
Regular Advisor


We have a three-node AlphaServer ES45 OpenVMS cluster running OpenVMS Alpha 7.3-2. Because we want the cluster to survive with only one system, we have a quorum disk with 2 votes, and expected_votes is set to 5. We are going to replace the quorum disk with a quorum node using a DS10 booting off its own, internal disk, running OpenVMS Alpha 8.2.
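
For reference, the vote arithmetic behind the current setup can be sketched as below. OpenVMS computes quorum as (EXPECTED_VOTES + 2) / 2, rounded down. The sketch assumes each ES45 has VOTES = 1 (not stated above, but implied by EXPECTED_VOTES = 5 with a 2-vote quorum disk); the Python names are illustrative only.

```python
def quorum(expected_votes: int) -> int:
    """OpenVMS quorum: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2

# Assumed vote assignment: 1 vote per ES45, 2 votes for the quorum disk.
ES45_VOTES = 1
QDSK_VOTES = 2
EXPECTED_VOTES = 5

required = quorum(EXPECTED_VOTES)
one_node_plus_qdisk = ES45_VOTES + QDSK_VOTES

print(f"quorum required: {required}")
print(f"one ES45 + quorum disk: {one_node_plus_qdisk} votes, "
      f"cluster survives: {one_node_plus_qdisk >= required}")
```

With these numbers a single ES45 plus the quorum disk holds exactly the votes needed, which is the "survive with only one system" requirement.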

The steps I was planning on taking for the transition from quorum disk to quorum node are the following:

1. Change the MODPARAMS.DAT files for the three ES45 systems to indicate DISK_QUORUM = "" and QDSKVOTES = 0, leaving EXPECTED_VOTES at 5. Run AUTOGEN to set the parameters.
2. Copy CLUSTER_AUTHORIZE.DAT from the ES45 system disk to SYS$COMMON:[SYSEXE] on the DS10.
3. Change EXPECTED_VOTES on the DS10 to 5 and VOTES to 2.
4. Shut down all systems.
5. Boot the DS10, expect it to hang waiting for one additional vote.
6. Boot one of the ES45 systems, and there should be a working cluster.
7. Boot the second and third ES45.
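
The boot order in steps 5 through 7 can be checked with a small vote tally. This is only a sketch: it assumes each ES45 has VOTES = 1, the node names are made up, and it models nothing but the running vote count against quorum.

```python
def quorum(expected_votes: int) -> int:
    """OpenVMS quorum: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2

def boot_sequence(order, votes, expected_votes):
    """Return (node, running_votes, has_quorum) after each node boots."""
    needed = quorum(expected_votes)
    running = 0
    states = []
    for node in order:
        running += votes[node]
        states.append((node, running, running >= needed))
    return states

# Hypothetical node names; ES45 vote values are an assumption.
VOTES = {"DS10": 2, "ES45_A": 1, "ES45_B": 1, "ES45_C": 1}
ORDER = ["DS10", "ES45_A", "ES45_B", "ES45_C"]

for node, running, has_quorum in boot_sequence(ORDER, VOTES, expected_votes=5):
    status = "quorum reached" if has_quorum else "hangs waiting for votes"
    print(f"{node:7} boots -> {running} votes: {status}")
```

The DS10 alone holds 2 of the 3 votes required, so it waits; the first ES45 to join brings the total to 3.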

Am I missing anything in this plan?
What problems or issues might I expect to encounter?
13 REPLIES
Steven Schweda
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node

How are these systems connected?

> [...] we want the cluster to survive with
> only one system [...]

So, in the new scheme, where's the quorum if
the DS10 and one ES45 go down?

What makes the DS10 better (more reliable?)
than the quorum disk?

I don't see why you'd do it, but your
procedure looks plausible.
Andy Bustamante
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node


"Using an internal disk on the DS-10": does this mean a single SCSI disk? For a quorum node, you may want to consider some form of redundancy on the system disk, either with shadowing or with hardware-based RAID.

Your plan looks good. Another, less immediate option is to have Availability Manager/AMDS running on the clustered nodes. You can force quorum to be recalculated on the fly. The downside, of course, is that this requires manual intervention; the cluster won't continue on its own. If you have outages on multiple nodes, it can be a useful tool.

Andy


If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Uwe Zessin
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node

> So, in the new scheme, where's the quorum if the DS10 and one ES45 go down?

Its gone. Works by design ;-)
Learn about AV/AMDS/IPC to recover the cluster

> What makes the DS10 better (more reliable?) than the quorum disk?

A quorum disk requires multiple watchers that are directly connected to it. Last time I fiddled with it, it did not work well on a shared parallel SCSI bus.
Too many I/Os to the disk cause failures, too.
Jan van den Ende
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node

Jim,

think over Steven's remark twice (or maybe 3 or 4 times).
I can only think of ONE reason: your "production" systems are running some app(s) that are so flaky that THEY regularly crash your systems (which in itself would be reason for a redesign, but I also know situations where that deep desire is not an option).

Other than that, an odd number of nodes with equal votes is the most stable config (a nice mathematics exercise to prove that!)
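
A quick way to see the pattern (not a proof) is to count how many equal-vote nodes can be lost before the survivors drop below quorum. The sketch assumes EXPECTED_VOTES equals the sum of all votes and that quorum is not adjusted after a node leaves.

```python
def quorum(expected_votes: int) -> int:
    """OpenVMS quorum: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2

def tolerated_failures(nodes: int, votes_per_node: int = 1) -> int:
    """Largest number of failed nodes that still leaves the survivors at quorum."""
    needed = quorum(nodes * votes_per_node)
    return max(
        failed
        for failed in range(nodes + 1)
        if (nodes - failed) * votes_per_node >= needed
    )

for n in range(1, 6):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Under these assumptions, going from an odd count to the next even count adds a node without adding any failure tolerance.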

A quorum node is really only a real gain if you have two active nodes (it has SOME advantage over a quorum disk), or, and most specifically, if you have 2 active SITES (providing the quorum node is at a third site).
If your active sites have more than one node, there are good reasons to have equal total votes PER SITE, spreading them evenly within each site. (Just use high-enough values per node to reach that condition).
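
As an illustration of "high-enough values per node", here is a sketch that balances votes across two sites of different sizes. The site sizes and the single vote for the third-site quorum node are made-up values, not anything from this thread.

```python
from functools import reduce
from math import gcd

def quorum(expected_votes: int) -> int:
    """OpenVMS quorum: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2

def lcm(a: int, b: int) -> int:
    return a * b // gcd(a, b)

def balanced_votes(nodes_per_site):
    """Return (votes per site, votes per node at each site) with equal site totals."""
    site_total = reduce(lcm, nodes_per_site)
    return site_total, [site_total // n for n in nodes_per_site]

NODES_PER_SITE = [2, 3]   # hypothetical: 2 nodes at one site, 3 at the other
QUORUM_NODE_VOTES = 1     # hypothetical: quorum node at a third site

site_total, per_node = balanced_votes(NODES_PER_SITE)
expected = site_total * len(NODES_PER_SITE) + QUORUM_NODE_VOTES
needed = quorum(expected)
survives_site_loss = expected - site_total >= needed

print(f"votes per node by site: {per_node}, votes per site: {site_total}")
print(f"EXPECTED_VOTES {expected}, quorum {needed}, "
      f"survives loss of one active site: {survives_site_loss}")
```

With equal totals per site, losing either active site leaves the other site plus the quorum node at exactly quorum.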

Proost.

Have one on me (maybe in May in Nashua?)

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Steven Schweda
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node

> Its gone. Works by design ;-)

That's "it's", of course. I'm just trying
to see how the new scheme satisfies the
stated requirement.

> Learn about AV/AMDS/IPC to recover the cluster

I've done this. ("D/I 14 C" is stuck in my
head for some reason.) As above, I fail to
see how the new scheme satisfies the stated
requirement.

If manual intervention is allowed, who needs
more than the "the three ES45 systems"
(without even the quorum disk)?
Jim Geier_1
Regular Advisor

Re: Migrating from Quorum Disk to Quorum Node

The systems each have three GB network adapters (on different PCI busses). Two of those are connected to two Cisco 6509 switches. The third is connected to a Cisco 3550-12T that is not connected to the network (essentially acting as a Star Coupler). No SCACP channel prioritization is being done, VMS distributes the SCS traffic on the channels as it sees fit. SCS traffic is seen on all three channels, although more on the third channel that is dedicated to SCS.

Quorum disks are fine as long as they do not fail. I know it is not supposed to happen in theory, but in my experience (and I have worked with clusters since the field test of VAX/VMS V4 in 1984) the typical scenario when a quorum disk fails is that the cluster becomes hung, and cannot be recovered without a complete cluster reboot. Not always, but far too often to overlook that possibility as a very likely and even expected outcome.

I don't expect that the DS10 will have a better MTBF than the quorum disk, but replacing the quorum disk with the DS10 will yield better performance, and I suspect a better recovery scenario when the DS10 fails than what has been experienced when a quorum disk fails.

Regarding the question about what happens when the DS10 AND an ES45 fail? In our current configuration, what happens when the quorum disk AND an ES45 fail at the same time? I don't see a real difference between those two scenarios.
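
To make that comparison concrete, the sketch below enumerates every combination of failed members and lists the ones that leave the survivors below quorum, for both configurations. It assumes 1 vote per ES45, uses made-up member names, and looks only at vote counts; it says nothing about how the cluster actually behaves while a quorum disk is failing.

```python
from itertools import combinations

def quorum(expected_votes: int) -> int:
    """OpenVMS quorum: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2

def fatal_failures(votes, expected_votes):
    """Return every set of failed members that leaves the survivors below quorum."""
    needed = quorum(expected_votes)
    members = sorted(votes)
    fatal = []
    for size in range(1, len(members) + 1):
        for failed in combinations(members, size):
            surviving = sum(v for m, v in votes.items() if m not in failed)
            if surviving < needed:
                fatal.append(failed)
    return fatal

# Hypothetical member names; ES45 vote values are an assumption.
with_disk = {"ES45_A": 1, "ES45_B": 1, "ES45_C": 1, "QDSK": 2}
with_node = {"ES45_A": 1, "ES45_B": 1, "ES45_C": 1, "DS10": 2}

fatal_disk = fatal_failures(with_disk, expected_votes=5)
fatal_node = fatal_failures(with_node, expected_votes=5)

print("quorum disk config, fatal combinations:", fatal_disk)
print("quorum node config, fatal combinations:", fatal_node)
```

Both lists have the same shape: the 2-vote member plus any one ES45 is fatal, while any two ES45s alone are not.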
Uwe Zessin
Honored Contributor

Re: Migrating from Quorum Disk to Quorum Node

> replacing the quorum disk with the DS10 will yield better performance

Agreed, when I played with this, cluster state transitions went _much_ faster - even with a reduced quorum disk polling interval.
Jan van den Ende
Honored Contributor
Solution

Re: Migrating from Quorum Disk to Quorum Node

Jim,


> Because we want the cluster to survive ..


... if that implies you want your cluster to also survive this operation, you just have to change the order of your actions somewhat:
Steps 1, 2, and 3: as in your plan.
Extra action: dismount the quorum disk clusterwide.
Step 5: boot the DS10 (without the hang!).
Steps 4, 6, and 7 combined: reboot the ES45s, one at a time.
Result: the cluster is still running, with the quorum disk replaced by the DS10.
At no time is there any danger of a split cluster.
Only between the extra action and step 5 are you running on the verge of quorum (an unexpected node departure would hang the cluster, and the hang would clear once the DS10 joins).
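
The vote count through this reordered sequence can be tallied as a sanity check. The sketch assumes 1 vote per ES45 and that the quorum disk's 2 votes stop counting once it is dismounted, as described above; the step labels are paraphrases.

```python
def quorum(expected_votes: int) -> int:
    """OpenVMS quorum: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2

NEEDED = quorum(5)  # EXPECTED_VOTES stays at 5 throughout

# (description, votes present) at each point; ES45 votes are assumed to be 1 each.
STEPS = [
    ("start: 3 x ES45 + quorum disk", 3 + 2),
    ("quorum disk dismounted clusterwide", 3),
    ("DS10 joins with 2 votes", 3 + 2),
    ("one ES45 down for its reboot", 2 + 2),
    ("that ES45 back in the cluster", 3 + 2),
]

for description, votes in STEPS:
    margin = votes - NEEDED
    print(f"{description:36} votes={votes} margin={margin}")

never_below_quorum = all(votes >= NEEDED for _, votes in STEPS)
tightest_margin = min(votes - NEEDED for _, votes in STEPS)
```

The margin is zero only in the window after the dismount and before the DS10 joins; during the rolling ES45 reboots there is one vote to spare.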

--- just the thought experiment by someone who HAS replaced hardware while keeping the cluster available --

Proost.

Have one on me (maybe in May in Nashua?)

jpe


Don't rust yours pelled jacker to fine doll missed aches.
Jim Geier_1
Regular Advisor

Re: Migrating from Quorum Disk to Quorum Node

Regarding the application, we are running GE/IDX applications using InterSystems Caché database software. The applications are very stable, and perform well. Caché is VERY stable and performs extremely well compared to DSM.

We typically have scheduled outages for various things every 2-3 months, and unscheduled individual node outages are running about 1 per quarter since we moved the systems to a new data center a year ago. Prior to that, we had system hardware failures about 2 or 3 times per year. Most failures are memory problems, but 4 or 5 per year does not really make a strong trend.
Jim Geier_1
Regular Advisor

Re: Migrating from Quorum Disk to Quorum Node

Good suggestion, Jan. We do have a scheduled downtime upcoming, so I have a rare time when I can reboot the entire cluster from "cold metal."
William Brown_2
Occasional Advisor

Re: Migrating from Quorum Disk to Quorum Node

"The systems each have three GB network adapters (on different PCI busses). Two of those are connected to two Cisco 6509 switches. The third is connected to a Cisco 3550-12T that is not connected to the network (essentially acting as a Star Coupler). No SCACP channel prioritization is being done, VMS distributes the SCS traffic on the channels as it sees fit. SCS traffic is seen on all three channels, although more on the third channel that is dedicated to SCS."

Hello, we HAD our config set up this way until recently. We had several problems due to some flapping on our main Cisco switches. The connection did not really go away in both directions, so the SCS traffic did not fail over to the other Gigabit links through the other switches. Several times the links recovered and the cluster transition aborted; several times a node crashed due to the lost connection. HP support suggested that we set our SCS private network to a higher priority to force traffic to that link. We did that (along with Networking fixing their issue on the Cisco switches). That has solved our problem, so far.

This is probably a rare case that does not happen often, but it did happen to us. If you need additional info/proof I can probably get that from our other Sys Admin who dealt with HP directly on this one.

Good day,

Bill
comarow
Trusted Contributor

Re: Migrating from Quorum Disk to Quorum Node

One change I would make: I would raise the priority on the dedicated LAN. As long as it is up, it will use it. I've often seen clusters use the less desirable path.

Do remember it will test all possible paths so that is also a performance consideration.

I wonder what your reason is for moving away from a quorum disk? The mean time between failures with new technology is very high, so I don't really see an advantage. The time to move away from quorum disks is when you have many systems on a SAN, and one node can become the cluster.

Jim Geier_1
Regular Advisor

Re: Migrating from Quorum Disk to Quorum Node

The quorum system has been implemented without problem.