<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Question on Clustering in Operating System - OpenVMS</title>
    <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928799#M11263</link>
    <description>We only have 2 sites. Therefore the quorum server (a 1997 AlphaStation) is located in the primary building (the one with most of the users), but not in the same room as the server.&lt;BR /&gt;&lt;BR /&gt;Wim</description>
    <pubDate>Fri, 19 Jan 2007 03:32:01 GMT</pubDate>
    <dc:creator>Wim Van den Wyngaert</dc:creator>
    <dc:date>2007-01-19T03:32:01Z</dc:date>
    <item>
      <title>Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928791#M11255</link>
      <description>I have a question on VMS Clustering that I need to find an answer to.&lt;BR /&gt;&lt;BR /&gt;If you have four systems clustered together, with two systems at each of two different data centers, each data center has an EVA with matching LUNs configured, and these drives are shadowed between the systems using Volume Shadowing, what happens if the communications between the two data centers drop and the systems at both data centers continue to operate?  If both pairs of systems continue to run, accessing data from their member of the shadow set, how is the data synchronized when communications are restored?&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Phil</description>
      <pubDate>Thu, 18 Jan 2007 16:37:30 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928791#M11255</guid>
      <dc:creator>Phillip Thayer</dc:creator>
      <dc:date>2007-01-18T16:37:30Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928792#M11256</link>
      <description>Phillip,&lt;BR /&gt;&lt;BR /&gt;This is where QUORUM comes into play.&lt;BR /&gt;&lt;BR /&gt;If one of the data centers has more votes than the other, that is the one that continues service; the other data center's system(s) "hang" in "quorum lost".&lt;BR /&gt;If the votes are shared equally, this is true for ALL systems. (So you had better declare a "primary" site or, preferably, add a "quorum system" at a 3rd site. That may be just a small system doing nothing but watching quorum.)&lt;BR /&gt;&lt;BR /&gt;At such times, the (human) decision _CAN_ be taken to tell the "hanging" (minority) data center to continue (AMDS, IPC). Make __D**MN__ sure that anyone authorized to make and execute that decision has verified 200% (very sure, and then double-checked) that the other site is __REALLY__ dead!!&lt;BR /&gt;&lt;BR /&gt;I can testify that on the one occasion this happened to us, it worked as advertised.&lt;BR /&gt;(A "brilliant" millennium test of the fire alarm cut ALL power, including the high-voltage no-break circuitry.)&lt;BR /&gt;&lt;BR /&gt;hth&lt;BR /&gt;&lt;BR /&gt;Proost.&lt;BR /&gt;&lt;BR /&gt;Have one on me.&lt;BR /&gt;&lt;BR /&gt;jpe</description>
      <pubDate>Thu, 18 Jan 2007 16:59:56 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928792#M11256</guid>
      <dc:creator>Jan van den Ende</dc:creator>
      <dc:date>2007-01-18T16:59:56Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928793#M11257</link>
      <description>The chosen storage and even the use of volume shadowing isn't central; this is really a question of cluster quorum, the connection manager, and avoiding a partitioned cluster.&lt;BR /&gt;&lt;BR /&gt;Whichever of the lobes of the data center(s) is found to be lacking quorum will halt all processing.  If the cluster communications disconnection lasts long enough that the nodes in the disconnected lobe(s) are ejected from the cluster (processing will obviously continue on the lobe(s) that retain quorum and connectivity), the ejected nodes will (eventually) CLUEXIT, reboot, and rejoin the cluster when/if connectivity returns to the configuration, and shadowing will then resynchronize the contents of the disks at the (formerly) disconnected lobe(s) based on the contents from the nodes that have continued.  When the node(s) of the lobe(s) are ejected from the cluster, their associated locks and such are automatically released for use by the member node(s) in the remaining lobe(s), and appropriate status values are returned.&lt;BR /&gt;&lt;BR /&gt;In a case where there are exactly balanced lobes and no lobe maintains a majority of the votes (and thus no lobe has quorum by itself), there is and can be no automated way to choose the surviving lobe, and the whole cluster will stall pending manual operator intervention.  One alternative is to pick one of the two lobes and configure it with more than half the votes.  Another is to configure a "quorum lobe", and provide connectivity from that lobe to both of the (larger? main?) lobes.&lt;BR /&gt;&lt;BR /&gt;This is a variation of the two-node cluster configuration, where you have two voting nodes.  The usual suggestions are a third (voting) node, or a quorum disk on a shared interconnect.&lt;BR /&gt;&lt;BR /&gt;Key to this whole clustering scheme, and key to avoiding a partitioned cluster, are the VOTES and EXPECTED_VOTES system parameters.  
A partitioned cluster is a case where disconnected node(s) continue processing without synchronization.  This is bad.  The OpenVMS Cluster software goes out of its way to avoid this case.  (See the OpenVMS FAQ for details on establishing and maintaining the proper settings of the VOTES and EXPECTED_VOTES parameters; &lt;A href="http://www.hoffmanlabs.com/vmsfaq/" target="_blank"&gt;http://www.hoffmanlabs.com/vmsfaq/&lt;/A&gt; has the most current copy of the FAQ.)&lt;BR /&gt;&lt;BR /&gt;All host-based volume shadowing shadowset disks will recover based on the CLUEXIT and subsequent reboot, once connectivity is re-established.  The most recently accessed members -- the copies of the volumes at the lobes that have continued processing -- will be the source of the data, as you would expect.&lt;BR /&gt;&lt;BR /&gt;Stephen Hoffman&lt;BR /&gt;HoffmanLabs&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 18 Jan 2007 17:19:52 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928793#M11257</guid>
      <dc:creator>Hoff</dc:creator>
      <dc:date>2007-01-18T17:19:52Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928794#M11258</link>
      <description>What I am looking for is a solution where the customer can run an application in two different locations (about 150 miles apart) hitting the same data drives.  But if one of the locations drops, the other location will continue to function without any interruption.  When the location that dropped comes back on-line, it will be able to run the application with all the data updated from the other location.&lt;BR /&gt;&lt;BR /&gt;They are currently running in an OpenVMS environment and I am positive that OpenVMS Clustering is the answer for this, with volume shadowing between the two sites.  What I am NOT certain about is what to do in the situation where the communications link between the two sites is lost and they end up with two viable clusters updating the data separately in the two locations.&lt;BR /&gt;&lt;BR /&gt;Do I simply have the customer choose which site is the "parent" site and add a third node as a quorum node at that site?  Then if communications drop, the "child" site will hang with quorum lost until communications are restored?&lt;BR /&gt;&lt;BR /&gt;Phil</description>
      <pubDate>Thu, 18 Jan 2007 18:33:00 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928794#M11258</guid>
      <dc:creator>Phillip Thayer</dc:creator>
      <dc:date>2007-01-18T18:33:00Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928795#M11259</link>
      <description>Phil,&lt;BR /&gt;&lt;BR /&gt;&amp;gt;But if one of the locations drops the &lt;BR /&gt;&amp;gt;other location will continue to function &lt;BR /&gt;&amp;gt;without any interruption&lt;BR /&gt;&lt;BR /&gt;  Depending on your precise definition of "without any interruption", that might not be possible (it may take a few seconds or minutes to recover). If you require recovery at either site without some kind of intervention, then you'll need a 3rd "quorum" site. Whichever site can still see the quorum site will continue running. The other will hang. &lt;BR /&gt;&lt;BR /&gt;&amp;gt;where the communications link between the &lt;BR /&gt;&amp;gt;two sites is lost and they end up with two &lt;BR /&gt;&amp;gt;viable clusters in two locations updating &lt;BR /&gt;&amp;gt;the data separately in two locations.&lt;BR /&gt;&lt;BR /&gt;In a correctly configured cluster, this should not be possible, for precisely the reason you're worried about. You can't have the "same" data being independently updated at both sites. That's the whole purpose of QUORUM - to prevent uncoordinated access to shared data.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;Do I simply have the customer choose which &lt;BR /&gt;&amp;gt;site is the "parent" site and add a third &lt;BR /&gt;&amp;gt;node as a quorum node in that site. Then &lt;BR /&gt;&amp;gt;if the communications drops, the "child" &lt;BR /&gt;&amp;gt;site will hang with quorum lost until &lt;BR /&gt;&amp;gt;communications is restored?&lt;BR /&gt;&lt;BR /&gt;  This is relatively simple to do, and at first glance seems attractive. Most customers think in terms of "primary" and "secondary" sites. However, for some failure scenarios it is NOT a good plan. Consider what Keith Parris calls "creeping doom":&lt;BR /&gt;&lt;BR /&gt;  A small fire starts in the COMMS room of your "primary" site, breaking the connection between sites. The "secondary" site loses quorum and hangs. 
The "primary" site still has quorum, so it continues to process data, updating your databases.&lt;BR /&gt;  The fire now spreads, eventually destroying everything at the primary site. Although the secondary is still intact, the data processed between the loss of comms and the destruction of the site is lost.&lt;BR /&gt;&lt;BR /&gt;  In contrast, if you had a 3rd quorum site, the loss of comms at the primary site would mean it lost quorum and would hang. Since the secondary site can still see the quorum site, it continues to process data. When the primary site is destroyed, the secondary still has good, up-to-date copies of the data.&lt;BR /&gt;&lt;BR /&gt;  As always, it all depends on what failures you want to protect against, your recovery time requirements, and how much money you're prepared to spend. Talk to the HP DTCS folk. They live and breathe this stuff.</description>
      <pubDate>Thu, 18 Jan 2007 18:59:09 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928795#M11259</guid>
      <dc:creator>John Gillings</dc:creator>
      <dc:date>2007-01-18T18:59:09Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928796#M11260</link>
      <description>Phil,&lt;BR /&gt;&lt;BR /&gt;In a similar situation, what I had was essentially two sites with identical numbers of systems.&lt;BR /&gt;&lt;BR /&gt;In one such situation, I used a quorum disk at the primary site to act as a "tie breaker". You cannot use HBVS to shadow a quorum disk (for the obvious reason).&lt;BR /&gt;&lt;BR /&gt;Basically, this pre-assigns an action in the event of a tie. The contingency of a problem at that site can be resolved by having an alternate quorum disk (pre-configured at the other site, and not online).&lt;BR /&gt;&lt;BR /&gt;The important thing, IMO, is to ensure that in the event of an emergency, the path to a functioning environment is as short, safe, and accident-proof as possible. I often recommend having alternate system roots with the changes pre-configured; it prevents 0300 accidents.&lt;BR /&gt;&lt;BR /&gt;- Bob Gezelter, &lt;A href="http://www.rlgsc.com" target="_blank"&gt;http://www.rlgsc.com&lt;/A&gt;</description>
      <pubDate>Thu, 18 Jan 2007 22:22:37 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928796#M11260</guid>
      <dc:creator>Robert Gezelter</dc:creator>
      <dc:date>2007-01-18T22:22:37Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928797#M11261</link>
      <description>There will never be two viable sub-clusters (lobes, sites, whatever) in multiple locations in a properly-configured cluster. &lt;BR /&gt;&lt;BR /&gt;Never.&lt;BR /&gt;&lt;BR /&gt;With proper settings of the VOTES and EXPECTED_VOTES parameters, there will never be multiple active partitions.  &lt;BR /&gt;&lt;BR /&gt;Never.&lt;BR /&gt;&lt;BR /&gt;Partitioning is bad.  Really, really bad.&lt;BR /&gt;&lt;BR /&gt;When a subset of the cluster nodes has and can maintain quorum, processing can continue.  Shadowset member volumes will be accessible.  All will be well within that subset of cluster member nodes. &lt;BR /&gt;&lt;BR /&gt;When a cluster node does not have quorum, the node will hang.  No storage modifications are permitted, and all applications and all code requiring quorum are stalled; no applications can be scheduled.  &lt;BR /&gt;&lt;BR /&gt;When nodes are unreachable and quorum is maintained elsewhere for a sufficient interval, the lower-voting nodes get tossed out of the cluster.  These recalcitrant, disconnected, under-quorum'd cluster nodes must be rebooted to re-enter the cluster.  You'll see these nodes CLUEXIT if/when connectivity is re-established.  As the recalcitrant node(s) did not have quorum, those node(s) will have been deliberately wedged during that interval.  No shadowset modifications.  No nothing.&lt;BR /&gt;&lt;BR /&gt;System-level code can continue, for specific operations.  Device drivers, for instance, can perform certain necessary processing.  Application processes, databases, and user-level code in general, however, will be well and truly wedged.&lt;BR /&gt;&lt;BR /&gt;The quorum hang used to be a deliberately-introduced log-jam at IPL 4.  Now it is implemented with scheduling, and you'll see the various processes wedged in RWCAP state.&lt;BR /&gt;&lt;BR /&gt;OpenVMS goes out of its way to detect and prevent cluster partitioning.  
With proper VOTES and EXPECTED_VOTES settings, you won't ever see cluster partitioning.&lt;BR /&gt;&lt;BR /&gt;The OpenVMS FAQ has details of the VOTES and EXPECTED_VOTES scheme and of the cluster quorum mechanisms.  It also covers why you can't use host-based volume shadowing for a quorum disk.&lt;BR /&gt;</description>
      <pubDate>Thu, 18 Jan 2007 22:47:14 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928797#M11261</guid>
      <dc:creator>Hoff</dc:creator>
      <dc:date>2007-01-18T22:47:14Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928798#M11262</link>
      <description>Just to add to the points already made regarding the 3rd quorum site, it's typical for this quorum site to just have a cheap server whose only purpose in life is to vote. You aren't necessarily looking at running the application at this site, which means it is not all that costly to set up. Think of it as similar to a quorum disk in a 2 node cluster.</description>
      <pubDate>Fri, 19 Jan 2007 01:06:46 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928798#M11262</guid>
      <dc:creator>Martin Hughes</dc:creator>
      <dc:date>2007-01-19T01:06:46Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928799#M11263</link>
      <description>We only have 2 sites. Therefore the quorum server (a 1997 AlphaStation) is located in the primary building (the one with most of the users), but not in the same room as the server.&lt;BR /&gt;&lt;BR /&gt;Wim</description>
      <pubDate>Fri, 19 Jan 2007 03:32:01 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928799#M11263</guid>
      <dc:creator>Wim Van den Wyngaert</dc:creator>
      <dc:date>2007-01-19T03:32:01Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928800#M11264</link>
      <description>Phil,&lt;BR /&gt;&lt;BR /&gt;Many moons ago I wrote a "management abstract" about possible cluster configs &amp;amp; consequences.&lt;BR /&gt;&lt;BR /&gt;I was allowed to (translate it into English and) post it:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=724351" target="_blank"&gt;http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=724351&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Please, feel free (invited even) to make all possible use of it!&lt;BR /&gt;&lt;BR /&gt;Proost.&lt;BR /&gt;&lt;BR /&gt;Have one on me.&lt;BR /&gt;&lt;BR /&gt;jpe</description>
      <pubDate>Fri, 19 Jan 2007 04:20:47 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928800#M11264</guid>
      <dc:creator>Jan van den Ende</dc:creator>
      <dc:date>2007-01-19T04:20:47Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928801#M11265</link>
      <description>For more on this, see the excellent presentations by Keith Parris at&lt;BR /&gt;&lt;A href="http://www2.openvms.org/kparris/" target="_blank"&gt;http://www2.openvms.org/kparris/&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;I think the presentations given at HP Technology Forum 2006 by KP, which are first on that page, will be most useful.</description>
      <pubDate>Fri, 19 Jan 2007 05:03:26 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928801#M11265</guid>
      <dc:creator>Ian Miller.</dc:creator>
      <dc:date>2007-01-19T05:03:26Z</dc:date>
    </item>
    <item>
      <title>Re: Question on Clustering</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928802#M11266</link>
      <description>The third-site scenario is not a possibility because the company does not have a third site to use.  However, the idea of the third system at the primary user site is probably the best.&lt;BR /&gt;&lt;BR /&gt;I will look at the management abstract and go through all the presentation materials from Keith.&lt;BR /&gt;&lt;BR /&gt;Thanks for all the pointers, and I think I will go ahead and close out this thread now.&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Phil</description>
      <pubDate>Fri, 19 Jan 2007 17:21:42 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/question-on-clustering/m-p/3928802#M11266</guid>
      <dc:creator>Phillip Thayer</dc:creator>
      <dc:date>2007-01-19T17:21:42Z</dc:date>
    </item>
  </channel>
</rss>

