Question on Clustering
01-18-2007 08:37 AM
Suppose you have four systems clustered together, two at each of two data centers. Each data center has an EVA with matching LUNs configured, and these drives are shadowed between the systems using Volume Shadowing. What happens if the communications between the two data centers drops and the systems at both data centers continue to operate? If both pairs of systems continue to run, accessing data from their member of the shadow set, how is the data synchronized when communications are restored?
Thanks,
Phil
01-18-2007 08:59 AM
Solution
This is where QUORUM comes into play.
If one of the data centers has more votes than the other, that is the one that continues service; the other data center system(s) "hang" in "quorum lost".
If the votes are shared equally, ALL systems at both sites hang. (So you had better declare a "primary" site, or, preferably, add a "quorum system" at a 3rd site. That may be just a small system doing nothing but watching quorum.)
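The vote arithmetic behind this can be sketched in a few lines. This is only an illustration, not OpenVMS code: it assumes one vote per system, uses the rule that quorum is (EXPECTED_VOTES + 2) / 2 rounded down, and the function names are made up for the example.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def keeps_running(votes_visible: int, expected_votes: int) -> bool:
    """A group of nodes keeps running only if the votes it can see reach quorum."""
    return votes_visible >= quorum(expected_votes)


# Assumption: four systems, one vote each, two per data center.
print(quorum(4))                # 3
print(keeps_running(2, 4))      # False: after a link break, BOTH sites hang

# Assumption: a fifth one-vote "quorum system" is added at a third site.
print(quorum(5))                # 3
print(keeps_running(2 + 1, 5))  # True: the site that still sees the quorum system
print(keeps_running(2, 5))      # False: the site that is cut off from it
```

With an even split neither half can reach 3 votes, which is why a tie stalls everything until a human intervenes.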
At such times, the (human) decision _CAN_ be taken to tell the "hanging" (minority) data center to continue (AMDS, IPC). Make __D**MN__ sure that anyone authorized to make and execute that decision has verified 200% (very sure, and then double-checked) that the other site is __REALLY__ dead!!
I can testify that on the one occasion that happened to us, it worked as advertised.
(a "brilliant" millennium test of the fire alarm cut ALL power, including the high-voltage no-break circuitry)
hth
Proost.
Have one on me.
jpe
01-18-2007 09:19 AM
Re: Question on Clustering
Whichever lobe(s) of the cluster are found to be lacking quorum will halt all processing. If the cluster communications disconnection lasts long enough that the nodes in the disconnected lobe(s) are ejected from the cluster (processing will obviously continue on the lobe(s) that retain quorum and connectivity), the ejected nodes will (eventually) CLUEXIT, reboot, and rejoin the cluster when/if connectivity returns to the configuration. Shadowing will then resynchronize the contents of the disks at the (formerly) disconnected lobe(s) based on the contents from the nodes that have continued. When the node(s) of the lobe(s) are ejected from the cluster, their associated locks and such are automatically released for use by the member node(s) in the remaining lobe(s), and appropriate status values are returned.
In a case where there are exactly balanced lobes and no lobe maintains a majority of the votes (and thus no lobe has quorum by itself), there is and can be no automated way to choose the surviving lobe, and the whole cluster will stall pending manual operator intervention. One alternative is to pick one of the two lobes and configure it with more than half the votes. Another is to configure a "quorum lobe", and provide connectivity from that lobe to both of the (larger? main?) lobes.
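To compare those three layouts, here is a rough sketch that works out which lobes still hold quorum for a given set of surviving links. The lobe names, vote counts, and the `surviving_lobes` helper are all hypothetical. The model is deliberately simplified: it treats any chain of working links as connectivity, so it is only meaningful for failures that cut a lobe off completely, and it says nothing about how the connection manager actually behaves.

```python
from collections import deque


def quorum(expected_votes):
    # Quorum rule: (EXPECTED_VOTES + 2) / 2, rounded down.
    return (expected_votes + 2) // 2


def surviving_lobes(votes, links):
    """Return the lobes that belong to a connected group holding quorum.

    votes: dict of lobe name -> total votes configured in that lobe
    links: iterable of (lobe, lobe) pairs whose link is still working
    """
    needed = quorum(sum(votes.values()))
    neighbours = {lobe: set() for lobe in votes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    survivors, seen = set(), set()
    for start in votes:
        if start in seen:
            continue
        # Breadth-first search collects every lobe reachable from `start`.
        group, queue = {start}, deque([start])
        while queue:
            for nxt in neighbours[queue.popleft()]:
                if nxt not in group:
                    group.add(nxt)
                    queue.append(nxt)
        seen |= group
        if sum(votes[lobe] for lobe in group) >= needed:
            survivors |= group
    return survivors


# Balanced lobes, inter-site link down: nobody has quorum.
print(sorted(surviving_lobes({"A": 2, "B": 2}, [])))                       # []
# Lobe A configured with more than half the votes: A continues alone.
print(sorted(surviving_lobes({"A": 3, "B": 2}, [])))                       # ['A']
# Quorum lobe Q, with lobe A completely cut off: B and Q continue.
print(sorted(surviving_lobes({"A": 2, "B": 2, "Q": 1}, [("B", "Q")])))     # ['B', 'Q']
```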
This is a variation of the two-node cluster configuration, where you have two voting nodes. The usual suggestions are a third (voting) node, or a quorum disk on a shared interconnect.
Key to this whole clustering scheme and key to avoiding a partitioned cluster are the VOTES and EXPECTED_VOTES system parameters. A partitioned cluster is a case where disconnected node(s) continue processing without synchronization. This is bad. The OpenVMS Cluster software goes out of its way to avoid this case. (See the OpenVMS FAQ for details on establishing and maintaining the proper settings of the VOTES and EXPECTED_VOTES parameters. http://www.hoffmanlabs.com/vmsfaq/ has the most current copy of the FAQ.)
All host-based volume shadowing shadowset disks will recover following the CLUEXIT and subsequent reboot, once connectivity is re-established. The most recently accessed members -- the copies of the volumes at the lobes that have continued processing -- will be the source of the data, as you would expect.
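The "most recent member is the copy source" idea can be pictured with a toy model. This is not how host-based volume shadowing is implemented. The `generation` counter is a hypothetical stand-in for whatever the shadowing software uses to tell which member is most current, and the member names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    generation: int = 0                      # hypothetical "how current am I" counter
    blocks: dict = field(default_factory=dict)


def write(reachable_members, block, data):
    """Apply a write to every member that is currently reachable."""
    for m in reachable_members:
        m.blocks[block] = data
        m.generation += 1


def resynchronize(members):
    """Copy from the most current member to all the others; return the source."""
    source = max(members, key=lambda m: m.generation)
    for m in members:
        if m is not source:
            m.blocks = dict(source.blocks)
            m.generation = source.generation
    return source


site_a = Member("EVA-A")
site_b = Member("EVA-B")
write([site_a, site_b], 1, "written before the link dropped")
# Link drops; site B has lost quorum and is hung, so only A's member is updated.
write([site_a], 2, "written while site B was hung")

source = resynchronize([site_a, site_b])
print(source.name)                      # EVA-A
print(site_b.blocks == site_a.blocks)   # True
```

Because the hung lobe could not write at all, there is never a second, conflicting set of updates to merge; the copy only ever runs in one direction.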
Stephen Hoffman
HoffmanLabs
01-18-2007 10:33 AM
Re: Question on Clustering
They are currently running in an OpenVMS environment and I am positive that OpenVMS Clustering, with volume shadowing between the two sites, is the answer for this. What I am NOT certain about is what to do in the situation where the communications link between the two sites is lost and they end up with two viable clusters updating the data separately in two locations.
Do I simply have the customer choose which site is the "parent" site and add a third node as a quorum node in that site? Then if the communications drops, the "child" site will hang with quorum lost until communications is restored?
Phil
01-18-2007 10:59 AM
Re: Question on Clustering
>But if one of the locations drops the
>other location will continue to function
>without any interruption
Depending on your precise definition of "without any interruption", that might not be possible (it may take a few seconds or minutes to recover). If you require recovery at either site without some kind of intervention, then you'll need a 3rd "quorum" site. Whichever site can still see the quorum site will continue running. The other will hang.
>where the communications link between the
>two sites is lost and they end up with two
>viable clusters in two locations updating
>the data separately in two locations.
In a correctly configured cluster, this should not be possible for precisely the reason you're worried about. You can't have the "same" data being independently updated at both sites. That's the whole purpose of QUORUM - to prevent uncoordinated access to shared data.
>Do I simply have the customer choose which
>site is the "parent" site and add a third
>node as a quorum node in that site. Then
>if the communications drops, the "child"
>site will hang with quorum lost until
>communications is restored?
This is relatively simple to do, and at first glance seems attractive. Most customers think in terms of "primary" and "secondary" sites. However, for some failure scenarios it is NOT a good plan. Consider what Keith Parris calls "creeping doom":
A small fire starts in the COMMS room of your "primary" site, breaking the connection between sites. The "secondary" site loses quorum and hangs. The "primary" site still has quorum, so it continues to process data, updating your data bases.
The fire now spreads, eventually destroying everything at the primary site. Although the secondary is still intact, the data processed between the loss of comms and the destruction of the site is lost.
In contrast, if you had a 3rd quorum site, the loss of comms at the primary site would mean it lost quorum and would hang. Since the secondary site can still see the quorum site, it continues to process data. When the primary site is destroyed, the secondary still has good, up-to-date copies of the data.
As always, it all depends on what failures you want to protect against, your recovery time requirements and how much money you're prepared to spend. Talk to the HP DTCS folk. They live and breathe this stuff.
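The two outcomes of the "creeping doom" scenario can be compared with a small sketch. Everything here is an assumption made for illustration: the vote counts, the site names, the figure of 500 updates, and the `updates_lost` helper itself. It only models the single failure sequence described above, where the primary site's comms fail first and the site is destroyed later.

```python
def quorum(expected_votes):
    # Quorum rule: (EXPECTED_VOTES + 2) / 2, rounded down.
    return (expected_votes + 2) // 2


def updates_lost(site_votes, visible_after_comms_loss, updates_during_fire):
    """Count updates that exist only at the primary when it is destroyed.

    site_votes: votes per site, e.g. {"primary": 3, "secondary": 2}
    visible_after_comms_loss: for each site, the set of sites it can still
        see (including itself) once the primary's comms room has burned
    updates_during_fire: updates made between comms loss and destruction
    """
    needed = quorum(sum(site_votes.values()))

    def has_quorum(site):
        return sum(site_votes[s] for s in visible_after_comms_loss[site]) >= needed

    # Only an isolated primary that KEEPS quorum goes on accepting updates,
    # and those updates never reach the secondary's shadow set members.
    return updates_during_fire if has_quorum("primary") else 0


# "Primary" site holds the extra vote; there is no third site.
two_site = updates_lost(
    {"primary": 3, "secondary": 2},
    {"primary": {"primary"}, "secondary": {"secondary"}},
    500,
)

# Equal sites plus a one-vote quorum site that the secondary can still reach.
three_site = updates_lost(
    {"primary": 2, "secondary": 2, "quorum": 1},
    {
        "primary": {"primary"},
        "secondary": {"secondary", "quorum"},
        "quorum": {"secondary", "quorum"},
    },
    500,
)

print(two_site)    # 500
print(three_site)  # 0
```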
01-18-2007 02:22 PM
Re: Question on Clustering
In a similar situation, what I had was essentially two sites with identical numbers of systems.
In one such situation, I used a quorum disk at the primary site to act as a "tie breaker". You cannot use HBVS to shadow a quorum disk (for the obvious reason).
Basically, this pre-assigns an action in the event of a tie. The contingency of a problem can be resolved by having an alternate quorum disk (pre-configured at the other site, and not online).
The important thing, IMO, is to ensure that in the event of an emergency, the path to a functioning environment is as short, safe, and accident-proof as possible. I often recommend having alternate system roots with the changes pre-configured; it prevents 0300 accidents.
- Bob Gezelter, http://www.rlgsc.com
01-18-2007 02:47 PM
Re: Question on Clustering
Never.
With proper settings of the VOTES and EXPECTED_VOTES parameters, there will never be multiple active partitions.
Never.
Partitioning is bad. Really, really bad.
When a subset of the cluster nodes has and can maintain quorum, processing can continue. Shadowset member volumes will be accessible. All will be well within that subset of cluster member nodes.
When a cluster node does not have quorum, the node will hang. No storage modifications are permitted, and all applications and all code requiring quorum are stalled; no applications can be scheduled.
When nodes are unreachable and quorum is available for a sufficient interval, the lower-voting nodes get tossed out of the cluster. These recalcitrant and disconnected and under-quorum'd cluster nodes must be rebooted to re-enter the cluster. You'll see these nodes CLUEXIT if/when connectivity is re-established. As the recalcitrant node(s) did not have quorum, those node(s) will have been deliberately wedged during that interval. No shadowset modifications. No nothing.
System-level code can continue, for specific operations. Device drivers, for instance, can perform certain necessary processing. Application processes, databases, and user-level code in general, however, will be well and truly wedged.
The quorum hang used to be a deliberately-introduced log-jam at IPL 4. Now it is implemented with scheduling, and you'll see the various processes wedged in RWCAP states.
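The life cycle of a node on the losing side, as described above, can be summarized as a small state machine. This is a reading aid only: the state and event names are invented for the sketch and do not correspond to anything in OpenVMS itself.

```python
from enum import Enum, auto


class State(Enum):
    MEMBER = auto()     # in the cluster with quorum: applications run
    HUNG = auto()       # no quorum: applications stalled, no shadowset writes
    REMOVED = auto()    # still hung, and the surviving nodes have dropped it
    REBOOTING = auto()  # CLUEXIT once it sees the cluster again, then reboot


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.MEMBER, "quorum_lost"): State.HUNG,
    (State.HUNG, "connectivity_restored"): State.MEMBER,
    (State.HUNG, "removed_by_survivors"): State.REMOVED,
    (State.REMOVED, "connectivity_restored"): State.REBOOTING,
    (State.REBOOTING, "rejoined"): State.MEMBER,
}


def step(state, event):
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.MEMBER):
    """Feed a sequence of events through the state machine."""
    for event in events:
        state = step(state, event)
    return state


def applications_run(state):
    """Only a full member with quorum schedules application work."""
    return state is State.MEMBER


# Short outage: the node hangs, then simply resumes.
print(run(["quorum_lost", "connectivity_restored"]))
# Long outage: the node is removed, CLUEXITs on reconnect, reboots and rejoins.
print(run(["quorum_lost", "removed_by_survivors",
           "connectivity_restored", "rejoined"]))
```

In no state other than `MEMBER` does the node modify storage, which is the whole point: there is never a second active partition.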
OpenVMS goes out of its way to detect and prevent cluster partitioning. With proper VOTES and EXPECTED_VOTES settings, you won't ever see cluster partitioning.
The OpenVMS FAQ has details of the VOTES and EXPECTED_VOTES scheme, and of the cluster quorum mechanisms. Details of why you can't use host-based volume shadowing for a quorum disk, too.
01-18-2007 05:06 PM
Re: Question on Clustering
01-18-2007 07:32 PM
Re: Question on Clustering
Wim
01-18-2007 08:20 PM
Re: Question on Clustering
Many moons ago I wrote a "management abstract" about possible cluster configs & consequences.
I was allowed to (translate it into English and) post it:
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=724351
Please, feel free (invited even) to make all possible use of it!
Proost.
Have one on me.
jpe
01-18-2007 09:03 PM
Re: Question on Clustering
http://www2.openvms.org/kparris/
I think the presentations given at HP Technology Forum 2006 by KP which are first on that page will be most useful.
Purely Personal Opinion
01-19-2007 09:21 AM
Re: Question on Clustering
I will look at the management abstract and go through all the presentation materials from Keith.
Thanks for all the pointers and I think I will go ahead and close out this thread now.
Thanks,
Phil