Operating System - OpenVMS

Quorum Disk in DRM solution

 
Howard Arnold
Occasional Contributor

Quorum Disk in DRM solution

I am trying to set up a DRM solution using OpenVMS and EVA storage, but I am having trouble figuring out how to handle the quorum disk. If the site without the quorum disk fails, everything should be okay, but what happens if the site that has the physical quorum disk goes down? My configuration will have one node at each site, with the same storage mirrored between sites using volume shadowing. Is there something new in VMS that would let me volume shadow the quorum disk, or keep multiple copies of it? How have other people resolved this issue?

Thanks,

Howard
Wim Van den Wyngaert
Honored Contributor

Re: Quorum Disk in DRM solution

Howard,

You should add a dummy node (e.g. an old workstation) with a vote, instead of a quorum disk. That way the cluster stays up as long as two of the three nodes are up (see the sketch below).

I don't have FC, but maybe you can use an intersite mirrored disk.
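For illustration, a minimal sketch of the MODPARAMS.DAT entries for that scheme (the node roles are assumptions, not your actual configuration); with one vote per node, any two of the three members keep the cluster alive:

    ! SYS$SYSTEM:MODPARAMS.DAT on each member
    ! (2 production nodes + 1 'dummy' quorum node, 1 vote each)
    VOTES = 1               ! this node contributes one vote
    EXPECTED_VOTES = 3      ! total votes in the whole cluster
    DISK_QUORUM = " "       ! no quorum disk in this scheme
    ! quorum = (3 + 2) / 2 = 2, so any 2 of 3 nodes suffice
    ! then regenerate parameters and reboot:
    ! $ @SYS$UPDATE:AUTOGEN GETDATA REBOOT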

Wim
Howard Arnold
Occasional Contributor

Re: Quorum Disk in DRM solution

I would do that, but I still run into the same problem: if the site with the two nodes goes down, the site with the one node will hang because of expected votes.

Thanks,

Howard
Uwe Zessin
Honored Contributor

Re: Quorum Disk in DRM solution

But that is the purpose of quorum: you need a majority on one side to continue. What do you expect your configuration to do in a 'split-brain' situation (both sites fully functional, but unable to talk to each other)? I doubt you expect each site to continue on its own and produce diverging data, do you?

One 'solution' is to have a third site with another voting member to function as a 'tie-breaker'. Of course, this still fails if both sites lose their connections to each other and to the tie-breaker.

You need a mechanism to re-establish quorum at the site where you want to continue. That can be done via IPC> on a server's console, by booting another voting member, or via a management station that runs Availability Manager (if I remember the name correctly).
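For the console route, a rough sketch of the sequence on a VAX console (from memory; the exact steps vary by platform and version, so check the OpenVMS Cluster Systems manual):

    Ctrl/P          ! halt the hung node to the console prompt
    >>> D/I 14 C    ! deposit C into IPR 14 to request the IPC interrupt
    >>> C           ! continue; the IPC> prompt appears
    IPC> Q          ! recalculate quorum so the surviving nodes can proceed
    IPC> Ctrl/Z     ! dismiss IPC and resume normal operation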

Anyway - there is no magic 'silver bullet' against disasters.
Wim Van den Wyngaert
Honored Contributor

Re: Quorum Disk in DRM solution

The problem is that most of us only get two buildings; you need three to do this properly.
We have the quorum station in a different room from the servers.

Wim
Jan van den Ende
Honored Contributor

Re: Quorum Disk in DRM solution

Howard,

forget about DRM in combination with VMS.
You really want to use Host-Based Volume Shadowing!
And no, there is NO way to play tricks with the quorum disk.
If you are absolutely limited to two sites, then decide which site should survive a break of the connections (of course these ARE redundant, and the redundant paths NEVER run close to one another, right?).
Adjust your votes to make it so.
If multi-site, forget about a quorum DISK and think of a quorum SITE.
The same quorum principles apply.
(The only working, but not very cost-effective, way to keep a quorum disk would be a 3-site SAN, with the quorum disk at the third SAN site.)

Shadowed quorum disk?
At the moment your sites lose contact, your quorum disk shadow set separates as well. Each site sees its own node's votes, plus the vote of the quorum disk. Effectively you have ADDED a vote, because now EACH member of the former quorum disk shadow set presents its own view of "the" quorum disk vote!
Each site goes on independently. Within milliseconds your disks are inconsistent, and quorum was invented precisely to prevent THAT!!
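To make the arithmetic concrete (one node per site with one vote each, plus a quorum disk with one vote):

    EXPECTED_VOTES = 3                   ! 2 nodes + 1 quorum disk vote
    quorum = (EXPECTED_VOTES + 2) / 2    ! = 2 (integer division)

    After the intersite link breaks:
      Site 1 sees: own node (1) + its copy of the quorum disk (1) = 2  -> has quorum
      Site 2 sees: own node (1) + its copy of the quorum disk (1) = 2  -> has quorum

Both sites keep running independently: exactly the split brain that quorum exists to prevent.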

So,
either create a quorum site (the best solution),
or create a 'primary' and a 'secondary' site by unbalancing your votes (see the sketch below),
or accept that on losing the connection everything 'freezes' until resolved one way or another.
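For the 'primary'/'secondary' option in Howard's one-node-per-site case, a minimal sketch (node names are hypothetical):

    ! MODPARAMS.DAT on NODEA (the 'primary' site):
    VOTES = 1
    EXPECTED_VOTES = 1    ! quorum = 1: NODEA survives a link break on its own

    ! MODPARAMS.DAT on NODEB (the 'secondary' site):
    VOTES = 0             ! no vote: NODEB freezes on a link break
    EXPECTED_VOTES = 1

The price, of course, is that if the primary site itself dies, the secondary freezes too, until you intervene manually.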

And choosing among these options is NOT your decision!

It is your responsibility to make authoritative management understand this, and if they don't choose the 3-site option, make sure you have IN WRITING that they KNOW the risk, and that THEY decided to run that risk!

And whichever way they choose, it will result in a config that is more resilient than anything attainable on U*X or MS. (Although some IBM configs, or NSK, certainly beat the last two options.)



I wish you success!

Jan

Don't rust yours pelled jacker to fine doll missed aches.
Wim Van den Wyngaert
Honored Contributor

Re: Quorum Disk in DRM solution

Howard,

Your management will not write such a document.

If you implement two sites where one site has the majority of the votes, you risk that a CPU or memory problem at that site stops production.

So: if you have FC between the buildings and use a quorum disk, you can continue your production as long as you can see the remote quorum disk (see the sketch below).

If you don't have inter-building FC, I would find a cheap workstation and use it as a third vote in a third location (within the more important building). If that is not possible, go for Jan's solution.
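Going back to the quorum-disk variant: a hedged sketch of the MODPARAMS.DAT entries (the FC device name is purely hypothetical):

    ! MODPARAMS.DAT on both nodes (2 nodes + quorum disk = 3 votes):
    VOTES = 1
    EXPECTED_VOTES = 3
    DISK_QUORUM = "$1$DGA10"    ! shared FC disk visible to both nodes
    QDSKVOTES = 1               ! the quorum disk contributes one vote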

Wim
Howard Arnold
Occasional Contributor

Re: Quorum Disk in DRM solution

Thanks for the quick responses. I think I will have to choose the site with the least chance of failing and keep the quorum disk there until it is decided whether there will be a third site.

Howard
Wim Van den Wyngaert
Honored Contributor

Re: Quorum Disk in DRM solution

That's what I would do ...

Wim
Cass Witkowski
Trusted Contributor

Re: Quorum Disk in DRM solution

Actually, for disaster-tolerant designs you want to lose quorum if one site fails. This is to avoid the "creeping doom" situation.

Let's say you have two sites, A and B. Site A has one half of the shadow set plus the quorum disk, and Site B has the other half of the shadow set. You have a link between the sites for disk traffic as well as clustering traffic.

One fine day at 10:00 a.m. the link between the sites goes down. Site B, not having quorum, hangs. Site A, which has the quorum disk, continues to process transactions. Unfortunately, the reason the link went down was that a router in the communications room at Site A burst into flames.

At 10:20 the fire spreads into the computer room at Site A and destroys it.

Here is the problem: you have lost the data at Site A, including 20 minutes' worth of transactions. Those could be patient orders, stock transactions, bank transactions, lottery sales, etc. What do you do? All you have left is old data at Site B.

This is the creeping doom scenario. To avoid it, you stop processing at both sites, ascertain the reason for the loss of the link between the sites, and then choose the proper site to continue processing at. You can use a PC or a non-clustered Alpha server running DECamds (or Availability Manager, as it is called now) to readjust quorum at the site where you want to continue processing.

If you get a chance, take a class on disaster-tolerant configurations from Keith Parris, now back at HP. It was very eye-opening for us. A lot of what you do in a DT configuration is counter-intuitive compared to how you do things normally.

Cass
Ian Miller.
Honored Contributor

Re: Quorum Disk in DRM solution

See Keith Parris's presentations on DT configurations at
http://www2.openvms.org/kparris/
Keith Parris is the expert on these things and has much real-world experience.
He has been known to answer questions here too!
____________________
Purely Personal Opinion
Keith Parris
Trusted Contributor

Re: Quorum Disk in DRM solution

While there are some customers who use DRM (or Continuous Access, the new name) with OpenVMS -- most commonly to meet disaster recovery needs (for example, Manhattan Municipal Employees Credit Union put DRM in place as a result of the lessons learned in their 9/11 outage in NYC -- see http://www.nwfusion.com/news/2002/0902lessonsside1.html and http://www.totaltec.com/case_mcu.htm) -- most OpenVMS Cluster customers actually use Host-Based Volume Shadowing instead, for three main reasons: failover time and automatic failover, data access between sites, and cost.

With DRM/CA, it will take manual (or at best, scripted) action at the controller level to fail the cross-site mirrorsets over to the controller at the other site. This could take 15 minutes. With HBVS, the failover is automatic, and can happen within seconds (as controlled by the SHADOW_MBR_TMO parameter). Even if you have to manually recover quorum, it is actually much quicker (taking mere seconds) to recover quorum at the surviving site using Availability Manager (or DECamds) than to perform the manual failover for the DRM/CA disks. So it would be quicker to do this than to fail over a DRM-protected quorum disk even if you had one. And of course if you have a VMS system at a 3rd site the handling of quorum is automatic (and having the 3rd site also takes away most of the potential of encountering a "creeping doom" scenario).
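For reference, SHADOW_MBR_TMO is a dynamic SYSGEN parameter, so (as a sketch; the 20-second value here is just an example) it can be adjusted on a running system:

    $ RUN SYS$SYSTEM:SYSGEN
    SYSGEN> USE ACTIVE
    SYSGEN> SET SHADOW_MBR_TMO 20    ! seconds to wait for a lost shadow member
    SYSGEN> WRITE ACTIVE             ! takes effect immediately (dynamic parameter)
    SYSGEN> EXIT
    $ ! also add SHADOW_MBR_TMO = 20 to MODPARAMS.DAT so AUTOGEN preserves it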

With DRM/CA, disk units can be accessed only through the controllers at one site at a time. If, as many OpenVMS Cluster customers do, you want to run applications at both sites at once, this means all access to the disks from one site will have to be done remotely through the controller at the other site, either through a SAN connection between sites or through the VMS MSCP Server. With HBVS, all read operations can be directed to the units at the same site, and only the writes must go to both sites. Since I/Os in most OpenVMS environments tend to have a preponderance of reads compared with writes, this is a win for HBVS compared with DRM/CA in performance terms.
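For reference, a two-member cross-site shadow set under HBVS is just a MOUNT (the device and volume names here are hypothetical); reads can then be served by the local member while writes go to both:

    $ ! one member on the local EVA, one on the remote EVA
    $ MOUNT/SYSTEM DSA101: /SHADOW=($1$DGA101:, $1$DGA201:) DATA
    $ SHOW DEVICE DSA101:    ! verify both members joined the shadow set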

Finally, most customers find the license costs for DRM/CA to be higher than those for Volume Shadowing.

I find DRM/CA to be much more popular in conjunction with operating systems other than VMS, which don't have HBVS or whose clustering capabilities are more primitive, such that they can't run applications at both sites at once.

Quorum considerations for multi-site and disaster-tolerant clusters (including the "Creeping Doom" scenario) are covered in depth in various "Using OpenVMS for Disaster Tolerance" presentations at the website the previous poster kindly pointed out: http://www2.openvms.org/kparris/

There's also a good summary in the OpenVMS Technical Journal V1 article of the same title at http://h71000.www7.hp.com/openvms/journal/v1/index.html