Re: Disaster tolerant cluster configurations.

Jan van den Ende · 10-22-2004

Hi all,

Recently I was asked to prepare for management a one-page overview of VMS cluster configs for (continuous availability and) disaster tolerance. The brief: leave out techie lingo, and highlight the differences.

Since it was considered to provide previously unknown insights, I decided to take the trouble of translating it and publishing it here.

Use it at any opportunity you see for it; just leave my credentials on it (they are also there to add weight in the eyes of those to whom it is addressed).

... and do not make it a secret if it succeeds in what it is obviously intended for :-)

Cheers.

Have one on me.

Jan

Don't rust yours pelled jacker to fine doll missed aches.

Keith Parris · 10-22-2004

Thanks for this excellent write-up. I think it summarizes the options and the benefits of each extremely well. Thanks for sharing it with us.

Many of the readers of this forum may be unfamiliar with the OpenVMS platform and the capabilities it has compared with many other operating system platforms with respect to disaster tolerance -- in my opinion, it is far and away the strongest offering HP has in this area. I have done a comparison of disaster-tolerant capabilities across all of HP's platforms, entitled "Disaster-Tolerant Cluster Technology & Implementation", as well as sessions giving more details about disaster tolerance using OpenVMS, in user-group presentations which can be found at http://www2.openvms.org/kparris/

Just a couple of comments on your superb paper:

Within a site, you recommend at least 2 nodes. If access to the storage within that site from nodes at other sites is provided indirectly via the VMS MSCP Server code, then having at least 2 nodes at a site allows access to continue (and data on the shadowset member(s) at the site to remain up-to-date) despite downtime of one of the VMS nodes due to a failure or something like an operating system upgrade. With the availability in some cases now of inter-site Fibre Channel links, access to storage at a site may now continue whether or not there is even 1 VMS node up at that site, so that new technology can make availability even better when it is available. Of course, if you lose one or more sites, and you may have to run for an extended period of time in a reduced configuration, having multiple nodes at a site is very valuable for high availability of the application, and that may be reason enough to have 2 or more nodes at every site. (I just wanted to point out that the VMS Cluster technology itself doesn't require it.)

In Scenario D, you describe both sites as being secondary. Did you mean both sites can be considered as being primary?

Jan van den Ende · 10-22-2004

Keith,

"In Scenario D, you describe both sites as being secondary. Did you mean both sites can be considered as being primary?"

What that is intended to mean is that neither site can continue 'automatically', i.e., without human intervention. In the Dutch version that intention was obviously understood as such by the intended (and other) readers, and I really think the translation did not leave out or add meaning.

Maybe other terminology ("dark" or "lights-out" site vs. "light" or "manned" site) is better known, but those terms imply more than I intended.
It should be read as follows: a "Primary site" can withstand ONE other site falling over and continue providing service unassisted, whereas a "Secondary site" requires at least some human intervention before it continues delivering service. (And while a cluster that has lost quorum IS RUNNING, it is NOT providing service. It just waits for something to break the wait, one way or another.)
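
To make the primary/secondary distinction concrete, here is a minimal Python sketch for a two-site cluster. It assumes the commonly documented OpenVMS quorum rule, quorum = (EXPECTED_VOTES + 2) / 2 rounded down; the site names and vote counts are hypothetical and are not taken from the paper.

```python
def quorum(expected_votes: int) -> int:
    # Assumed OpenVMS rule: (EXPECTED_VOTES + 2) / 2, rounded down.
    return (expected_votes + 2) // 2


def classify_sites(site_votes: dict[str, int]) -> dict[str, str]:
    """Two-site cluster: a site is 'primary' if its own votes reach quorum,
    so it keeps serving unassisted when the other site is lost.
    Otherwise it is 'secondary' and waits for human intervention."""
    needed = quorum(sum(site_votes.values()))
    return {
        site: "primary" if votes >= needed else "secondary"
        for site, votes in site_votes.items()
    }


# Hypothetical vote assignments, for illustration only.
unequal = classify_sites({"site1": 2, "site2": 1})  # quorum is 2
equal = classify_sites({"site1": 1, "site2": 1})    # quorum is 2

print(unequal)
print(equal)
```

With unequal votes one site comes out as primary; with equal votes neither site reaches quorum on its own, which is the "both sites secondary" situation.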

Well, that makes the explanation almost longer than the paper itself, but review the specs:
Max one page, & no techie talk.
(And if any questions come up, delegate them!)

Cheers.

Join me in a beer.
(I am right now having a new beer, part of a mixed evaluation package in limited production, one of which is to be chosen to get in full production.
Really tough job! :-)

Jan

Don't rust yours pelled jacker to fine doll missed aches.

comarow · 07-11-2005

In a disaster, even with three sites, only one might remain, or communication can be lost.

It must be decided which becomes the primary site in that situation,
with the other systems halted and quorum
manually restarted.

Jan van den Ende · 07-11-2005

Bob,

I am not sure that I get your intention.
Yes, at the simultaneous loss of two sites you lose quorum.
But if 1 site goes first, AND IS DECLARED GONE, recalculating quorum restores a full 2-site cluster.
If the initial vote scheme is, e.g., 3-4-5, then after failure of one site and resetting quorum, a primary-secondary cluster remains, and the loss of the new secondary can be tolerated.
And really, if TWO of the three sites go in the same timeframe, you ARE in all kinds of trouble, of which the need to recalculate quorum is unlikely to be the biggest!

But yes, planning DT is all about worst-case scenarios, so this also has to be dealt with.
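
The 3-4-5 arithmetic can be checked with a short Python sketch. It assumes the commonly documented OpenVMS rule quorum = (EXPECTED_VOTES + 2) / 2 rounded down; the site names A, B and C and the helper functions are hypothetical, and "declared gone" is modelled simply as recomputing expected votes from the two remaining sites.

```python
def quorum(expected_votes: int) -> int:
    # Assumed OpenVMS rule: (EXPECTED_VOTES + 2) / 2, rounded down.
    return (expected_votes + 2) // 2


def keeps_running(surviving: dict[str, int], expected_votes: int) -> bool:
    """True if the surviving sites together still hold quorum."""
    return sum(surviving.values()) >= quorum(expected_votes)


votes = {"A": 3, "B": 4, "C": 5}   # 3-4-5 scheme; site names are hypothetical
expected = sum(votes.values())     # 12, so quorum is 7

# One site lost: the other two always hold at least 7 votes.
one_lost = {
    lost: keeps_running({s: v for s, v in votes.items() if s != lost}, expected)
    for lost in votes
}

# Two sites lost at the same time: no single site reaches 7 votes.
two_lost = {alone: keeps_running({alone: votes[alone]}, expected) for alone in votes}

# Site C is declared gone and quorum is reset for the remaining two sites.
remaining = {"A": 3, "B": 4}
new_expected = sum(remaining.values())           # 7, so quorum is 4
b_alone = keeps_running({"B": 4}, new_expected)  # B acts as the new primary
a_alone = keeps_running({"A": 3}, new_expected)  # A is the new secondary

print(one_lost, two_lost, b_alone, a_alone)
```

Whichever site fails first, the larger of the two survivors ends up able to carry on alone after the reset, which is why unequal votes are used.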

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Keith Parris · 07-12-2005

Some customers' availability requirements are such that they have 3-member shadowsets, with 1 member at each of 3 different sites. Some even have a 4th quorum site in addition to the 3 "main" sites. In such a case, if you were to give the quorum site one less vote than the total of all the other votes, then the cluster could continue to run without manual intervention despite loss of any two of the main sites; alternatively, the cluster could continue operating without manual intervention despite the loss of the quorum site. (As a simplified example, say each of the 3 main sites which have 1 shadowset member each are each given 1 vote, and the quorum node at the quorum site has 2 votes. EXPECTED_VOTES would be 5, Quorum would be calculated as 3, and the quorum node's 2 votes plus the vote from any single one of the 3 sites would still achieve Quorum.)
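
The simplified example can be enumerated in a few lines of Python. This is only a sketch: the site labels are hypothetical, and the quorum formula (EXPECTED_VOTES + 2) / 2 rounded down is assumed because it reproduces the Quorum of 3 quoted above for EXPECTED_VOTES of 5.

```python
from itertools import combinations


def quorum(expected_votes: int) -> int:
    # Assumed OpenVMS rule: (EXPECTED_VOTES + 2) / 2, rounded down.
    return (expected_votes + 2) // 2


# Three main sites with 1 vote each, plus a quorum site with 2 votes.
votes = {"main1": 1, "main2": 1, "main3": 1, "quorum_site": 2}
expected = sum(votes.values())   # 5
needed = quorum(expected)        # 3


def survives(lost: tuple[str, ...]) -> bool:
    """True if the sites that are still up hold at least quorum."""
    return sum(v for site, v in votes.items() if site not in lost) >= needed


# Check every combination of one or two lost sites.
results = {
    lost: survives(lost)
    for n in (1, 2)
    for lost in combinations(votes, n)
}

for lost, ok in results.items():
    print(", ".join(lost), "->", "continues" if ok else "hangs")
```

The enumeration also shows the limit of the scheme: losing the quorum site together with one main site leaves only 2 votes, so that case still needs manual intervention.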

Wim Van den Wyngaert · 07-25-2005

"As soon as it is clear that it is an attack it is decided to block mutations, but continue query functionality".

This is very specific to your application. Most applications do not allow that, so stopping everything is an alternative. But why did you block mutations?

Other remarks may be too technical ...

Don't forget to mention that people also have to be distributed, just like the cluster nodes.

And all other network parties must be accessible via connections in the site (we have 2 buildings 5 km apart but they are in 1 LAN, so e.g. 1 Bloomberg node has an IP address in each building and a building-specific route to that node).

And what's the use of interbuilding clusters when not all nodes of the chain are interbuilding clusters?

Wim

Jan van den Ende · 07-26-2005

Wim,

"... specific to your application."

Hardly.
Read again, this example is from "9/11".
(Alas, we do not have a 3-site config!)
In this case, the app was "Online Stock Exchange Trading", and it is entirely defensible that they DID decide to stop trading in view of events! At the same time, it was VERY desirable that querying remained functional.
And they DID switch trading back on when commercially appropriate, before the full cluster was restored to operation...

Hoping to have clarified things a bit.

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Wim Van den Wyngaert · 07-26-2005

Jan,

OK but I have not seen 1 application where you can disable updates. And 9/11 is a bad example of a typical disaster because the stock exchange was halted (for 1 week if I remember correctly; I remember because I lost a lot of money then).

If you have a disaster at your site, the world will not stop but continue without you.

(I thought they stopped updates because of the "creeping doom" phenomenon. I would in any case, if I were there in time.)

Wim

Jan van den Ende · 07-26-2005

Wim,

"OK but I have not seen 1 application where you can disable updates"

???
We are running a DBMS app - it can.
We are running a Basis+ app - it can.
We are running a Progress app - it can.

... and I very much doubt that list is complete.

"And 9/11 is a bad example of a typical disaster"

Is it?
How "typical" could a disaster be? The real "typical" in my view is the whole idea of being unexpected, with no way to be prepared other than preparedness for "anything unforeseen".
And then 9/11 certainly qualifies..

But remember, in the context of the original paper, this was only AN example of A possible setup, and how it worked out in ONE case.
See the specs; NO techie details... which we overstepped by far already!

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.
