Re: disaster tolerance on cluster DS20 / RA7000

 
Dominique_11
Frequent Advisor

disaster tolerance on cluster DS20 / RA7000

Hi,

My cluster configuration is:

2 AlphaServer DS20E systems connected to a shared RA7000 RAID enclosure.
I just have a JBOD in my RA7000 and use host-based shadowing on VMS.

I want to install a disaster-tolerant configuration at a remote site (800 m away)
with the same hardware.

Question: If I install the same configuration at the remote site, connected only by Ethernet (100 Mb), can I mount a third member of an existing shadow set on the remote RA7000?

Regards,
dominique
17 REPLIES 17
Wim Van den Wyngaert
Honored Contributor
Solution

Re: disaster tolerance on cluster DS20 / RA7000

Karl Rohwedder
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

Host-based shadowing may consist of up to three members, but beware, you cannot shadow the system disk, each site should have its own systemdisk. Data disks should use hardware RAID at each site and be host-based shadowed across the sites.
Read the manuals carefully and check other resources; there were, for example, some articles in the Technical Journals (http://h71000.www7.hp.com/openvms/journal/toc.html).

regards Kalle
Ian Miller.
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

See also Keith Parris's excellent presentations:

http://www2.openvms.org/kparris/
____________________
Purely Personal Opinion
Jan van den Ende
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

Dominique,

I would also use (at least) TWO connections between the sites. And make sure they run along _DIFFERENT_ geographical paths (ALL the way, even inside both buildings).

hth

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Wim Van den Wyngaert
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

BTW: to be redundant, you had better use a RAID set (mirror or RAID 5) in each building and shadow these (so 2 members). Disasters never come alone.

Also verify the boot times of the routers and switches between the two buildings and make sure that the cluster survives them (use a larger timeout), unless of course the application can stand the outage (in which case you have to break the cluster and do a shadow copy when it comes back).

Wim
Wim
Jan van den Ende
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

Re Kalle

but beware, you cannot shadow the system disk, each site should have its own systemdisk

Well, not entirely true.
The important part is to always maintain a consistent state.
We concluded that separate system disks at two sites were (for us) not the best guarantee of continuous consistency.
We have configured _ALL_ of our nodes to network boot as satellites of any of the others.
That means that initially the system disk will be MSCP served. The moment a direct path to the SAN is established, access fails over to it.
It _DOES_ imply that, should we ever need a _CLUSTER_ boot, only manual operation is possible.

Re Wim:

to be redundant, you better take a raid set (mirror or 5) in each building and shadow these

Yes, at least! And to be able to make BACKUPs by removing a member, we want to LEAVE that config intact, so, except during the copy of that member to tape, we have 3 members.
And all this extra config complexity is why I was so pleased with the plans to allow more members. But those did not make it into 8.3, so surely any hope of back-porting is meaningless.

hth

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Wim Van den Wyngaert
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

Jan : our applications are 24/24. So we use db dumps and not shadow member backups.

Dominique: if you consider an inter-building SAN, consider also that a single SAN can have problems that bring the whole SAN down (we had that for 15 hours on the PC SAN).

Two SANs with shadowing between them seems the most reliable and stable solution. But not the fastest.

Wim
Wim
Dominique_11
Frequent Advisor

Re: disaster tolerance on cluster DS20 / RA7000

Thank you for your answers,

I think there is some confusion. Attached you will find the configuration that my customer would like.

I don't have a SAN and my systems are connected only by Ethernet.

Cordially,
dominique
Jan van den Ende
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

Dominique,

As a starter, I noticed VMS 7.3.
I would _MOST STRONGLY_ advise to go to 7.3-2, with latest or near-latest UPDATE patch.
In your proposed config it would be a real pity if you do NOT use HBMM (Host Based MiniMerge), which is only available starting 7.3-2 with HBMM V2 (included in the later Update Patches).

Your picture does not specify HOW the links are located (and you should make SURE they use different geographical pathways: it would be a pity to have it dual configured through 2 adjacent cables that can be dug out by a single digging action!)
I also noted that both sites seem to use just ONE network switch each? That would introduce 2 SPOFs (single points of failure), which is not a good idea for DT.

But: Keep on thinking, keep on designing, keep on asking. _THIS_ is the time when things can be changed rather easily and rather cheaply, later on will be MUCH more difficult, and MUCH more costly!

hth

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Dominique_11
Frequent Advisor

Re: disaster tolerance on cluster DS20 / RA7000

Thank you Jan,


I am thinking of installing the following elements in the final configuration: OpenVMS 7.3-2 with the latest patches.
2 network switches at each site.

The questions which remain unanswered are:

The two sites are approximately 1 km apart. I am afraid of running into blocking problems (shadow copy over MSCP-served disks) or cluster hangs, given that the only path connecting them is a 100 Mb Ethernet link (RJ45/fiber).
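
To put a rough number on the shadow-copy concern, here is a minimal back-of-the-envelope sketch. It assumes an 18 GB member (the LUN size mentioned later in this thread) and a 100 Mb/s link; the efficiency factors are purely illustrative guesses for protocol and MSCP overhead, not measured values, and the function name is hypothetical.

```python
def full_copy_hours(volume_gb: float, link_mbit_s: float, efficiency: float) -> float:
    """Hours needed to move volume_gb across the link.

    efficiency is the assumed fraction of the nominal line rate that is
    actually available to the copy (an illustrative guess, not a measurement).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    volume_bits = volume_gb * 1e9 * 8                  # decimal gigabytes -> bits
    usable_bits_per_s = link_mbit_s * 1e6 * efficiency
    return volume_bits / usable_bits_per_s / 3600


if __name__ == "__main__":
    # Assumed inputs: 18 GB member, 100 Mb/s inter-site Ethernet link.
    for eff in (1.0, 0.7, 0.5):
        hours = full_copy_hours(18, 100, eff)
        print(f"link efficiency {eff:.0%}: about {hours:.2f} h for a full copy")
```

Even at the full nominal line rate, a full copy of one 18 GB member occupies the link for about 24 minutes, and during that time it competes with cluster traffic on the same path. That is one reason the replies stress a second, independently routed link.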

The configurations I know are built from RA8000s (SAN) connected to each other over Fibre Channel (HSG80 -> SAN switch -> multimode fiber -> SAN switch -> HSG80).

For the quorum disk, which is installed at the 1st site:
Shadowing it is not allowed? How can this disk be replicated?

Cordially,
dominique
Ian Miller.
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

the recommended config is no quorum disk and the same votes at each site.
____________________
Purely Personal Opinion
Wim Van den Wyngaert
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

We have the majority of the votes in the site where the users are. This is done thru 1 old Alpha station that gives a vote and arbitrates when a node goes down (better than a q disk).

Wim
Wim
Andy Bustamante
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000


>>For the quorum disk, which is installed at the 1st site.

As Wim states, you should consider doing without a quorum disk and configuring votes to either allow your primary site to continue or add a third "quorum site" to allow continuous operation of either site.

I'd also recommend configuring Availability Manager or AMDS (on a non-clustered node). Define procedures for recovering quorum "on the fly" now, before there are problems. Carefully planned, this can allow you to restore service quickly if a site becomes unavailable. Without planning, it could create a partitioned cluster and data integrity issues.

Andy
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Dominique_11
Frequent Advisor

Re: disaster tolerance on cluster DS20 / RA7000

Thank you very much for your advice.

Some further questions:

Is it possible to add members of different sizes to a shadow set?
e.g. DSA0 = 1 LUN on the RA7000 (18 GB) + 1 LUN on an MSA1000 (36 GB).


What is the maximum recommended distance for shadowing through an MSCP-served disk?

Dominique
Ian Miller.
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

Since VMS V7.3-2, shadow set members do not have to be the same size. See Dissimilar Device Shadowing (DDS) in the docs:
http://h71000.www7.hp.com/doc/732FINAL/aa-pvxmj-te/aa-pvxmj-te.HTML
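
As a small illustration of what mixing sizes means in practice, the sketch below assumes that the usable capacity of a shadow set built from dissimilar members is bounded by its smallest member (check the DDS documentation linked above for the exact rules). The function names are hypothetical, and the sizes are the ones from the question.

```python
def usable_gb(member_sizes_gb: list) -> float:
    """Usable capacity, assuming the smallest member sets the limit."""
    if not member_sizes_gb:
        raise ValueError("a shadow set needs at least one member")
    return min(member_sizes_gb)


def unused_gb(member_sizes_gb: list) -> list:
    """Capacity left unused on each member under the same assumption."""
    limit = usable_gb(member_sizes_gb)
    return [size - limit for size in member_sizes_gb]


if __name__ == "__main__":
    members = [18, 36]  # RA7000 LUN and MSA1000 LUN from the question, in GB
    print("usable capacity:", usable_gb(members), "GB")
    print("unused per member:", unused_gb(members), "GB")
```

Under that assumption the 18 GB + 36 GB example behaves like an 18 GB volume, with half of the larger LUN left idle.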
____________________
Purely Personal Opinion
Jan van den Ende
Honored Contributor

Re: disaster tolerance on cluster DS20 / RA7000

dominique,

Re quorum disk:
"Quorum disks are a bad idea. One thing worse is a 2-node cluster without QDsk, and the worst are non-VMS clustering solutions".

Seriously, you NEED a quorum referee, and ANY active node (even the smallest Vax station) is much superior to a quorum disk.

A quorum disk can not, and should not, be shadowed, and certainly not inter-site.
Think of this simple scenario:
Both sites equal votes, with shadowed Qdsk.
Intersite link breaks. Now EACH site has half the active nodes, PLUS its (now single-member) quorum disk = QUORUM. So, each site continues independently, each (now independently) modifying ITS members of the data shadow sets. How quickly would those be significantly different at your site? And now, for the final touch, the intersite link comes back.....
That is also why it is VERY UNWISE ( = downright stupid) to use the hardware capability offered by multisite SAN controller-based replication.
VMS has no way to know of this, so also no way of preventing it, and that is the dangerous part.

The available solutions are
1. _THREE_ active sites (some 9/11 companies with one in each tower, plus a third somewhere else, are still in business BECAUSE of that, but try to sell THAT to the average, or even above-average, management!)
2. A quorum site. Technically also 3 sites, but one of those can be just a locked closet housing, say, a DS10, somewhere along one of your cabling routes
3. Declare one site "Primary". During normal operation that site has one or a few more votes than the secondary, and it will be the site that continues when the links disappear.
(this is what we are using)
4. Choose equal-weight sites. Now ANY intersite breakup needs human evaluation, and human intervention. It may well be a wise choice, but you have to make _DAMN SURE_ that _ANY_ human that _MIGHT_ be on duty when it happens, is _VERY SURE_ about what to do, or who to warn, before doing anything.
The already given suggestion of implementing AM or AMDS I fully endorse!
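
The vote arithmetic behind these options can be sketched in a few lines. This is a simplified model, not a simulation of the connection manager: it uses the standard rule quorum = (EXPECTED_VOTES + 2) / 2, truncated, and assumes two nodes per site as in Dominique's setup. The vote assignments and site names are illustrative assumptions, and the "shadowed quorum disk" case is modelled by letting both halves count the same disk vote.

```python
def quorum(expected_votes: int) -> int:
    """OpenVMS cluster quorum: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def holds_quorum(votes: dict, partitions: list) -> list:
    """For each partition (a list of vote holders that can still see each
    other), report whether its combined votes reach quorum.

    Partitions may overlap: listing a holder in two partitions models the
    unsupported case where each side sees its own copy of a quorum disk.
    """
    needed = quorum(sum(votes.values()))
    return [sum(votes[h] for h in part) >= needed for part in partitions]


if __name__ == "__main__":
    # Equal votes, no quorum disk: the link breaks and neither side has quorum.
    print(holds_quorum({"A": 2, "B": 2}, [["A"], ["B"]]))

    # Shadowed quorum disk (hypothetical): each site counts "its" disk member.
    print(holds_quorum({"A": 2, "B": 2, "QDSK": 1},
                       [["A", "QDSK"], ["B", "QDSK"]]))

    # Primary site with one extra vote: only site A continues.
    print(holds_quorum({"A": 3, "B": 2}, [["A"], ["B"]]))

    # Quorum site Q with one vote; site A is lost entirely.
    print(holds_quorum({"A": 2, "B": 2, "Q": 1}, [["B", "Q"]]))
```

The second case is the dangerous one: both halves reach quorum at the same time, which is exactly the partitioned cluster described above. In the first case both halves hang until a human intervenes (option 4), and in the last two exactly one side carries on.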

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Dominique_11
Frequent Advisor

Re: disaster tolerance on cluster DS20 / RA7000

Thank you all for your answers.

I now have all the elements I need to configure the new cluster and make the best possible use of shadowing.

Cordially,
dominique