Operating System - OpenVMS

Phillip Tusa
Advisor

Cluster scenario and question

Greetings to all ...

We have a 3-node cluster of rx2620s loaded with OpenVMS V8.2-1 and all of our live production applications. I am putting together a plan of action to move one of the nodes to an off-site location. The off-site node will be connected to a T-1 ethernet fibre network communicating with the 2 nodes at the primary location. All three nodes have two active NICs and have been assigned their respective IP addresses.

So, in the event the 2 nodes at the primary site lose power, or if network connectivity is lost between the primary site and the secondary site, what preparations do I need to make (now, before we move the off-site node) to have just the off-site node working, without the other 2 nodes? I want to "fake" the off-site node into thinking it is still in the cluster, if that is possible.

How would I configure the off-site node in the event the fibre network "falls asleep" for, say, more than 2 minutes? I want to avoid the off-site node going into a "bugcheck" condition, if that is possible.

Hope these questions were clear.

Thanks in advance for any assistance!

--
Phil

"I'd rather be a VMS guy, any day of the week!"
5 REPLIES
Hoff
Honored Contributor

Re: Cluster scenario and question

A T1 link is 1.544 Mbps and doesn't meet the minimum 10 Mbps requirement for clustering. Staying within the bounds of the cluster Software Product Description (SPD) and HP software support typically means a T3-scale link. (See the Cluster SPD at http://docs.hp.com/ for details.)

While it's technically feasible to cluster over links with significantly lower bandwidth than the 10 Mbps minimum, these tend to get tangled up and backlogged when failures arise -- and if you're planning on using shadowing, you really need the bandwidth. A cluster does not react well to a degraded or backlogged link.
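To put rough numbers on that (back-of-the-envelope only, and the 36 GB figure is just an example disk size): a T1 is about 1.544 Mbps, call it roughly 190 KB/s of payload. A full shadow copy of a 36 GB member over that link works out to something like

    36,000,000 KB / 190 KB/s, roughly 190,000 seconds

which is a bit over two days with the link saturated and SCS traffic competing with the copy the whole time.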

Clustering doesn't use IP (prior to the V8.4 release, circa 2009H2), which means you will need to tunnel or bridge the cluster SCS traffic on any release before that.

You're going to be looking at operator intervention and/or a third site; the automated mechanisms will prevent partitioning, and the manual procedure (operational protocol) to restart the cluster in a partitioned environment will have to be worked out and documented. Your "faking" of the quorum here -- breaking or partitioning the cluster -- involves the IPC handler that was (IIRC) first available with V8.2-1; this is what breaks the quorum stall. Details on IPC and how to "break" a quorum are in the manual; basically you ^P the console and answer a question or two. (IPC is another name for the "IPL C handler", a low-level console-level interface into clustering and disk management.)

This should be done with great care, as you can then have applications active at both sites, and both lobes can be writing to the site-local and available volumes of a multi-volume host-based volume shadowing shadowset, for instance.

Do ensure your VOTES and EXPECTED_VOTES are correct regardless. "Gaming" these settings exposes you to failures during bootstrap, and provides no operational benefits once the cluster links have been initially established.
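As a sketch only (the one-vote-per-node values below are an example, not a recommendation for your configuration), the relevant entries live in each node's SYS$SYSTEM:MODPARAMS.DAT and get applied with AUTOGEN; new values take effect at the next reboot:

    ! MODPARAMS.DAT additions on each of the three nodes
    ! (one vote per node; EXPECTED_VOTES is the sum of all votes in the cluster)
    VOTES = 1
    EXPECTED_VOTES = 3

    $ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK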

OpenVMS I64 V8.3-1H1 is a better choice in general, if you can upgrade from V8.2-1.

Some reading material:
http://64.223.189.234/node/153
http://64.223.189.234/node/569

I'd suggest acquiring more formal help here, as this can look quite simple -- but it's seemingly always the weird cases that can nail you with any disaster-tolerance or disaster recovery sequence.

Volker Halle
Honored Contributor

Re: Cluster scenario and question

Phil,

If you give all 3 nodes one vote each, this yields a cluster quorum of 2, and any 2 nodes can form or continue the cluster if one node fails. If the intersite link fails, the 2 nodes at site A will continue automatically. The node at site B will hang, and if the intersite link stays down for more than RECNXINTERVAL seconds and then comes back, it will crash with a CLUEXIT bugcheck and re-join the cluster.
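To check what a node is using today (display only, nothing gets modified -- SYSGEN with USE ACTIVE shows the values in use on the running system, and the SHOW CLUSTER utility can show votes and quorum per member if you ADD those fields in /CONTINUOUS mode):

    $ MCR SYSGEN
    SYSGEN> USE ACTIVE
    SYSGEN> SHOW VOTES
    SYSGEN> SHOW EXPECTED_VOTES
    SYSGEN> SHOW RECNXINTERVAL
    SYSGEN> EXIT
    $ SHOW CLUSTER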

If site A loses power, you can MANUALLY recover the node at site B by forcing an IPC interrupt and re-calculating quorum. On Itanium systems, type Ctrl/P on the console and then Q at the IPC> prompt to recalculate quorum.
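Roughly what that looks like at the console (a sketch; the Ctrl/Z exit is from memory, so please check the quorum-recovery section of the OpenVMS Cluster Systems manual before you rely on it):

    Ctrl/P                 (drops the console into the IPC handler)
    IPC> Q                 (recalculate quorum; the hung node can then proceed)
    IPC> Ctrl/Z            (exit IPC and continue)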

Volker.
Robert Gezelter
Honored Contributor

Re: Cluster scenario and question

Phil,

I agree with Hoff. Please be very careful to avoid creating a partitioned cluster; it is far better to have the remote system crash to its console.

It is far better to have manual intervention than to have a partitioned cluster every time a communications technician switches a cable.

- Bob Gezelter, http://www.rlgsc.com
Phillip Tusa
Advisor

Re: Cluster scenario and question

Hello to all ...

Thanks to each of you for your very helpful responses. Needless to say, I am doing a lot of research on this one.

Thanks again, guys!

--
Phil

"I'd rather be a VMS guy, any day of the week!"
Jon Pinkley
Honored Contributor

Re: Cluster scenario and question

Phillip,

I am not sure what you mean by "The off-site node will be connected to a T-1 ethernet fibre network communicating with the 2 nodes at the primary location.", but if it is using a dedicated T1 as the only link between sites, then you almost certainly won't want a cluster node at that site.

While you can certainly configure a cluster and get it to "work" over a low bandwidth connection, performance will be much worse than it currently is with all nodes on the same LAN. And having the third node at the remote site will degrade the performance of the two nodes at the primary site.

A suggestion: before you move anything, try things using a WAN simulator box (a "free" example is FreeBSD and dummynet plus a PC with two supported Ethernet cards). Configure it for T1 throughput, a latency appropriate for the distance involved, and packet loss commensurate with your carrier, and then make that connection the only one to your "third node".
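For the latency number (the distance here is only an example): light in fibre covers roughly 200 km per millisecond, so a 500 km circuit is about 2.5 ms each way, call it 5 ms round trip, before any equipment and queueing delay gets added on top.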

Then decide whether you can get things to work well enough that it is acceptable.

I think you will find that it will be better to think of the remote site as a disaster recovery site than a disaster tolerant cluster.

You have said nothing about how the data (disk storage) will be accessed from the remote site.

Jon

P.S. Is this related in any way to your previous question?

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1240674
it depends