Operating System - OpenVMS

"Standby" Node Management Advice

 
SOLVED
Jack Trachtman
Super Advisor

"Standby" Node Management Advice

I'm used to configuring either single-node or cluster systems. I have a project that is a little different and am looking for advice.

I will be configuring a stand-alone node, but with a dedicated, cold-standby backup system. Some quick info:

1) 2 ES45s
2) All disks (including system disks) will be on an HP SAN
3) All disks will be presented to both nodes
4) The BOOTDEF_DEV of each node will point to the same system disks
5) If the active node fails, the recovery procedure will be to confirm that it is "down", then boot the "cold" node.
6) We expect to do a monthly test of shutting down the active node & booting the "cold" node.

My request: what processes/procedures can I put in place to reduce (if not eliminate) the possibility that the "cold" node will be booted while the active node is running (which, I expect, would immediately corrupt the system disk, etc.)?

TIA
15 REPLIES
Robert Gezelter
Honored Contributor
Solution

Re: "Standby" Node Management Advice

Jack,

I will first make the somewhat obligatory observation that this is one of the most basic uses of OpenVMS clusters.

That said, I would not use the same system disk for both systems. I would also make sure that the procedures ensure that the "dead" CPU stays disabled; an accidental boot could be a serious problem.

There are few safeguards that one can construct without using OpenVMS clusters. Off the cuff, one could do something with multi-homed IP and PING, with each of the systems having a unique IP address and trying to reach the other node. If the other node responds, STOP. However, this is a bit of a hack, and it only guards against startup while the other node is running. If network connectivity is lost but the SAN remains operational, one is back where one started.
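A minimal sketch of what that PING check might look like as a DCL procedure invoked from SYSTARTUP_VMS.COM (assumptions: TCP/IP Services has already been started by this point, and PEER_ADDR is a hypothetical address the other ES45 answers on whenever it is running):

$!  PEER_CHECK.COM -- a sketch only, not tested; adapt before use.
$   PEER_ADDR = "192.168.1.2"           ! hypothetical address of the other node
$   SET NOON                            ! a failed PING must not abort startup
$   DEFINE/USER_MODE SYS$OUTPUT NL:     ! discard the PING output
$   DEFINE/USER_MODE SYS$ERROR  NL:
$   TCPIP PING 'PEER_ADDR' /NUMBER_PACKETS=3
$   IF $SEVERITY .EQ. 1                 ! success status => the peer answered
$   THEN
$       REQUEST "Peer node answered PING -- standby startup halted, operator action required"
$       EXIT                            ! do NOT go on to mount application disks
$   ENDIF

Note the limitation: by the time this runs, the booting node has already mounted the system disk, so it narrows the window rather than closing it. Hence "hack".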

On the OpenVMS cluster front, there was a presentation at the 2007/2008/2009 HP Techforum on a simple little package, for use with OpenVMS clusters, that would allow one to easily migrate single node applications to the other node in the event of a failure. The intent was to provide a way to provision non-cluster aware applications in a cluster.

- Bob Gezelter, http://www.rlgsc.com
Graham Burley
Frequent Advisor

Re: "Standby" Node Management Advice

Doing this without clustering is just not safe. I suggest you have a physical power switch/interlock that prevents both systems being powered on at the same time.
Andy Bustamante
Honored Contributor

Re: "Standby" Node Management Advice

There is a serious risk of an on-call first responder attempting to put both systems into service at the same time. Reconsider the justification for cluster licenses.

That said, I would configure a unique system disk for each ES45. The failover server starts and mounts no secondary disks. Document your failover procedure to change the boot device on BOTH systems to place the failover server into production, as illustrated below. Having the standby server running lets you know if it suffers a hardware problem, though you add the risk of having secondary disks mounted.
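For example, the documented procedure might boil down to a few console commands on each machine (the DGA device names below are hypothetical; substitute your own SAN units):

! On the failed production node, once it is confirmed halted:
>>> SET BOOTDEF_DEV dga200.1001.0.1.0   ! its standby-only system disk
! On the standby node being promoted:
>>> SET BOOTDEF_DEV dga100.1001.0.1.0   ! the production system disk
>>> BOOT

Changing BOTH consoles in the same step is what keeps a later accidental power-up of the failed box from landing on the production disk.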

Make sure the risks are documented and management understands the trade-off. Short-term savings now can lead to downtime and lost-data costs in the future.

You can do this; is it a good idea? "It depends."
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Jim_McKinney
Honored Contributor

Re: "Standby" Node Management Advice

> Doing this without clustering is just not safe.

I agree with this.

I presume that the two nodes will use the same value for BOOT_OSFLAGS as well as the same BOOTDEF_DEV?

If so,

and you configure this system as a cluster member,

and these systems share a common network that will pass SCS traffic,

then the VMS clustering software would force the second node to CLUEXIT bugcheck very early in the boot process, as it would be attempting to boot with the same SCSNODE and SCSSYSTEMID as the running node (regardless of votes, etc.).

A cluster can't have two nodes with the same identity and won't let it happen...
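For illustration, the shared identity would live in SYS$SYSTEM:MODPARAMS.DAT in each root (the values here are made up; AUTOGEN applies them):

! MODPARAMS.DAT fragment -- illustrative values only
VAXCLUSTER = 2          ! always boot as a cluster member
SCSNODE = "PRODA"       ! same node name in both roots
SCSSYSTEMID = 19553     ! same system ID in both roots
VOTES = 1
EXPECTED_VOTES = 1

The second machine to boot announces "PRODA" over SCS, the running node rejects the duplicate identity, and the booter takes the CLUEXIT bugcheck, as described above.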
Hoff
Honored Contributor

Re: "Standby" Node Management Advice

Though left unstated, I'm guessing you're looking at this scheme because of the cost of the clustering license PAKs here, and you're (still) looking at Alpha hardware because of some particular and unspecified dependencies.

If that's the case, that implies you might also want to look at what a different operating system might offer you here. Or at re-working the application to work around the lack of a cluster PAK, though that duplication of what clustering provides is going to cost some money. (On the other hand, a good cluster design might also mean your code can be more platform-portable, too.)

Easiest? Install an operating system platform that's suited for this particular design. That's not VMS. Sure, you can hack VMS to work here, particularly with the mess that is host names and - if the host and peripheral devices aren't identical - the hardware configuration. And the risk of collisions. But you'll be fighting the OS. VMS really wants cluster-aware designs, and wants to run both boxes in parallel, and wants to use both of the boxes rather than letting one sit cold.

What do you need to look at to force-fit VMS into this design, from the OS level? Careful management of host-specific device names. Maintenance of the NIC addresses and the rest of the networking set-up. A decision on whether you're eventually going to move to clustering or stay with this scheme. And interlocks to ensure both boxes aren't booted in parallel (and that the applications themselves aren't operating in parallel), as others have mentioned.

Another approach available here is a fast-install or fast-recovery scheme from your own archives (using InfoServer or such), involving your own system disk images and maintenance of your code and your data. This requires some maintenance of the existing system and application environment; eliminating ad-hoc operations of the platform from the environment may or may not be feasible here.

And FWIW, if you have solved the sequential-boot cold-standby scheme that you're aiming for, you've largely also implemented the fast load scheme, too.

Most of what's here is an application programming question, really. Designing a cold-start, single-host, non-server-parallel application. And either ignoring, or explicitly bypassing, most of what VMS provides here. Probably the best way to do that is to ignore as much of VMS as you can. (Which also then leads you to look at whether you even need or want VMS underneath there, too.)
RBrown_1
Trusted Contributor

Re: "Standby" Node Management Advice

Is it possible to configure the SAN so that the disks can only be connected by one system at a time?
Jan van den Ende
Honored Contributor

Re: "Standby" Node Management Advice

Jack,

really, if this is what is needed, your safest (and in the end probably cheapest as well) solution is a cluster license. Set up each node to boot from the same root of the same device.
You will essentially be running a one-node cluster, but most importantly, in one single pass you will be GUARANTEED to never run both nodes simultaneously (as already mentioned, that would lead to a CLUEXIT before damage can be done). That includes safeguarding against ANY potentially erroneous OR MALICIOUS or plain stupid wrong actions.

Consider the cost of trying to think of, write, and implement something that TRIES to reach something similar, and you will find the license is actually the cheapest option.

just my EUR 0.02...

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
John Gillings
Honored Contributor

Re: "Standby" Node Management Advice

Jack,

Your basic assumption here seems to be that any system failure will be of the ES45 box, rather than (say) the disks, SAN, power, aircon, network, etc... You may want to look at other failure scenarios to make sure you're not putting all your recovery eggs in just one (unlikely?) basket.

Without clustering you have no way to detect another node booting from the same disk, and no protection against corruption.

Perhaps you should present the disks to only ONE node at a time? Part of your recovery procedure then becomes switching the presentation at the SAN level. If you can't see the disk, you can't boot from it, or corrupt it.
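The corresponding check before any boot attempt is then trivially simple at the standby's console (device name hypothetical):

>>> SHOW DEVICE          ! the shared disk should NOT be listed yet
(switch the presentation at the SAN)
>>> SHOW DEVICE          ! confirm dga100 now appears
>>> BOOT DGA100

With presentation switching, even a BOOTDEF_DEV pointing at the shared disk is harmless on the node that cannot see it.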
A crucible of informative mistakes
Art Wiens
Respected Contributor

Re: "Standby" Node Management Advice

All good info so far ... what makes me want a bit more info is that you pluralized "system disks":

"2) All disks (including system disks) will be on an HP SAN"

Was that intentional or a typo? Do you plan to have some sort of CA (Continuous Access) going on to another site? If you bite the bullet on Cluster licenses, while you're begging you should ask for Shadowing licenses and make that bit of the DR "easier" too.

Cheers,
Art