Operating System - OpenVMS

Re: "Standby" Node Management Advice

 
Bob Blunt
Respected Contributor

Re: "Standby" Node Management Advice

Jack, one of the OpenVMS Ambassadors, Eddie Orcutt, has put together software to handle a scenario similar to the one you're describing. He may also have presented it at Connect symposia. I haven't seen him watching the OpenVMS forum on ITRC, but I'll send him an email pointer to your original entry. Enquiring and enterprising minds could probably also work out his HP email address, but keep in mind he's in sales support, so he can be out of pocket...

bob
Bob Vida
Advisor

Re: "Standby" Node Management Advice

Expanding on what rbrown suggested:

We had a similar situation when we were testing new replacement servers with SAN-attached storage.
On the storage arrays (EVA and EMC), we would present the storage only to the server we wanted to boot.
When we wanted to boot the new server for testing, a script was run on the storage arrays to unpresent the disks from one server and present them to the other.

Of course, this doesn't stop someone from executing the script to move the disks between the servers when it isn't intended; that would have to be covered by an operational process. We were in a test environment, though, and we also made sure we had valid backups available.

It's a quick and dirty approach if clustering isn't an option: just an extra step of performing storage masking on the array while both servers are down.
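
Just to illustrate, the "operational process" wrapper around the array-side scripts doesn't need to be more than a DCL procedure that forces a confirmation before anything moves. This is only a sketch: SAN_UNPRESENT.COM, SAN_PRESENT.COM and the node names are placeholders, and the real commands depend on your EVA/EMC management tools.

$! MOVE_LUNS.COM - hypothetical wrapper around the array-side scripts
$ INQUIRE CONFIRM "Move the data LUNs from NODEA to NODEB? [Y/N]"
$ IF CONFIRM .NES. "Y" THEN EXIT
$ WRITE SYS$OUTPUT "Unpresenting LUNs from NODEA..."
$ @SAN_UNPRESENT NODEA      ! placeholder for the array CLI call
$ WRITE SYS$OUTPUT "Presenting LUNs to NODEB..."
$ @SAN_PRESENT NODEB        ! placeholder for the array CLI call
$ WRITE SYS$OUTPUT "Done. NODEB can now be booted from the SAN disks."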

HTH,
Bob
RBrown_1
Trusted Contributor

Re: "Standby" Node Management Advice

Bob Vida suggested running a script on the storage array to configure which system can access the disks.

I was thinking more along the lines of the way RAxx disk drives worked. They had two ports, but could be connected on only one port at a time. Once system A connected to the drive on Port A, system B's attempts to connect on Port B would not succeed. This worked well for us when we had an active system and a warm standby, with the data disks connected to whichever system was active.

But I don't know if a modern SAN can be made to work this way.
Kelly Stewart_1
Frequent Advisor

Re: "Standby" Node Management Advice

We took a similar approach long ago (long enough ago that it was initially done on MicroVAX 2000s):

We attached a serial T-switch to both nodes' serial ports, and put a serial loop-back connector (free from DEC with every computer!) on the common connector. Both nodes were on all the time ("warm" instead of "cold"). When the nodes booted, they transmitted a string to the T-switch; the one that got the string echoed back was the "primary". The nodes also connected to one another task-to-task to ensure we didn't have two "primaries". The other node was the "secondary". The boot process then defined a logical name that contained the node's status, and that logical was used to determine which site-specific start-up files were run, and also to alert users if they happened to log into the wrong machine.
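
In outline, the boot-time check can be as simple as the following. This is only a sketch: TXA0:, the logical name SITE$NODE_ROLE, and the startup file names are made-up stand-ins, and the task-to-task cross-check between the nodes isn't shown.

$! Decide primary vs. secondary at boot via the T-switch loop-back
$ OPEN/READ/WRITE PORT TXA0:
$ WRITE PORT "WHO_AM_I"
$ READ/TIME_OUT=5/ERROR=SECONDARY PORT ANSWER
$ IF ANSWER .NES. "WHO_AM_I" THEN GOTO SECONDARY
$ DEFINE/SYSTEM SITE$NODE_ROLE "PRIMARY"
$ GOTO DONE
$SECONDARY:
$ DEFINE/SYSTEM SITE$NODE_ROLE "SECONDARY"
$DONE:
$ CLOSE PORT
$! Run the site-specific startup for whichever role we ended up with
$ @SYS$MANAGER:SITE_STARTUP_'F$TRNLNM("SITE$NODE_ROLE")'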

This approach allowed us to automatically back up application status to the secondary, and also to fail over without rebooting if we wished. Yet it was simple enough that a non-technical person could perform a fail-over if necessary.

Now, at the time we did this, using clusters would have meant spending a lot more on hardware as well as software. But this solution worked well enough that we're still using it today on the same application, albeit on slightly newer VAXen.
abrsvc
Respected Contributor

Re: "Standby" Node Management Advice

A setup I use requires a cluster license as well, but only one machine has the majority of the product licensing. I have a two-node cluster over Fibre Channel, with the second node present only to serve a set of disks for three-member shadow sets. The primary system boots off Root 0 (DSA0) and the second boots off Root 1 (a copy of DSA0 on a single drive). Since the second machine is a disaster-recovery machine, it does not run any applications, nor is it available to the users. If the main machine fails, human intervention is required because of instruments directly connected to the machine. Along with "moving" the instruments, the human at this point would halt the second machine and boot it as machine 1. Once machine 1 is "repaired", it is booted as machine 2 off Root 1 (after a copy of DSA0 to a blank disk). This preserves the single-machine boot from that drive and keeps the system up most of the time.
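
For illustration, one way to make that copy is to let the shadowing driver do the work; the unit names and volume label below are examples only, not the exact procedure.

$! Add a blank disk to the system-disk shadow set, let the copy
$! complete, then remove it as a bootable point-in-time copy of DSA0
$ MOUNT/SYSTEM DSA0: /SHADOW=($1$DGA10:) SYSDISK
$! ... wait for the shadow copy to finish (check with SHOW DEVICE DSA0) ...
$ DISMOUNT $1$DGA10:
$! $1$DGA10: now holds a copy of DSA0 that machine 2 can boot from root SYS1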

Dan
Jack Trachtman
Super Advisor

Re: "Standby" Node Management Advice

I would like to thank everyone for taking the time to provide lengthy responses.

Some quick answers:
- we will be running a third-party app, so we can't change the code
- the vendor does not support clusters
- the cost of a cluster license is not an issue; we're just looking for a quick way to recover from the failure of a critical system

That said:

You've all convinced me of the error of my design. Since we do have monthly reboots scheduled (how rare these days!), we will include a quarterly switch between nodes to test the standby node.