"Standby" Node Management Advice
10-20-2010 08:11 AM
I will be configuring a stand-alone node, but with a dedicated, cold-standby backup system. Some quick info:
1) 2 ES45s
2) All disks (including system disks) will be on an HP SAN
3) All disks will be presented to both nodes
4) The BOOTDEF_DEV of each node will point to the same system disks
5) If the active node fails, the recovery procedure will be to confirm that it is "down", then boot the "cold" node.
6) We expect to do a monthly test of shutting down the active node and booting the "cold" node.
My request: what processes/procedures can I put in place to reduce (if not eliminate) the possibility that the "cold" node will be booted while the active node is running (which, I expect, would immediately corrupt the system disk, etc.)?
TIA
10-20-2010 08:32 AM
Solution
I will first make the somewhat obligatory observation that this is one of the most basic uses of OpenVMS clusters.
That said, I would not use the same system disk for both systems. I would also make sure that the procedures ensure that the "dead" CPU is permanently disabled; an accidental boot could be a serious problem.
There are few safeguards that one can construct without using OpenVMS clusters. Off the cuff, one could do something with multi-homed IP and PING, with each of the systems having a unique IP address and trying to reach the other node. If the other node responds, STOP. However, this is a bit of a hack, and it only guards against startup while the other node is running. If network connectivity is lost but the SAN remains operational, one is back where one started from.
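As a rough illustration of that PING check (my sketch, not part of the reply), a minimal DCL fragment could run early in the site-specific startup. The peer address 10.1.1.2, the placement in startup, and the action taken on success are all assumptions:

    $! Minimal sketch of the PING interlock described above.
    $! Assumes TCP/IP Services is already started, that 10.1.1.2 is the
    $! other node's address (placeholder), and that PING returns a
    $! failure status when no reply is received.
    $ PEER = "10.1.1.2"
    $ SET NOON
    $ TCPIP PING 'PEER' /NUMBER_PACKETS=3
    $ PEER_ALIVE = $STATUS
    $ SET ON
    $ IF PEER_ALIVE
    $ THEN
    $   WRITE SYS$OUTPUT "*** ''PEER' answered - other node appears to be up; stopping here ***"
    $   EXIT    ! "STOP" however the site prefers: skip application startup, request a shutdown, etc.
    $ ENDIF
    $ WRITE SYS$OUTPUT "No answer from ''PEER' - continuing startup"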
On the OpenVMS cluster front, there was a presentation at the 2007/2008/2009 HP Techforum on a simple little package, for use with OpenVMS clusters, that would allow one to easily migrate single node applications to the other node in the event of a failure. The intent was to provide a way to provision non-cluster aware applications in a cluster.
- Bob Gezelter, http://www.rlgsc.com
10-20-2010 09:21 AM
Re: "Standby" Node Management Advice
10-20-2010 09:34 AM
Re: "Standby" Node Management Advice
That said, I would configure a unique system disk for each ES45. The failover server boots but mounts no secondary disks. Document your failover procedure to change the boot device on BOTH systems when placing the failover server into production (see the console sketch below). Having the standby server running will let you know if it suffers a hardware problem; the added risk is that the secondary disks could end up mounted on both systems.
Make sure the risks are documented and management understands the trade-off. Short-term savings now can lead to downtime and lost-data costs in the future.
You can do this; is it a good idea? "It depends."
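For illustration only (my sketch, not the poster's procedure), the boot-device change at the SRM console of each ES45 might look roughly like this; the dga100... device name is purely a placeholder for whatever the SAN system disk appears as on your machines:

    >>> SHOW BOOTDEF_DEV
    >>> SET BOOTDEF_DEV dga100.1001.0.1.1
    >>> SET BOOT_OSFLAGS 0,0
    >>> SHOW BOOTDEF_DEV

On Fibre Channel-booted Alphas the dga units generally have to be set up with wwidmgr first, and the unit numbers differ per system, so treat the value shown as illustrative only.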
10-20-2010 10:24 AM
Re: "Standby" Node Management Advice
I agree with this.
I presume that the two nodes will also use the same value for BOOT_OSFLAGS as well as the same BOOTDEF_DEV?
If so,
and you configure this system as a cluster member,
and these systems share a common network that will pass SCS traffic,
then the VMS clustering software would force a second node to CLUEXIT bugcheck very early on in the boot process as it would be attempting to boot with the same SCSNODE and SCSSYSTEMID (regardless of votes, etc).
A cluster can't have two nodes with the same identity and won't let it happen...
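For reference (my illustration, not part of the reply): SCSNODE and SCSSYSTEMID are normally set in SYS$SYSTEM:MODPARAMS.DAT and applied by AUTOGEN, along the lines of the fragment below. The node name and ID are made-up placeholders, and the votes/quorum choices are a separate design decision:

    ! Hypothetical MODPARAMS.DAT fragment for one cluster member
    ! (placeholder values; each node must have its own unique SCSNODE/SCSSYSTEMID)
    SCSNODE = "NODEA"
    SCSSYSTEMID = 19266
    VAXCLUSTER = 2          ! always participate in a cluster
    ! VOTES, EXPECTED_VOTES and any quorum disk depend on the configuration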
10-20-2010 10:30 AM
Re: "Standby" Node Management Advice
If that's the case, that implies you might also want to look at what a different operating system might offer you here. Or at re-working the application to work around the lack of a cluster PAK, though that duplication of what clustering provides is going to cost some money. (On the other hand, a good cluster design might also mean your code can be more platform-portable, too.)
Easiest? Install an operating system platform that's suited for this particular design. That's not VMS. Sure, you can hack VMS to work here, particularly with the mess that is host names and - if the host and peripheral devices aren't identical - the hardware configuration. And the risk of collisions. But you'll be fighting the OS. VMS really wants cluster-aware designs, and wants to run both boxes in parallel, and wants to use both of the boxes rather than letting one sit cold.
What do you need to look at to force-fit VMS into this design, from the OS-level? Careful management of host-specific device names. At maintenance of the NIC addresses, and the rest of the networking set-up. And a decision of whether you're eventually going to move to clustering, or stay with this scheme. And interlocks to ensure both boxes aren't booted in parallel (and that the applications themselves aren't operating in parallel), as others have mentioned.
Another available approach here can be a way to do a fast install or a fast recovery scheme from your own archives (and using InfoServer or such), and involving your own system disk images and maintenance of your code and your data. This requires some maintenance of the existing system and application environment; eliminating the ad-hoc operations of the platform from the environment may or may not be feasible here.
And FWIW, if you have solved the sequential-boot cold-standby scheme that you're aiming for, you've largely also implemented the fast load scheme, too.
Most of what's here is an application programming question, really. Designing a cold-start, single-host, non-server-parallel application. And either ignoring, or explicitly bypassing, most of what VMS provides here. Probably the best way to do that is to ignore as much of VMS as you can. (Which also then leads you to look at whether you even need or want VMS underneath there, too.)
10-20-2010 11:06 AM
Re: "Standby" Node Management Advice
10-20-2010 12:07 PM
Re: "Standby" Node Management Advice
Really, if this is what is needed, your safest (and in the end probably cheapest) solution is a cluster license. Set up each node to boot from the same root of the same device.
You will essentially be running a one-node cluster, but most importantly, in one single pass you will be GUARANTEED never to run both nodes simultaneously (as already mentioned, the second node would CLUEXIT before damage can be done). That includes safeguarding against ANY potential erroneous OR MALICIOUS
or plain stupid wrong actions.
Consider the cost of trying to think of, write, and implement something that TRIES to reach something similar, and you will find the license is actually the cheapest option.
just my EUR 0.02...
Proost.
Have one on me.
jpe
10-20-2010 12:27 PM
Re: "Standby" Node Management Advice
Your basic assumption here seems to be that any system failure will be of the ES45 box, rather than (say) the disks, SAN, power, aircon, network, etc... You may want to look at other failure scenarios to make sure you're not putting all your recovery eggs in just one (unlikely?) basket.
Without clustering you have no way to detect another node booting from the same disk, and no protection against corruption.
Perhaps you should present the disks to only ONE node at a time? Part of your recovery procedure would then be to switch the presentation at the SAN level. If you can't see the disk, you can't boot from it, or corrupt it.
10-20-2010 05:54 PM
Re: "Standby" Node Management Advice
"2) All disks (including system disks) will be on an HP SAN"
Was that intentional or a typo? Do you plan to have some sort of CA (Continuous Access) going on to another site? If you bite the bullet on Cluster licenses, while you're begging you should ask for Shadowing licenses and make that bit of the DR recovery "easier" too.
Cheers,
Art
10-20-2010 07:53 PM
Re: "Standby" Node Management Advice
bob
10-21-2010 04:43 AM
Re: "Standby" Node Management Advice
We had a similar situation when we were testing new replacement servers with SAN-attached storage.
We would only present the storage to the server we wanted to boot on the storage array (EVA and EMC arrays).
When we wanted to boot the new server for testing, a script would be run on the storage arrays to unpresent the disks from one server and present them to the other server.
Of course, this doesn't stop someone from executing the script to move the disks between the servers when it's not intended, but that would be an operational process that would need to be implemented. We were in a test environment though. We also made sure we had valid backups available.
It's a quick and dirty way if clustering isn't an option - just an extra step of performing storage masking on the array when both servers are down.
HTH,
Bob
10-21-2010 07:23 AM
Re: "Standby" Node Management Advice
I was thinking more along the lines of the way RAxx disk drives worked. They had two ports, but could only be connected on one port at a time. Once system A connected to the drive on Port A, then system B's attempts to connect on Port B would not succeed. This worked well for us when we had an active system and a warm standby with data disks to be connected to whichever system was active.
But I don't know if a modern SAN can be made to work this way.
10-21-2010 07:35 AM
Re: "Standby" Node Management Advice
We attached a serial T-switch to both nodes' serial ports, and put a serial loop-back connector (free from DEC with every computer!) on the common connector. Both nodes were on all the time ("warm" instead of "cold"). When the nodes booted, they transmitted a string to the T-switch; the one that got the string echoed back was the "primary". The nodes also connected to one another task-to-task to ensure we didn't have two "primaries". The other node was the "secondary". The boot process then defined a logical name that contained the node's status, and that logical was used to determine which site-specific start-up files were run, and also to alert users if they happened to log into the wrong machine.
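A rough sketch of the role logic that paragraph describes (my illustration, not the poster's actual code; the helper procedure and the logical name are hypothetical placeholders):

    $! Sketch: pick PRIMARY/SECONDARY from the serial loop-back test and
    $! record it in a system logical for the rest of startup to use.
    $! CHECK_LOOPBACK.COM is a hypothetical helper that transmits the probe
    $! string on the serial port and sets ECHO_SEEN to 1 if it came back.
    $ @SYS$MANAGER:CHECK_LOOPBACK.COM
    $ IF ECHO_SEEN
    $ THEN
    $   DEFINE/SYSTEM NODE_ROLE "PRIMARY"
    $ ELSE
    $   DEFINE/SYSTEM NODE_ROLE "SECONDARY"
    $ ENDIF
    $! Later site-specific startup (and SYS$SYLOGIN, for warning users on
    $! the wrong machine) can test F$TRNLNM("NODE_ROLE") to decide what to run.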
This approach allowed us to automatically back up application status to the secondary, and also to fail over without rebooting if we wished. But it was simple enough for a non-technical person to perform a fail-over if necessary.
Now, at the time we did this, using clusters would have meant spending a lot more on hardware as well as software. But this solution worked well enough that we're still using it today on the same application, albeit on slightly newer VAXen.
10-21-2010 10:09 AM
Re: "Standby" Node Management Advice
Dan
10-25-2010 08:35 AM
Re: "Standby" Node Management Advice
Some quick answers:
- will be running third party app so can't change code
- vendor does not support clusters
- cost of cluster license not an issue - just looking for quick way to recover from failure of a critical system
That said:
You've all convinced me of the error of my design. Since we do have monthly boots scheduled (how rare these days!), we will include a quarterly switch between nodes to test the standby node.