
Booting two nodes with same root system disk....

 
Robert Gezelter
Honored Contributor

Re: Booting two nodes with same root system disk....

macero,

I must completely concur with Hoff. This is an engraved invitation to a nightmare.

The cost of the cluster license is FAR, FAR less than what it can cost to recover if both machines happen to be booted at the same time.

- Bob Gezelter, http://www.rlgsc.com
Martin Hughes
Regular Advisor

Re: Booting two nodes with same root system disk....

The first thing I would do is set your auto_action to HALT on your standby node. And perhaps go one step further and set your boot_flags to something spurious, such as z,0. You can change these values back when you need to boot the standby node in a controlled situation.
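
On the SRM console of the standby ES45 that looks roughly like the following (a sketch only; "z,0" is just one example of a deliberately invalid flags value, and the SHOW commands merely confirm the settings took effect):

   >>> SET AUTO_ACTION HALT
   >>> SET BOOT_OSFLAGS z,0
   >>> SHOW AUTO_ACTION
   >>> SHOW BOOT_OSFLAGS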

Then I would look at the configurations suggested earlier involving a quorum disk or 3rd node. If your application is important enough to warrant the purchase of a spare ES45 then I would assume that it is worthy of a proper cluster configuration.

Just curious: when you have a "cold" standby node that sits at the chevron for 12 months, how do you know it will work when you need it?
For the fashion of Minas Tirith was such that it was built on seven levels, each delved into a hill, and about each was set a wall, and in each wall was a gate. (J.R.R. Tolkien). Quote stolen from VAX/VMS IDSM 5.2
JBR
Frequent Advisor

Re: Booting two nodes with same root system disk....

Thank you to everybody for your answers. I've understood the importance of setting the votes correctly.

The final configuration will be:

First ES45 Active (votes=1, os_flags 0,0)
Second ES45 Active (votes=1, os_flags 1,0)
Quorum disk, qdskvotes=1
Expected Votes=3
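
In MODPARAMS.DAT terms (a sketch only -- $1$DGA100: is a placeholder for the real quorum disk name, and AUTOGEN must be run after editing), that layout corresponds to something like:

   ! SYS$SYSTEM:MODPARAMS.DAT, same values in both roots
   VOTES = 1                      ! each active ES45 contributes one vote
   EXPECTED_VOTES = 3             ! 1 + 1 + quorum disk
   DISK_QUORUM = "$1$DGA100:"     ! placeholder quorum disk name
   QDSKVOTES = 1                  ! votes contributed by the quorum disk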

When I need to fail over: shut down the first node, shut down the second node, boot the second ES45 from sys0, and boot the first from sys1.
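
At the consoles that would be roughly the following (DKA100 is only a placeholder for the system disk device name as each console sees it):

   Second ES45:  >>> BOOT -FLAGS 0,0 DKA100
   First ES45:   >>> BOOT -FLAGS 1,0 DKA100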

But if, by mistake, I boot both ES45s simultaneously from root sys0, will one of them still crash with a CLUEXIT bugcheck? I suppose the answer is YES.



Robert Gezelter
Honored Contributor

Re: Booting two nodes with same root system disk....

Macero,

With all due respect, I do not agree with the proposed procedure for recovering in the event of a failure of the primary system. Shutting down each node and rebooting the alternate node from the primary's system root means a restart time on the order of minutes, at best.

OpenVMS clusters were designed so that functions can be failed over to another cluster member without restarting that member. This can be initiated automatically, using a program or a background batch job, or it can be done manually using a command procedure tailored to the particular application. It is important to realize that "failure" is not an all-or-nothing proposition: it is quite possible to move one group or one application from one node in a cluster to another without shifting all of the load, or, in a cluster with more than two active members, to redistribute the load among the remaining members.
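
As a sketch only, the manual form can be as small as a DCL procedure run, or submitted as a batch job, on the surviving node (START_APP.COM here is a hypothetical stand-in for whatever startup procedure the application actually uses):

   $! FAILOVER_APP.COM -- minimal sketch of a manual failover step,
   $! run or submitted on the surviving node.
   $ node = F$GETSYI("NODENAME")
   $ WRITE SYS$OUTPUT "Bringing the application up on ''node'"
   $ @SYS$MANAGER:START_APP.COM    ! hypothetical application startup procedure
   $ EXIT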

My recommendation would be to carefully look at the application and how it interacts. In working with clients to architect, design, and implement OpenVMS clusters since they were first announced in 1982, I have almost always found solutions that did not require the re-boot of the entire cluster.

I have also found that solutions requiring manual intervention tend to be highly prone to operational errors, and are best avoided as much as possible.

I hope that the above is helpful. If I have been unclear, please feel free to follow up in this forum or privately.

- Bob Gezelter, http://www.rlgsc.com
JBR
Frequent Advisor

Re: Booting two nodes with same root system disk....

Hi Bob, thanks for your quick reply, but the application can only run on one node because of a restriction tied to the "DECnet MAC address" (BaseStar + PLCs + PL/I, etc.) and, at the moment, because of the application design, it is impossible to use a DECnet cluster alias.
I can accept a loss of service of some minutes, but I need to ensure the integrity of the data. (In the past, with NO CLUSTER, one node active and the other standing by at "P00>>", both nodes were booted by mistake from the same system disk simultaneously = data corruption.)

My question is simple: with an OpenVMS cluster, can I be sure that booting two nodes simultaneously from the same system disk and the same root (sys0) will produce a CLUEXIT bugcheck on one node and no data corruption?

Thank you

Best Regards.

Robert Gezelter
Honored Contributor

Re: Booting two nodes with same root system disk....

Macero,

With all due respect, I would want to personally verify that dependency on the DECnet address. I have been told of many such dependencies in the past, and have found that most were illusory. I have also dealt with many applications that supposedly "cannot be run in a cluster"; investigation revealed that the actual restriction was "can only be run on one cluster node at a time", which I was able to implement without a problem, thus providing fast failover without human intervention.

I do not have a cluster handy at this instant that I can use to verify what you ask.

- Bob Gezelter, http://www.rlgsc.com
Hoff
Honored Contributor

Re: Booting two nodes with same root system disk....

DECnet MAC address? Do research it. That could easily be a misunderstanding about having two NICs active using DECnet Phase IV on the same LAN, and there are easy and effective ways to deal with that.

And your failover scheme is likely to be more difficult than it looks, as there tend to be small differences -- NIC hardware addresses, et al -- that can make booting a different node from an existing and established root problematic.

If I had to swap DECnet addresses, I'd set up a mechanism specifically to swap the DECnet address. That can be done without rebooting, too. Rebooting nodes that share roots is not a solution I would typically recommend -- it is an inherently risky approach, in my opinion.
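
For DECnet Phase IV, that swap boils down to restarting DECnet with the other node's executor address, since the AA-00-04-00-xx-xx MAC address is derived from that address. A sketch only (1.10 is just a placeholder address):

   $ MCR NCP
   NCP> SET EXECUTOR STATE OFF
   NCP> DEFINE EXECUTOR ADDRESS 1.10
   NCP> EXIT
   $ @SYS$MANAGER:STARTNET.COM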

I'd also encourage a more general configuration review here, too. There can be other opportunities to improve uptime, and reduce overhead.

Stephen Hoffman
HoffmanLabs LLC