Three node OpenVMS cluster hanging issues

 
Ronald Russik
Occasional Advisor

Good afternoon all,

I have a three-node cluster, all running OpenVMS V7.3-2 on ES45s, clustered via the LAN. Each time I shut down (REM,REB) a node, the other nodes hang until that node is booted and the VAXcluster state transition completes. On each system: VOTES = 1, EXPECTED_VOTES = 3, DISK_QUORUM = $1$DGA110, VAXCLUSTER = 2. We need the quorum disk so that if two of the three nodes are down, the one remaining node can keep the cluster running.

Question: Why do the two remaining nodes hang until the third node is booted?

Thank you,
Ron Russik
Ron.Russik@yrcw.com
13 REPLIES
Jon Pinkley
Honored Contributor

Re: Three node OpenVMS cluster hanging issues

Expected votes shouldn't be 3 if you have three nodes with 1 vote each and a quorum disk with at least 1 vote.

Normally, if you want a quorum disk, you would give it (number of nodes - 1) votes, give each node one vote, and set expected votes to (num_nodes*2)-1.

In your 3 node case:

Each node 1 vote
Quorum disk 2 votes
Expected votes 5
which leaves quorum at 3.
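
The arithmetic behind these numbers can be checked with a short sketch. It assumes one vote per node and the usual OpenVMS rule that quorum is (EXPECTED_VOTES + 2) / 2, rounded down. The function names are illustrative only and are not part of any VMS tool.

```python
def quorum(expected_votes: int) -> int:
    """Quorum as OpenVMS derives it: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def quorum_disk_plan(num_nodes: int) -> dict:
    """Apply the rule of thumb above, assuming every node has VOTES = 1."""
    qdskvotes = num_nodes - 1            # quorum disk votes = nodes - 1
    expected_votes = num_nodes * 2 - 1   # node votes plus quorum disk votes
    return {
        "VOTES": 1,
        "QDSKVOTES": qdskvotes,
        "EXPECTED_VOTES": expected_votes,
        "quorum": quorum(expected_votes),
    }


if __name__ == "__main__":
    # Three-node case from this thread.
    print(quorum_disk_plan(3))
```

For three nodes this gives QDSKVOTES 2, EXPECTED_VOTES 5, and quorum 3, matching the values listed above.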

What do you have for quorum disk votes?

Jon
it depends
Phillip Thayer
Esteemed Contributor

Re: Three node OpenVMS cluster hanging issues

Ron,

You have your EXPECTED_VOTES set too high. Use the formula that Jon gave in his answer. Make sure you put it in as hardcoded values in MODPARAMS, do an AUTOGEN with a reboot and you should be o.k. The formula he gave will allow two of the nodes to be down and the third to stay up.

Phil@Vital
Once it's in production it's all bugs after that.
Jon Pinkley
Honored Contributor

Re: Three node OpenVMS cluster hanging issues

Ron,

Since VMS will protect itself as well as possible, the expected votes will ratchet up. So if you have a quorum disk with 1 vote, and each node has 1 vote, expected votes will be bumped to 4 as soon as the third node joins the cluster. That will make quorum = 3. As long as every node has a consistent set of cluster related sysgen parameters, and each node has a direct link to the quorum disk, the cluster should survive the unexpected loss of 1 node. There would be a temporary hang during cluster transition, but the remaining nodes should remove the member from the cluster and continue.
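
As a rough illustration of the ratchet described above, here is a simplified model. It assumes the cluster uses the larger of the configured EXPECTED_VOTES and the votes actually present, and that quorum is (expected + 2) / 2 rounded down. The real connection manager does more than this, and the helper names are hypothetical.

```python
def quorum(expected_votes: int) -> int:
    """(EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def effective_expected_votes(configured_expected: int, votes_present: int) -> int:
    """Simplified ratchet: never use a value lower than the votes actually seen."""
    return max(configured_expected, votes_present)


def survives_node_loss(configured_expected: int, node_votes: list, qdskvotes: int) -> bool:
    """True if the cluster keeps quorum after unexpectedly losing its largest voter."""
    votes_present = sum(node_votes) + qdskvotes
    q = quorum(effective_expected_votes(configured_expected, votes_present))
    remaining = votes_present - max(node_votes)
    return remaining >= q


if __name__ == "__main__":
    # EXPECTED_VOTES = 3 as first reported, three 1-vote nodes, 1-vote quorum disk.
    print(effective_expected_votes(3, 4))        # ratchets to 4
    print(survives_node_loss(3, [1, 1, 1], 1))   # 3 remaining votes against quorum 3
```

Under these assumptions the remaining two nodes plus the quorum disk hold 3 votes against a quorum of 3, so the cluster should continue after the transition.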

You stated that the loss of a node caused a permanent hang (with no indication about which node, so I will assume you meant any node). You also stated that this happens even when using a shutdown with the remove_node option, which should trigger the remaining nodes to adjust quorum based on the remaining votes.

Summary: Given the information you provided, you should not be seeing what you have reported. So there must be something unstated that is causing the behavior you are seeing.

Cut and paste the following into a file, for example cluster.debug:

$ create sys$scratch:show_cluster$init.debug
INITIALIZE
ADD CLUSTER/ALL
ADD TRANSITION_TIME
ADD QUORUM
ADD EXPECTED
ADD QDVOTES
ADD QF_ACTIVE
ADD QF_SAME
ADD QF_WATCHER
SET SCREEN = 132
$ define/user show_cluster$init sys$scratch:show_cluster$init.debug
$ show cluster
$ delete sys$scratch:show_cluster$init.debug;

Then do the following and show us the output.

$ @cluster.debug

Jon
it depends
Phillip Thayer
Esteemed Contributor

Re: Three node OpenVMS cluster hanging issues

Ron,

Also, what system is your quorum disk connected to? Is it served via MSCP to the other systems? Is it possible that the quorum disk is local to the system you're shutting down, so that the other two systems lose connectivity to the quorum disk and the cluster loses quorum?

Phil
Once it's in production it's all bugs after that.
Robert Gezelter
Honored Contributor

Re: Three node OpenVMS cluster hanging issues

Ron,

There are several different possibilities. Access to the quorum disk could be compromised, or different quorum disks could be identified by different nodes (I have seen both, as well as some other problems involving quorum disks that created symptoms similar to what is described in this post).

It is also possible to have incorrectly set voting parameters, or inconsistent voting parameters across the cluster.

The physical configuration can also be a problem. As has been noted, using a served quorum disk can be problematical if one or two machines have the ability to sever all access to the quorum disk.

More data (the settings of each machine with regards to voting and quorum disk access) would, needless to say, be helpful in better understanding this situation.

- Bob Gezelter, http://www.rlgsc.com
Hoff
Honored Contributor
Solution

Re: Three node OpenVMS cluster hanging issues

Set EXPECTED_VOTES to the number of votes present.

1+1+1+2QD=5, quorum=3.

Post the SHOW /CLUSTER parameters from each of the three nodes; the interesting ones here are:

VAXCLUSTER, EXPECTED_VOTES, VOTES, DISK_QUORUM and QDSKVOTES. Or the SYSMAN PARAM SHOW /CLUSTER output from each, if that's easier.

Ensure each of the three nodes can access $1$DGA110, and MOUNT the disk.

Setting EXPECTED_VOTES too low risks corruption of shared resources if the cluster partitions. Don't "game" the settings; set this value to the number of votes that should be present. OpenVMS will correct this setting once connections are established. Unfortunately, if two lobes cannot connect but both have quorum because EXPECTED_VOTES was "gamed" and set too low, clustering will do what you asked and your shared disks are toast.
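
The partitioning risk can be made concrete with a small sketch. It ignores the quorum disk for simplicity and assumes each lobe derives quorum from the larger of its configured EXPECTED_VOTES and the votes it can see. The helper name and the vote splits are hypothetical.

```python
def quorum(expected_votes: int) -> int:
    """(EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def lobes_with_quorum(expected_votes: int, lobe_votes: list) -> list:
    """For each isolated lobe (given as its total votes), report whether it reaches quorum."""
    result = []
    for votes in lobe_votes:
        # Each lobe only knows its own configured value and the votes it can see.
        q = quorum(max(expected_votes, votes))
        result.append(votes >= q)
    return result


if __name__ == "__main__":
    # Three 1-vote nodes split into a lobe of one node and a lobe of two nodes.
    print(lobes_with_quorum(1, [1, 2]))  # EXPECTED_VOTES "gamed" down to 1
    print(lobes_with_quorum(3, [1, 2]))  # EXPECTED_VOTES set honestly to 3
```

With EXPECTED_VOTES gamed to 1, both lobes believe they have quorum and both keep writing to shared disks. With the honest value of 3, only the two-node lobe continues.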

Ronald Russik
Occasional Advisor

Re: Three node OpenVMS cluster hanging issues

OHMS03_RRUSSIK mcr sysman set env/cluster
%SYSMAN-I-ENV, current command environment:
Clusterwide on local cluster
Username RRUSSIK will be used on nonlocal nodes

SYSMAN> do mcr sysgen show/cluster
%SYSMAN-I-OUTPUT, command execution on node OHMS03

Parameters in use: Active
Parameter Name        Current  Default     Min.        Max.  Unit        Dynamic
--------------        -------  -------  -------     -------  ----        -------
VAXCLUSTER                  2        1        0           2  Coded-valu
EXPECTED_VOTES              5        1        1         127  Votes
VOTES                       1        1        0         127  Votes
DISK_QUORUM      "$1$DGA110 "      " "      " "      "ZZZZ"  Ascii
QDSKVOTES                   1        1        0         127  Votes
QDSKINTERVAL                3        3        1       32767  Seconds
ALLOCLASS                   1        0        0         255  Pure-numbe
LOCKDIRWT                   1        0        0         255  Pure-numbe
CLUSTER_CREDITS            32       32       10         128  Credits
NISCS_CONV_BOOT             0        0        0           1  Boolean
NISCS_LOAD_PEA0             1        0        0           1  Boolean
NISCS_PORT_SERV             0        0        0           3  Bitmask
MSCP_LOAD                   1        0        0       16384  Coded-valu
TMSCP_LOAD                  0        0        0           3  Coded-valu
MSCP_SERVE_ALL              1        4        0          15  Bit-Encode
TMSCP_SERVE_ALL             0        0        0          15  Bit-Encode
MSCP_BUFFER              1024     1024      256          -1  Coded-valu
MSCP_CREDITS               32       32        2        1024  Coded-valu
TAPE_ALLOCLASS              0        0        0         255  Pure-numbe
SD_ALLOCLASS                0        0        0         255  Pure-numbe
NISCS_MAX_PKTSZ          8192     8192      576        9180  Bytes
NISCS_LAN_OVRHD             0        0        0         256  Bytes
SERVED_IO                   0        0        0           0  Obsolete
CWCREPRC_ENABLE             1        1        0           1  Bitmask     D
RECNXINTERVAL              20       20        1       32767  Seconds     D
MSCP_CMD_TMO                0        0        0  2147483647  Seconds     D
%SYSMAN-I-OUTPUT, command execution on node OHMS02

Parameters in use: Active
Parameter Name        Current  Default     Min.        Max.  Unit        Dynamic
--------------        -------  -------  -------     -------  ----        -------
VAXCLUSTER                  2        1        0           2  Coded-valu
EXPECTED_VOTES              5        1        1         127  Votes
VOTES                       1        1        0         127  Votes
DISK_QUORUM      "$1$DGA110 "      " "      " "      "ZZZZ"  Ascii
QDSKVOTES                   1        1        0         127  Votes
QDSKINTERVAL                3        3        1       32767  Seconds
ALLOCLASS                   2        0        0         255  Pure-numbe
LOCKDIRWT                   1        0        0         255  Pure-numbe
CLUSTER_CREDITS            32       32       10         128  Credits
NISCS_CONV_BOOT             0        0        0           1  Boolean
NISCS_LOAD_PEA0             1        0        0           1  Boolean
NISCS_PORT_SERV             0        0        0           3  Bitmask
MSCP_LOAD                   1        0        0       16384  Coded-valu
TMSCP_LOAD                  0        0        0           3  Coded-valu
MSCP_SERVE_ALL              1        4        0          15  Bit-Encode
TMSCP_SERVE_ALL             0        0        0          15  Bit-Encode
MSCP_BUFFER              1024     1024      256          -1  Coded-valu
MSCP_CREDITS               32       32        2        1024  Coded-valu
TAPE_ALLOCLASS              0        0        0         255  Pure-numbe
SD_ALLOCLASS                0        0        0         255  Pure-numbe
NISCS_MAX_PKTSZ          8192     8192      576        9180  Bytes
NISCS_LAN_OVRHD             0        0        0         256  Bytes
SERVED_IO                   0        0        0           0  Obsolete
CWCREPRC_ENABLE             1        1        0           1  Bitmask     D
RECNXINTERVAL              20       20        1       32767  Seconds     D
MSCP_CMD_TMO                0        0        0  2147483647  Seconds     D
%SYSMAN-I-OUTPUT, command execution on node OHMS01

Parameters in use: Active
Parameter Name        Current  Default     Min.        Max.  Unit        Dynamic
--------------        -------  -------  -------     -------  ----        -------
VAXCLUSTER                  2        1        0           2  Coded-valu
EXPECTED_VOTES              5        1        1         127  Votes
VOTES                       1        1        0         127  Votes
DISK_QUORUM      "$1$DGA110 "      " "      " "      "ZZZZ"  Ascii
QDSKVOTES                   1        1        0         127  Votes
QDSKINTERVAL                3        3        1       32767  Seconds
ALLOCLASS                   3        0        0         255  Pure-numbe
LOCKDIRWT                   1        0        0         255  Pure-numbe
CLUSTER_CREDITS            32       32       10         128  Credits
NISCS_CONV_BOOT             0        0        0           1  Boolean
NISCS_LOAD_PEA0             1        0        0           1  Boolean
NISCS_PORT_SERV             0        0        0           3  Bitmask
MSCP_LOAD                   1        0        0       16384  Coded-valu
TMSCP_LOAD                  0        0        0           3  Coded-valu
MSCP_SERVE_ALL              1        4        0          15  Bit-Encode
TMSCP_SERVE_ALL             0        0        0          15  Bit-Encode
MSCP_BUFFER              1024     1024      256          -1  Coded-valu
MSCP_CREDITS               32       32        2        1024  Coded-valu
TAPE_ALLOCLASS              0        0        0         255  Pure-numbe
SD_ALLOCLASS                0        0        0         255  Pure-numbe
NISCS_MAX_PKTSZ          8192     8192      576        9180  Bytes
NISCS_LAN_OVRHD             0        0        0         256  Bytes
SERVED_IO                   0        0        0           0  Obsolete
CWCREPRC_ENABLE             1        1        0           1  Bitmask     D
RECNXINTERVAL              20       20        1       32767  Seconds     D
MSCP_CMD_TMO                0        0        0  2147483647  Seconds     D
SYSMAN>
Hoff
Honored Contributor

Re: Three node OpenVMS cluster hanging issues

The way this box is configured is (as I think has been mentioned) incorrect around the quorum disk votes: QDSKVOTES is 1, so total votes are 4, while EXPECTED_VOTES of 5 puts quorum at 3. You likely want 2/5/3 here (QDSKVOTES 2, EXPECTED_VOTES 5, quorum 3).

I'll assume all three of these nodes have functional FC and all three have direct access to $1$DGA110:.

As for why this box is hanging awaiting the third node, that implies (dis)connectivity, and here probably around when the quorum disk is manifested to the newly-forming cluster. With EV=5 and no QD connection, you need all 3 nodes present.
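
The "all 3 nodes" claim follows directly from the posted parameters. A minimal sketch, assuming quorum is (EXPECTED_VOTES + 2) / 2 rounded down and using the values from the SYSGEN listing above (EXPECTED_VOTES 5, VOTES 1, QDSKVOTES 1); the function names are illustrative:

```python
def quorum(expected_votes: int) -> int:
    """(EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def has_quorum(nodes_up: int, quorum_disk_counted: bool,
               expected_votes: int = 5, node_votes: int = 1, qdskvotes: int = 1) -> bool:
    """Compare the votes currently contributing against quorum for the posted settings."""
    votes = nodes_up * node_votes
    if quorum_disk_counted:
        votes += qdskvotes
    return votes >= quorum(expected_votes)


if __name__ == "__main__":
    for nodes_up in (3, 2, 1):
        for qd in (True, False):
            print(nodes_up, "node(s) up, quorum disk counted:", qd, "->", has_quorum(nodes_up, qd))
```

With these settings two nodes keep quorum only while the quorum disk's vote is being counted, and a single node never reaches quorum even with the disk. That matches both the observed hang and the suggestion to raise QDSKVOTES to 2.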

For grins (and I'm guessing at several key aspects of this cluster configuration not yet in evidence), configure the system disk as the quorum disk. This assumes the quorum disk is another controller-based FC SAN DG disk, with or without controller-based RAID, and that the system disk here is common, FC SAN-based, and not host shadowed.

Do also load the current ECO kits; this as a boilerplate response to any weirdness. If you're not current when weirdness arises, get current first and then go hunting for the weirdness.

Robert Gezelter
Honored Contributor

Re: Three node OpenVMS cluster hanging issues

Ron,

I concur with Hoff. Since the goal is for a single node to be able to run as the cluster, the sum of that node's votes and the votes assigned to the quorum disk (QDSKVOTES) must reach quorum.
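
That condition can be turned into a small search for the lowest QDSKVOTES that lets one node survive alone. This is a sketch under stated assumptions: every node has the same VOTES, EXPECTED_VOTES is set to the true total of all votes, and quorum is (EXPECTED_VOTES + 2) / 2 rounded down. The function name is hypothetical.

```python
def quorum(expected_votes: int) -> int:
    """(EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def min_qdskvotes_for_lone_survivor(num_nodes: int, votes_per_node: int = 1) -> int:
    """Smallest QDSKVOTES for which one node plus the quorum disk reaches quorum."""
    for qdskvotes in range(0, 128):  # 127 is the QDSKVOTES maximum shown by SYSGEN
        expected_votes = num_nodes * votes_per_node + qdskvotes
        if votes_per_node + qdskvotes >= quorum(expected_votes):
            return qdskvotes
    raise ValueError("no QDSKVOTES value in range satisfies the condition")


if __name__ == "__main__":
    print(min_qdskvotes_for_lone_survivor(3))
```

For three 1-vote nodes the search returns 2, which agrees with the nodes-minus-one rule of thumb given earlier in the thread.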

- Bob Gezelter, http://www.rlgsc.com
Ronald Russik
Occasional Advisor

Re: Three node OpenVMS cluster hanging issues

Thank you all for your input. I'm on a timeline and need to get these three servers ready for UAT. It has been a long time since I've set up a cluster, and I'm digging deep in my memory. So, for a three-node cluster with a quorum disk:

expected_votes = 5
vaxcluster = 2
disk_quorum = "$1$DGA1112"
votes = 1
qdskvotes = 2
(anything else I might be missing?)

Thank you,
Ron Russik
Ron.Russik@yrcw.com
Jon Pinkley
Honored Contributor

Re: Three node OpenVMS cluster hanging issues

Make sure your sysgen modifications are reflected in your sys$system:modparams.dat file, or any agen$include_params files referenced by the modparams.dat file.

I like to create a file sys$common:[sysexe]agen_cluster_common_modparams.dat that has all the cluster parameters, like votes, expected votes, quorum disk votes, quorum disk name, etc. and then in sys$system:modparams.dat I put a line with

AGEN$INCLUDE_PARAMS SYS$COMMON:[SYSEXE]AGEN_CLUSTER_COMMON_MODPARAMS.DAT

Then I only need to change one file if the cluster values need to change. I have other common include files for site-specific settings, application-specific settings, etc. Each node's specific modparams.dat then has only the agen$include_params lines followed by a few items, like node name, SCSSYSTEMID, etc. After a system upgrade, you will need to clean up each node's MODPARAMS.DAT, as the upgrade usually appends values to it that will supersede anything in the include files, so the method isn't maintenance-free.

Jon
it depends
Ronald Russik
Occasional Advisor

Re: Three node OpenVMS cluster hanging issues

Good Afternoon All,

Thank you for all your help in this matter. Your solutions, recommendations, and advice were very helpful in assisting with this issue.

Again, Thank you for all your help.

Ron Russik
Ron.Russik@yrcw.com
Ronald Russik
Occasional Advisor

Re: Three node OpenVMS cluster hanging issues

Set the following SYSGEN parameters:

expected_votes = 5
vaxcluster = 2
disk_quorum = "$1$DGA1112"
votes = 1
qdskvotes = 2