Operating System - OpenVMS

Re: cluster node hangs when another node shutdown

 
albert000
Advisor

cluster node hangs when another node shutdown

Dear:

We have met another problem when configuring OpenVMS clusters on OpenVMS 8.4 Update 1000 (IA64 architecture). I have configured a two-node OpenVMS cluster as follows: HWNOD1 uses a SAN storage disk as its system disk, and HWNOD2 uses a local disk as its system disk. $1$DGA1 is the system disk from the SAN storage; $1$DGA3 is the quorum disk from the SAN storage.

However, each time I shut down one cluster node, the other node stops responding to any command. Every command I enter hangs until the node that was shut down boots up again.

Could you tell me how to solve this problem? Looking forward to your reply.

BR
TONG
Bob Blunt
Respected Contributor

Re: cluster node hangs when another node shutdown

You'd get much more valuable responses from the forums related to OpenVMS System Management.  However... you need to be more complete in your description of your cluster.  While you've set up two nodes to use individual system disks, that alone will NOT make your system more redundant or resilient.  A cluster of OpenVMS systems is not like (m)any other clustering technologies, and one of the core concepts you need to research and learn is that of "*QUORUM*", which, put simply, is a voting scheme that determines whether you have enough votes.  Votes are assigned to the nodes and, in some cases, to a disk that is (a) expected to be present and (b) generally shared between the two systems in the cluster.
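As a quick sanity check of that voting math on a live system, something like this should show how a node is counting votes right now (a sketch; I believe these F$GETSYI item codes exist on V8.4, but verify them with DCL HELP first):

$ ! Display the live quorum arithmetic on this node
$ WRITE SYS$OUTPUT "Expected votes: ", F$GETSYI("CLUSTER_EVOTES")
$ WRITE SYS$OUTPUT "Current votes:  ", F$GETSYI("CLUSTER_VOTES")
$ WRITE SYS$OUTPUT "Quorum:         ", F$GETSYI("CLUSTER_QUORUM")

If the current votes ever drop below quorum, the remaining node suspends activity by design until quorum is regained; that is exactly the "hang" you are describing.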

Sharing the output from the SHOW CLUSTER utility doesn't present itself well in these forums.  I would recommend, instead, providing the output from the SYSMAN utility (from the command prompt: $ MC SYSMAN, which requires a privileged account):

SYSMAN> PARAM SHOW /CLUSTER

SYSMAN> PARAM SHOW /SCS

It would also be beneficial to look into the documentation with a focus on OpenVMS Cluster configurations and OpenVMS system management.

Please understand that there are features and configuration decisions that a general forum can't make FOR you.  You and your company or organization need to know how these systems work, what their strengths are, and how their setup and configuration can best support your collective requirements.  While "we" could help you with what appear to be simple questions like the one you're asking above, they're really NOT simple, and the cluster must be properly set up for the configuration you need and for how you expect it to act.

I would recommend working with local HP resources if you need more in-depth guidance and/or training, frankly.

bob

albert000
Advisor

Re: cluster node hangs when another node shutdown

Dear Bob:

   The detailed information of the cluster is as follows:

SYSMAN> PARAM SHOW/CLUSTER

%SYSMAN-I-USEACTNOD, a USE ACTIVE has been defaulted on node HWNOD1
Node HWNOD1: Parameters in use: ACTIVE
Parameter Name    Current     Default  Minimum  Maximum     Unit         Dynamic
--------------    -------     -------  -------  -------     ----         -------
VAXCLUSTER        2           1        0        2           Coded-value
EXPECTED_VOTES    2           1        1        127         Votes
VOTES             1           1        0        127         Votes
DISK_QUORUM       "$1$DGA3 "  " "      " "      "ZZZZ"      Ascii
QDSKVOTES         1           1        0        127         Votes
QDSKINTERVAL      3           3        1        32767       Seconds
ALLOCLASS         5           0        0        255         Pure-number
LOCKDIRWT         1           0        0        255         Pure-number
CLUSTER_CREDITS   32          32       10       128         Credits
NISCS_CONV_BOOT   0           0        0        1           Boolean
NISCS_LOAD_PEA0   1           0        0        1           Boolean
NISCS_USE_LAN     1           1        0        1           Boolean
NISCS_USE_UDP     1           0        0        1           Boolean
MSCP_LOAD         1           0        0        16384       Coded-value
TMSCP_LOAD        0           0        0        3           Coded-value
MSCP_SERVE_ALL    1           4        0        -1          Bit-Encoded
TMSCP_SERVE_ALL   0           0        0        -1          Bit-Encoded
MSCP_BUFFER       1024        1024     256      -1          Coded-value
MSCP_CREDITS      32          32       2        1024        Coded-value
TAPE_ALLOCLASS    0           0        0        255         Pure-number
NISCS_MAX_PKTSZ   8192        8192     576      9180        Bytes
CWCREPRC_ENABLE   1           1        0        1           Bitmask      D
RECNXINTERVAL     20          20       1        32767       Seconds      D
NISCS_PORT_SERV   0           0        0        256         Bitmask      D
NISCS_UDP_PORT    0           0        0        65535       Pure-number  D
NISCS_UDP_PKTSZ   8192        8192     576      9000        Bytes
MSCP_CMD_TMO      0           0        0        2147483647  Seconds      D
LOCKRMWT          5           5        0        10          Pure-number  D

SYSMAN>
SYSMAN> PARAM SHOW/SCS
Node HWNOD1: Parameters in use: ACTIVE
Parameter Name    Current     Default  Minimum  Maximum  Unit         Dynamic
--------------    -------     -------  -------  -------  ----         -------
SCSBUFFCNT        512         50       0        32767    Entries
SCSRESPCNT        1000        1000     0        32767    Entries
SCSMAXDG          576         576      28       985      Bytes
SCSMAXMSG         216         216      60       985      Bytes
SCSSYSTEMID       1025        0        0        -1       Pure-number
SCSSYSTEMIDH      0           0        0        -1       Pure-number
SCSNODE           "HWNOD1 "   " "      " "      "ZZZZ"   Ascii
PASTDGBUF         16          4        1        16       Buffers
SMCI_PORTS        1           1        0        -1       Bitmask
TIMVCFAIL         1600        1600     100      65535    10Ms         D
SCSFLOWCUSH       1           1        0        16       Credits      D
PRCPOLINTERVAL    30          30       1        32767    Seconds      D
PASTIMOUT         5           5        1        99       Seconds      D
PANUMPOLL         16          16       1        223      Ports        D
PAMAXPORT         32          32       0        223      Port-number  D
PAPOLLINTERVAL    5           5        1        32767    Seconds      D
PAPOOLINTERVAL    15          15       1        32767    Seconds      D
PASANITY          1           1        0        1        Boolean      D
PANOPOLL          0           0        0        1        Boolean      D
SMCI_FLAGS        0           0        0        -1       Bitmask      D

SYSMAN>
SYSMAN> EXIT
$ SH DEV D
Device                  Device           Error    Volume          Free   Trans  Mnt
 Name                   Status           Count     Label         Blocks  Count  Cnt
$1$DGA1:     (HWNOD1)   Mounted              0   HWNOD1       153010864    359    1
$1$DGA2:     (HWNOD1)   Online               0
$1$DGA3:     (HWNOD1)   Online               0
$1$DGA8:     (HWNOD1)   Online               0
$1$DGA9:     (HWNOD1)   Online               0
$1$DGA23:    (HWNOD1)   Online               0
$5$DKA100:   (HWNOD1)   Online               0
$5$DKA200:   (HWNOD1)   Mounted              0   (remote mount)                  1
$5$DNA0:     (HWNOD1)   Offline              0
$5$DNA1:     (HWNOD1)   Online wrtlck        0

$

$

     Looking forward to your reply.

    Thanks very much.

 

BR

TONG

Steven Schweda
Honored Contributor

Re: cluster node hangs when another node shutdown

> SYSMAN> PARAM SHOW/CLUSTER
> [...]
> Node HWNOD1: [...]

   Ok.  And what do you see on the other node?

   Also, which node hangs when you shut down which node?

albert000
Advisor

Re: cluster node hangs when another node shutdown

Dear:

     The other node has exactly the same output as HWNOD1, except that its SCSNODE is HWNOD2 instead of HWNOD1.

     I have done two tests. When HWNOD1 shuts down, HWNOD2 hangs; when HWNOD2 shuts down, HWNOD1 hangs too.

     Now this cluster has EXPECTED_VOTES set to 2, and it has one quorum disk.

     Is that correct?

     I found that the cluster can't add more quorum disks: when I used the command "@sys$manager:cluster_config" to enable another quorum disk, the quorum disk set before disappeared.

 

BR

TONG

Steven Schweda
Honored Contributor

Re: cluster node hangs when another node shutdown

> The other node has exactly the same output as HWNOD1, except that its
> SCSNODE is HWNOD2 instead of HWNOD1.

   I'd prefer to see the actual output, and make my own comparison.

> Now this cluster has EXPECTED_VOTES set to 2, and it has one quorum
> disk.

   If each of the two nodes has one vote, and the quorum disk has one
vote, then I'd expect EXPECTED_VOTES to be three.  The quorum would be
two, so either node plus the quorum disk would, together, have two
votes, which would satisfy the quorum requirement.  As Mike Kier said in
your 6898169 thread:

> Your system should have EXPECTED_VOTES = 3 and a QUORUM of 2 with each
> node having a VOTE of 1, unless there is some compelling reason
> otherwise.
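
   To make the arithmetic explicit, the quorum formula from the manual
(with integer division) gives:

\[
\textrm{quorum} = \left\lfloor \frac{\textrm{EXPECTED\_VOTES} + 2}{2} \right\rfloor
                = \left\lfloor \frac{3 + 2}{2} \right\rfloor = 2
\]

so after one node leaves, the surviving node (1 vote) plus the quorum disk
(1 vote) still holds 2 votes, which meets quorum.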

> I found that the cluster can't add more quorum disks: [...]

   Having more than one quorum disk would cause more trouble than it
would solve.  The "OpenVMS Cluster Systems" manual explains quorums and
"a" (or "the") quorum disk:

      Rules: Each OpenVMS Cluster system can include only one quorum
      disk. [...]

Also:

      o  To permit recovery from failure conditions, the quorum disk
      must be mounted by all disk watchers.

      o The OpenVMS Cluster can include only one quorum disk.

   Are you mounting the quorum disk on each of the cluster member systems?

   Have you looked at the "OpenVMS Cluster Systems" manual?
      http://h20565.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04623183
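
If you are not, here is a minimal sketch of the usual shape (the QUORUM volume label is only an example, and INITIALIZE erases the disk, so it is done once, before first use; see the manual above for the authoritative procedure):

$ ! One time only, on one node -- INITIALIZE destroys existing data:
$ INITIALIZE $1$DGA3: QUORUM       ! "QUORUM" is an example label
$ ! Then, on every node that watches the quorum disk:
$ MOUNT/SYSTEM $1$DGA3: QUORUM
$ ! QUORUM.DAT is created automatically once a quorum-disk watcher
$ ! boots with DISK_QUORUM and QDSKVOTES set.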

albert000
Advisor

Re: cluster node hangs when another node shutdown

Dear:

     I have checked the cluster documentation released by OpenVMS and HDS, and found that the quorum disk is only used in a two-node cluster, and that only one quorum disk is recommended.

     I also find that EXPECTED_VOTES and the cluster quorum value have the following relation:

     estimated quorum = (EXPECTED_VOTES + 2) / 2    (integer division)

     I have tried to add a quorum disk, and it fails.

     Since the quorum disk can't be changed, I have tried to change EXPECTED_VOTES to 1. However, when the system rebooted, it output an error, and EXPECTED_VOTES was changed back automatically.

   What can I do in this case?

   Looking forward to your reply.

 

BR

TONG 

    

Steven Schweda
Honored Contributor

Re: cluster node hangs when another node shutdown

> I have tried to add a quorum disk, and it fails.

   Eh?  As usual, showing actual commands with their actual output can
be more helpful than vague descriptions or interpretations.  (Do you
mean that you couldn't add a _second_ quorum disk?  That restriction is
documented.  You can't do that.)

> Since the quorum disk can't be changed,

   Why do you want to change it?  What is it now?  To what would you
like to change it?

> I have tried to change EXPECTED_VOTES to 1.

   There's little sense in setting EXPECTED_VOTES to some unrealistic
value.

> However, when the system rebooted, it output an error,

   Should we guess what that error message was, or are you willing to
tell us?

> and EXPECTED_VOTES was changed back automatically.

   That's why there's little sense in setting EXPECTED_VOTES to some
unrealistic value.  The cluster software can (and does) count the VOTES
of the cluster members when they join the cluster.  Trying to fool it
with an unrealistic EXPECTED_VOTES value is a waste of time and effort.
Why are you trying to set it to 1 (when it should be 3)?

> What can I do in this case?

   I don't know what "this case" is.  As before, I'd like to see what
the following parameters are for each of the two nodes (one way to collect
them in a single pass is sketched after this list):

VAXCLUSTER
EXPECTED_VOTES  (And, if it's not 3, why not?)
VOTES           (And, if it's not 1, why not?)
DISK_QUORUM     (And, if it's not the same on both nodes, why not?)
QDSKVOTES       (And, if it's not 1, why not?)
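
A sketch of that single pass, assuming a privileged account on a running
cluster member:

$ MC SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> PARAM USE ACTIVE
SYSMAN> PARAM SHOW EXPECTED_VOTES
SYSMAN> PARAM SHOW VOTES
SYSMAN> PARAM SHOW DISK_QUORUM
SYSMAN> PARAM SHOW QDSKVOTES
SYSMAN> EXIT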

> Are you mounting the quorum disk on each of the cluster member
> systems?

   Still wondering.

albert000
Advisor

Re: cluster node hangs when another node shutdown

Dear:

     I used the following commands to change EXPECTED_VOTES to 3:

$ RUN SYS$SYSTEM:SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> PARAM USE CURRENT
SYSMAN> PARAM SET EXPECTED_VOTES 3
SYSMAN> PARAM WRITE CURRENT
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> DO @sys$UPDATE:AUTOGEN GETDATA SETPARAMS
SYSMAN> EXIT

I found the method at the following address:

http://h30266.www3.hp.com/odl/i64os/opsys/vmsos84/4477/4477pro_020.html#post_config

Then I restarted the cluster. The node that uses a SAN storage disk as its system disk always hits the following problem:

**** OpenVMS I64 Operating System V8.4 - BUGCHECK ****

** Bugcheck code = 000001CC: INVEXCEPTN, Exception while above ASTDEL
** Crash CPU: 00000000    Primary CPU: 00000000    Node Name: HWNOD1
** Highest CPU number: 00000007
** Active CPUs: 00000000.000000FF
** Current Process: NULL
** Current PSB ID: 00000001
** Image Name:

Is the method I used to change EXPECTED_VOTES right?

 

BR

TONG

Volker Halle
Honored Contributor

Re: cluster node hangs when another node shutdown

Tong,

you've probably overwritten the desired value of EXPECTED_VOTES by running AUTOGEN again.
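
The usual way to make the value stick is to put it in MODPARAMS.DAT, which AUTOGEN reads on every run, instead of relying on PARAM SET alone. Roughly, on each node (a sketch; the file and path are the standard ones, but verify for your configuration):

$ ! Append the desired value to AUTOGEN's input file:
$ OPEN/APPEND MODP SYS$SYSTEM:MODPARAMS.DAT
$ WRITE MODP "EXPECTED_VOTES = 3"
$ CLOSE MODP
$ ! Then let AUTOGEN recalculate and write the parameters:
$ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS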

The INVEXCEPTN crash may have NOTHING to do with clustering at all. Can you boot HWNOD1 from SAN storage ($1$DGA1) as the only node in the cluster? Did a quorum file get generated at $1$DGA3:[000000]QUORUM.DAT, i.e. did you ever MOUNT the quorum disk?

Check with $ SHOW CLUSTER/CONT, then type ADD CLUSTER. What's shown in the QF_VOTE column?
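
Roughly (a sketch; in the continuous display you type at the Command> prompt):

$ SHOW CLUSTER/CONTINUOUS
Command> ADD CLUSTER
Command> EXIT

ADD CLUSTER adds the CLUSTER class of fields to the display; if the quorum disk is healthy, QF_VOTE should show that its vote is being counted.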

Can HWNOD2 DIRECTLY access the quorum disk $1$DGA3: (i.e. does HWNOD2 have a Fibre Channel connection)?

Volker.