09-18-2016 08:57 PM
cluster node hangs when another node shutdown
09-18-2016 09:40 PM
Re: cluster node hangs when another node shutdown
You'd get much more valuable responses from the forums related to OpenVMS System Management. However... You need to be more complete in your description of your cluster. While you've set up two nodes to use individual system disks, this alone will NOT make your system more redundant or resilient. A cluster of OpenVMS systems is not like (m)any other clustering technologies, and one of the core concepts you need to research and learn is that of "*QUORUM*". Put simply, quorum is a voting scheme that determines whether you have enough votes to keep operating. Votes are assigned to the nodes and, in some cases, to a disk which is A) expected to be present and B) (generally) shared between the two systems in the cluster.
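For example, here's the arithmetic for the usual two-node-plus-quorum-disk arrangement (a sketch; your actual values depend on your configuration):

VOTES = 1 on each node (2 node votes)
QDSKVOTES = 1 on the quorum disk (1 disk vote)
EXPECTED_VOTES = 1 + 1 + 1 = 3
quorum = (EXPECTED_VOTES + 2) / 2 = 2 (integer division)

With a quorum of 2, the surviving node (1 vote) plus the quorum disk (1 vote) still meet quorum when the other node leaves, so the survivor keeps running.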
Sharing the output from the SHOW CLUSTER utility doesn't present itself well in these forums. I would recommend, instead, providing the output from the SYSMAN utility (from the command prompt: $ MC SYSMAN, which requires a privileged account):
SYSMAN> PARAM SHOW /CLUSTER
SYSMAN> PARAM SHOW /SCS
It would also be beneficial to look into the documentation with a focus on OpenVMS Cluster configurations and OpenVMS system management.
Please understand that there are features and configuration issues or items that a general forum can't decide FOR you. You and your company or organization need to know how these systems work, what their strengths are, and how their setup and configuration can best support your collective requirements. While "we" could help you with what appear to be simple questions like the ones you're asking above, they're really NOT simple, and the cluster must be properly set up for the configuration you need and how you expect it to act.
I would recommend working with local HP resources if you need more in-depth guidance and/or training, frankly.
bob
09-20-2016 12:39 AM
Re: cluster node hangs when another node shutdown
Dear Bob:
The detailed information about the cluster is as follows:
SYSMAN> PARAM SHOW/CLUSTER
%SYSMAN-I-USEACTNOD, a USE ACTIVE has been defaulted on node HWNOD1
Node HWNOD1: Parameters in use: ACTIVE
Parameter Name Current Default Minimum Maximum Unit Dynamic
-------------- ------- ------- ------- ------- ---- -------
VAXCLUSTER 2 1 0 2 Coded-value
EXPECTED_VOTES 2 1 1 127 Votes
VOTES 1 1 0 127 Votes
DISK_QUORUM "$1$DGA3 " " " " " "ZZZZ" Ascii
QDSKVOTES 1 1 0 127 Votes
QDSKINTERVAL 3 3 1 32767 Seconds
ALLOCLASS 5 0 0 255 Pure-number
LOCKDIRWT 1 0 0 255 Pure-number
CLUSTER_CREDITS 32 32 10 128 Credits
NISCS_CONV_BOOT 0 0 0 1 Boolean
NISCS_LOAD_PEA0 1 0 0 1 Boolean
NISCS_USE_LAN 1 1 0 1 Boolean
NISCS_USE_UDP 1 0 0 1 Boolean
MSCP_LOAD 1 0 0 16384 Coded-value
TMSCP_LOAD 0 0 0 3 Coded-value
MSCP_SERVE_ALL 1 4 0 -1 Bit-Encoded
TMSCP_SERVE_ALL 0 0 0 -1 Bit-Encoded
MSCP_BUFFER 1024 1024 256 -1 Coded-value
MSCP_CREDITS 32 32 2 1024 Coded-value
TAPE_ALLOCLASS 0 0 0 255 Pure-number
NISCS_MAX_PKTSZ 8192 8192 576 9180 Bytes
CWCREPRC_ENABLE 1 1 0 1 Bitmask D
RECNXINTERVAL 20 20 1 32767 Seconds D
NISCS_PORT_SERV 0 0 0 256 Bitmask D
NISCS_UDP_PORT 0 0 0 65535 Pure-number D
NISCS_UDP_PKTSZ 8192 8192 576 9000 Bytes
MSCP_CMD_TMO 0 0 0 2147483647 Seconds D
LOCKRMWT 5 5 0 10 Pure-number D
SYSMAN>
SYSMAN> PARAM SHOW/SCS
Node HWNOD1: Parameters in use: ACTIVE
Parameter Name Current Default Minimum Maximum Unit Dynamic
-------------- ------- ------- ------- ------- ---- -------
SCSBUFFCNT 512 50 0 32767 Entries
SCSRESPCNT 1000 1000 0 32767 Entries
SCSMAXDG 576 576 28 985 Bytes
SCSMAXMSG 216 216 60 985 Bytes
SCSSYSTEMID 1025 0 0 -1 Pure-number
SCSSYSTEMIDH 0 0 0 -1 Pure-number
SCSNODE "HWNOD1 " " " " " "ZZZZ" Ascii
PASTDGBUF 16 4 1 16 Buffers
SMCI_PORTS 1 1 0 -1 Bitmask
TIMVCFAIL 1600 1600 100 65535 10Ms D
SCSFLOWCUSH 1 1 0 16 Credits D
PRCPOLINTERVAL 30 30 1 32767 Seconds D
PASTIMOUT 5 5 1 99 Seconds D
PANUMPOLL 16 16 1 223 Ports D
PAMAXPORT 32 32 0 223 Port-number D
PAPOLLINTERVAL 5 5 1 32767 Seconds D
PAPOOLINTERVAL 15 15 1 32767 Seconds D
PASANITY 1 1 0 1 Boolean D
PANOPOLL 0 0 0 1 Boolean D
SMCI_FLAGS 0 0 0 -1 Bitmask D
SYSMAN>
SYSMAN> EXIT
$ SH DEV D
Device Device Error Volume Free Trans Mnt
Name Status Count Label Blocks Count Cnt
$1$DGA1: (HWNOD1) Mounted 0 HWNOD1 153010864 359 1
$1$DGA2: (HWNOD1) Online 0
$1$DGA3: (HWNOD1) Online 0
$1$DGA8: (HWNOD1) Online 0
$1$DGA9: (HWNOD1) Online 0
$1$DGA23: (HWNOD1) Online 0
$5$DKA100: (HWNOD1) Online 0
$5$DKA200: (HWNOD1) Mounted 0 (remote mount) 1
$5$DNA0: (HWNOD1) Offline 0
$5$DNA1: (HWNOD1) Online wrtlck 0
$
$
Looking forward to your reply.
Thanks very much.
BR
TONG
09-20-2016 09:39 AM
Re: cluster node hangs when another node shutdown
> SYSMAN> PARAM SHOW/CLUSTER
> [...]
> Node HWNOD1: [...]
Ok. And what do you see on the other node?
Also, which node hangs when you shut down which node?
09-20-2016 06:42 PM
Re: cluster node hangs when another node shutdown
Dear:
The other node has exactly the same output as node 1, except its SCSNODE is HWNOD2, not HWNOD1.
I have done two tests. When HWNOD1 shuts down, HWNOD2 hangs. When HWNOD2 shuts down, HWNOD1 hangs too.
Now this cluster has EXPECTED_VOTES set to 2, and it has 1 quorum disk.
Is that correct?
I found that the cluster can't add another quorum disk; when I use the command "@sys$manager:cluster_config" to enable an additional quorum disk, the quorum disk set before disappears.
BR
TONG
09-20-2016 07:21 PM
Re: cluster node hangs when another node shutdown
> The other node has exactly the same output as node 1, except its
> SCSNODE is HWNOD2, not HWNOD1.
I'd prefer to see the actual output, and make my own comparison.
> Now this cluster has EXPECTED_VOTES set to 2, and it has 1 quorum
> disk.
If each of the two nodes has one vote, and the quorum disk has one
vote, then I'd expect EXPECTED_VOTES to be three. The quorum would be
two, so either node plus the quorum disk would, together, have two
votes, which would satisfy the quorum requirement. As Mike Kier said in
your 6898169 thread:
> Your system should have EXPECTED_VOTES = 3 and a QUORUM of 2 with each
> node having a VOTE of 1, unless there is some compelling reason
> otherwise.
> I found that the cluster can't add another quorum disk, [...]
Having more than one quorum disk would cause more trouble than it
would solve. The "OpenVMS Cluster Systems" manual explains quorums and
"a" (or "the") quorum disk:
Rules: Each OpenVMS Cluster system can include only one quorum
disk. [...]
Also:
o To permit recovery from failure conditions, the quorum disk
must be mounted by all disk watchers.
o The OpenVMS Cluster can include only one quorum disk.
Are you mounting the quorum disk on each of the cluster member systems?
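For reference, a cluster-wide mount would be a one-liner along these lines (a sketch; the device name is from your SHOW DEVICE output, and the volume label is an assumption -- use whatever the disk was initialized with):

$ MOUNT/NOASSIST/CLUSTER $1$DGA3: volume_label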
Have you looked at the "OpenVMS Cluster Systems" manual?
http://h20565.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04623183
09-20-2016 08:51 PM
Re: cluster node hangs when another node shutdown
Dear:
I have checked the cluster documentation released by OpenVMS and HDS, and found that a quorum disk is only used in a 2-node cluster, and that only 1 quorum disk is recommended.
I also found that EXPECTED_VOTES and the quorum have the following relation:
estimated quorum = (EXPECTED_VOTES + 2)/2
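For example, with EXPECTED_VOTES = 3 this gives (3 + 2)/2 = 2 with integer division, i.e. a quorum of 2.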
I have tried to add a quorum disk, and it fails.
Since the quorum disk can't be changed, I have tried to change EXPECTED_VOTES to 1. However, when the system rebooted, it output an error, and EXPECTED_VOTES was changed back automatically.
What can I do in this case?
Looking forward to your reply.
BR
TONG
09-20-2016 11:20 PM - edited 09-20-2016 11:22 PM
Re: cluster node hangs when another node shutdown
> I have tried to add a quorum disk, and it fails.
Eh? As usual, showing actual commands with their actual output can
be more helpful than vague descriptions or interpretations. (Do you
mean that you couldn't add a _second_ quorum disk? That restriction is
documented. You can't do that.)
> Since the quorum disk can't be changed,
Why do you want to change it? What is it now? To what would you
like to change it?
> I have tried to change EXPECTED_VOTES to 1.
There's little sense in setting EXPECTED_VOTES to some unrealistic
value.
> However, when the system rebooted, it output an error,
Should we guess what that error message was, or are you willing to
tell us?
> and EXPECTED_VOTES was changed back automatically.
That's why there's little sense in setting EXPECTED_VOTES to some
unrealistic value. The cluster software can (and does) count the VOTES
of the cluster members when they join the cluster. Trying to fool it
with an unrealistic EXPECTED_VOTES value is a waste of time and effort.
Why are you trying to set it to 1 (when it should be 3)?
> What can I do in this case?
I don't know what "this case" is. As before, I'd like to see what
the following parameters are for each of the two nodes:
VAXCLUSTER
EXPECTED_VOTES (And, if it's not 3, why not?)
VOTES (And, if it's not 1, why not?)
DISK_QUORUM (And, if it's not the same on both nodes, why not?)
QDSKVOTES (And, if it's not 1, why not?)
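One way to gather all of those from both nodes in one pass (a sketch using the same SYSMAN utility shown earlier in this thread):

$ MC SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> PARAM USE ACTIVE
SYSMAN> PARAM SHOW VAXCLUSTER
SYSMAN> PARAM SHOW EXPECTED_VOTES
SYSMAN> PARAM SHOW VOTES
SYSMAN> PARAM SHOW DISK_QUORUM
SYSMAN> PARAM SHOW QDSKVOTES
SYSMAN> EXIT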
> Are you mounting the quorum disk on each of the cluster member
> systems?
Still wondering.
09-21-2016 02:14 AM
Re: cluster node hangs when another node shutdown
Dear:
I used the following commands to change EXPECTED_VOTES to 3:
$ RUN SYS$SYSTEM:SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> PARAM USE CURRENT
SYSMAN> PARAM SET EXPECTED_VOTES 3
SYSMAN> PARAM WRITE CURRENT
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> DO @sys$UPDATE:AUTOGEN GETDATA SETPARAMS
SYSMAN> EXIT
The method is found from the following address:
http://h30266.www3.hp.com/odl/i64os/opsys/vmsos84/4477/4477pro_020.html#post_config
Then I restarted the cluster. The node that uses a SAN storage disk as its system disk always hits the following problem:
**** OpenVMS I64 Operating System V8.4 -BUGCHECK ****
**Bugcheck code =000001cc: INVEXCEPTN, Exception while above ASTDEL
** Crash CPU:00000000 Primary CPU: 00000000 Node Name:HWNOD1
**Highest CPU number:00000007
**Active CPUs:00000000.000000FF
**Current Process:NULL
**Current PSB ID:00000001
**Image Name:
Is the method I used to change EXPECTED_VOTES right?
BR
TONG
09-21-2016 05:44 AM
Re: cluster node hangs when another node shutdown
Tong,
you've probably overwritten the desired value of EXPECTED_VOTES by running AUTOGEN again.
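If so, the usual way to make the value stick is to add it to MODPARAMS.DAT on each node, since AUTOGEN regenerates the parameters from that file (a sketch; the parameter values are the ones discussed in this thread):

! SYS$SYSTEM:MODPARAMS.DAT on each node
EXPECTED_VOTES = 3
VOTES = 1
DISK_QUORUM = "$1$DGA3"
QDSKVOTES = 1

then re-run $ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS and reboot.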
The INVEXCEPTN crash may have NOTHING to do with clustering at all. Can you boot HWNOD1 from SAN storage ($1$DGA1) as the only node in the cluster ? Did a QUORUM file get generated on $1$DGA3:[000000]QUORUM.DAT - did you ever MOUNT the quorum disk ?
Check with $ SHOW CLUSTER/CONT, then type ADD CLUSTER. What's shown in the QF_VOTE column ?
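That is (the exact prompt text may differ by version):

$ SHOW CLUSTER/CONTINUOUS
Command> ADD CLUSTER

ADD CLUSTER adds the CLUSTER class to the display, which includes the QF_VOTE field.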
Can HWNOD2 DIRECTLY access the quorum disk $1$DGA3: (i.e. does HWNOD2 have a fibre channel connection) ?
Volker.
09-21-2016 07:55 PM
Re: cluster node hangs when another node shutdown
Dear:
Both HWNOD1 and HWNOD2 can access the quorum disk directly.
When I boot only HWNOD1 from the SAN storage system, it enters the following status:
%PEA0,cluster communication enabled on IP interface, WE0
%PEA0,successfully initialized with TCP/IP services
%PEA0,setting socket option failed.
It always hangs at this step until I boot the other node, HWNOD2; then it can enter the system.
I use TCP/IP for these two nodes to communicate with each other.
Is it wrong?
Looking forward to your reply.
BR
TONG
09-21-2016 08:26 PM
Re: cluster node hangs when another node shutdown
> It always hangs at this step until I boot the other node, HWNOD2;
> then it can enter the system.
That suggests (to me) that the quorum disk is not doing its job.
Previous questions about your quorum disk remain unanswered.
> %PEA0,setting socket option failed.
> I use TCP/IP for these two nodes to communicate with each other.
>
> Is it wrong?
I've never used IP for the cluster interconnect, so I know nothing,
but...
I don't like the "setting socket option failed" message, but if the
cluster works with both nodes up, then the cluster interconnect would
seem to be working properly.
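If you do want to sanity-check the IP-interconnect configuration anyway, the cluster-over-IP settings are normally kept in SYS$SYSTEM:PE$IP_CONFIG.DAT (that file name comes from the cluster-over-IP documentation, not from anything shown in this thread):

$ TYPE SYS$SYSTEM:PE$IP_CONFIG.DAT ! compare the entries on both nodes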
09-21-2016 09:05 PM
Re: cluster node hangs when another node shutdown
> > It always hangs at this step until I boot the other node, HWNOD2;
> > then it can enter the system.
>
> That suggests (to me) that the quorum disk is not doing its job.
I can't remember if I ever used a quorum disk in a cluster, so I know
nothing, but...
The documentation suggests that "the quorum disk must be mounted by
all disk watchers". The system (boot) disk is mounted by the boot
procedure, but if the quorum disk is mounted by the normal start-up
scripts (like SYSTARTUP_VMS.COM), then it won't be available until the
system is (mostly) up (_after_ forming or joining the cluster).
If that's true, then the quorum disk would be useless in _forming_
the cluster; its only value would be in maintaining the quorum when one
of the cluster members _leaves_ the cluster.
So, the question would be this: After both nodes have been booted
(and are cluster members, and have mounted the quorum disk with its
QUORUM.DAT file), if you shut down one of the cluster members, does the
other cluster member continue to work, or does the cluster lose its
quorum, and freeze the remaining cluster member?
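In command terms, the test is just this (a sketch using only standard commands):

$ SHOW CLUSTER ! on each node: confirm both members are visible
$ @SYS$SYSTEM:SHUTDOWN ! on one node only
$ SHOW TIME ! on the surviving node: a hang here means quorum was lost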
09-22-2016 05:07 AM
Re: cluster node hangs when another node shutdown
Dear:
I re-installed the 2 nodes with votes=3.
Then I mounted the quorum disk with the command: mount /noassist /cluster devname vol_label. The cluster info is as follows:
+-------------------------------------------------------------------------------
| CLUSTER
+--------+-----------+----------+---------+------------+-------------------+----
| CL_EXP | CL_QUORUM | CL_VOTES | QF_VOTE | CL_MEMBERS | FORMED            | LA
+--------+-----------+----------+---------+------------+-------------------+----
|      3 |         2 |        3 | YES     |          2 | 22-SEP-2016 11:40 | 22-
+--------+-----------+----------+---------+------------+-------------------+----
Now when I restart either node, the other node still works well.
Thanks for your help.
BR
TONG
09-22-2016 09:23 AM
Re: cluster node hangs when another node shutdown
> I re-installed the 2 nodes with votes=3.
"with votes=3"? Does that mean one vote (VOTES = 1) for each node,
plus one vote for the cluster disk (QDSKVOTES = 1), so EXPECTED_VOTES =
3? If not, then what does it mean?
09-24-2016 01:16 AM
Re: cluster node hangs when another node shutdown
It should be noted that you don't need to have the QUORUM disk mounted once the quorum file ([000000]QUORUM.DAT) has been successfully created. But to CREATE the quorum file after the initial cluster configuration, the quorum disk MUST be mounted system-wide on at least one of the quorum-disk-watcher nodes at least once (with the cluster up and running without the quorum disk votes, i.e. QF_VOTE = NO).
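A quick way to verify (a sketch; the device name is the one used earlier in this thread, and the volume label is whatever the disk was initialized with):

$ MOUNT/SYSTEM $1$DGA3: volume_label
$ DIRECTORY $1$DGA3:[000000]QUORUM.DAT ! the quorum file should now exist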
Volker.