Operating System - OpenVMS

CO_5
Advisor

One node shutdown on an OpenVMS clustered system

When I performed a standalone shutdown of the secondary node on a 2-node clustered system, the primary node experienced a 'hang'. May I know why that happened?
11 REPLIES
Volker Halle
Honored Contributor

Re: One node shutdown on an OpenVMS clustered system

CO,

welcome to the OpenVMS ITRC forum.

- Did you shut down the node with the REMOVE_NODE shutdown option?
- What are the settings of VOTES on the 2 nodes, and is a quorum disk in use?

If you've set up the cluster with 2 nodes with 1 vote each and EXPECTED_VOTES=2, and you shut down one node without REMOVE_NODE, the remaining node will NOT adjust quorum and will hang waiting for the second vote.
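You can check the relevant values on each node with SYSGEN; a minimal check (these are the standard parameter names):

$ MCR SYSGEN
SYSGEN> SHOW VOTES
SYSGEN> SHOW EXPECTED_VOTES
SYSGEN> SHOW DISK_QUORUM
SYSGEN> EXIT

VOTES is what this node contributes, EXPECTED_VOTES is the total expected for the full cluster, and DISK_QUORUM names the quorum disk (blank if none).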

Volker.
Jan van den Ende
Honored Contributor

Re: One node shutdown on an OpenVMS clustered system

CO,

Welcome from me as well.

Would you care to explain what exactly you mean by "secondary node" and by "standalone shutdown"?
Is this a cluster with a boot node and a satellite, or two equivalent nodes?
In the first case, does the satellite have VOTES > 0?
In the latter case, is there a quorum disk?
Like Volker asked, what are the values of VOTES (on both nodes) and of EXPECTED_VOTES (which should be equal to the sum of all votes, including QDSKVOTES, and the same on every node)?
Most important, did you specify "REMOVE_NODE" as a shutdown option?

Please answer these, and we will be able to sort things out.

Success,

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Uwe Zessin
Honored Contributor

Re: One node shutdown on an OpenVMS clustered system

... and remember that the REMOVE_NODE option only works in VMS V6.2 and higher (must have been a very complicated problem ;-)
CO_5
Advisor

Re: One node shutdown on an OpenVMS clustered system

Hello everyone,

OK. I have a clustered system (OpenVMS 7.3.1) with 2 nodes: primary and secondary systems. Both have their own IP addresses.

I did not use the REMOVE_NODE shutdown option. If I use the REMOVE_NODE option, will the secondary rejoin the primary automatically when I boot it up again?

All I want to do is shut down ONLY the secondary system WITHOUT any impact on the primary system, and then, when I am ready, rejoin the 2 systems as a cluster again.

Currently, when I used SHUTDOWN on the secondary system, the primary system hung for some time. (I was not sure how long it would last, so I had to restart the secondary machine.)

Here is the SHOW CLUSTER info for both the primary and secondary systems.

View of Cluster from system ID 10241 node: MPRI    12-DEC-2005 15:55:12

+-------------------+---------+
|      SYSTEMS      | MEMBERS |
+--------+----------+---------+
|  NODE  | SOFTWARE | STATUS  |
+--------+----------+---------+
| MPRI   | VMS V7.3 | MEMBER  |
| MSEC   | VMS V7.3 | MEMBER  |
+--------+----------+---------+

+-------------------------------------------------------------------------------
|                                    CLUSTER
+--------+-----------+----------+---------+------------+-------------------+----
| CL_EXP | CL_QUORUM | CL_VOTES | QF_VOTE | CL_MEMBERS |      FORMED       | LA
+--------+-----------+----------+---------+------------+-------------------+----
|   4    |     3     |    4     |   NO    |     2      |  3-OCT-2005 15:39 | 8-

View of Cluster from system ID 10242 node: MSEC    12-DEC-2005 18:01:51

+-------------------+---------+
|      SYSTEMS      | MEMBERS |
+--------+----------+---------+
|  NODE  | SOFTWARE | STATUS  |
+--------+----------+---------+
| MSEC   | VMS V7.3 | MEMBER  |
| MPRI   | VMS V7.3 | MEMBER  |
+--------+----------+---------+

+-------------------------------------------------------------------------------
|                                    CLUSTER
+--------+-----------+----------+---------+------------+-------------------+----
| CL_EXP | CL_QUORUM | CL_VOTES | QF_VOTE | CL_MEMBERS |      FORMED       | LA
+--------+-----------+----------+---------+------------+-------------------+----
|   4    |     3     |    4     |   NO    |     2      |  3-OCT-2005 15:39 | 8-

The device info for MPRI and MSEC is as below:
MPRI:

Device                 Device           Error   Volume           Free   Trans Mnt
 Name                  Status           Count    Label          Blocks  Count Cnt
DSA100:                Mounted              0  DATA1          53815929    202   2
$10$DKD0:    (MPRI)    Mounted              0  ALPHASYS_PRI   48014892    489   1
$10$DKD1:    (MPRI)    ShadowSetMember      0  (member of DSA100:)
$10$DQA0:    (MPRI)    Online               0
$10$DQA1:    (MPRI)    Online               1
$10$DQB0:    (MPRI)    Online               1
$10$DQB1:    (MPRI)    Online               1
$11$DKD0:    (MSEC)    Mounted              0  (remote mount)                    1
$11$DKD1:    (MSEC)    ShadowSetMember      0  (member of DSA100:)
$11$DQA0:    (MSEC)    Online               0
$11$DQA1:    (MSEC)    Online               0
$11$DQB0:    (MSEC)    Online               0
$11$DQB1:    (MSEC)    Online               0

MSEC:

Device                 Device           Error   Volume           Free   Trans Mnt
 Name                  Status           Count    Label          Blocks  Count Cnt
DSA100:                Mounted              0  DATA1          53815860      1   2
$10$DKD0:    (MPRI)    Mounted              0  (remote mount)                    1
$10$DKD1:    (MPRI)    ShadowSetMember      0  (member of DSA100:)
$10$DQA0:    (MPRI)    Online               0
$10$DQA1:    (MPRI)    Online               0
$10$DQB0:    (MPRI)    Online               0
$10$DQB1:    (MPRI)    Online               0
$11$DKD0:    (MSEC)    Mounted              0  ALPHASYS_SEC   47586126    629   1
$11$DKD1:    (MSEC)    ShadowSetMember      0  (member of DSA100:)
$11$DQA0:    (MSEC)    Online               0
$11$DQA1:    (MSEC)    Online               1
$11$DQB0:    (MSEC)    Online               1
$11$DQB1:    (MSEC)    Online               1



One more question: how do I know what other resources (besides disks) are shared between the 2 nodes?

Thanks.
Volker Halle
Honored Contributor

Re: One node shutdown on an OpenVMS clustered system

CO,

it looks like both nodes have 2 VOTES each and there is no quorum disk. This will make EXPECTED_VOTES=4 and QUORUM=3. If you shut down one node WITHOUT the REMOVE_NODE option, the other node will hang with CL_VOTES = 2 < CL_QUORUM = 3 until you bring back the stopped node.
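For reference, OpenVMS derives quorum as QUORUM = (EXPECTED_VOTES + 2) / 2, truncated. With your values:

    QUORUM = (4 + 2) / 2 = 3
    both nodes up:                            CL_VOTES = 4 >= 3  -> cluster runs
    one node shut down without REMOVE_NODE:   CL_VOTES = 2 <  3  -> the survivor hangs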

If you use REMOVE_NODE, CL_EXP will be reduced to 2 and CL_QUORUM will be reduced to 2, so the other node can continue.
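In case it helps, REMOVE_NODE is entered at the options prompt of the standard shutdown procedure; roughly like this (prompt text from memory, details vary slightly by version):

$ @SYS$SYSTEM:SHUTDOWN
...
Shutdown options (enter as a comma-separated list):
   REMOVE_NODE        Remaining nodes in the cluster should adjust quorum
   CLUSTER_SHUTDOWN   Entire cluster is shutting down
Shutdown options [NONE]: REMOVE_NODE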

If you were able to use a quorum disk (a shared, non-shadowed disk directly accessible by both nodes - this does not seem possible in your config, which apparently only has local SCSI buses), you could live without the REMOVE_NODE option during shutdown, but it would increase your cluster state transition time a bit.
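Just for completeness, a quorum disk is configured via the DISK_QUORUM and QDSKVOTES parameters in SYSGEN. A sketch only - $10$DKD2 is a made-up device name here, and EXPECTED_VOTES must then include QDSKVOTES (2 + 2 + 1 = 5 in your case):

$ MCR SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SET DISK_QUORUM "$10$DKD2"
SYSGEN> SET QDSKVOTES 1
SYSGEN> SET EXPECTED_VOTES 5
SYSGEN> WRITE CURRENT
SYSGEN> EXIT

Both nodes need the same settings (and a reboot) for this to take effect.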

Did you know about the IPC interrupt for re-calculating quorum on a system hung due to quorum loss?

It goes like this:

Press HALT to get to the console prompt

>>> D SIRR C
>>> C
IPC> Q
IPC>
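(Background: D SIRR C requests a software interrupt at IPL C, the C command continues the halted system, and that interrupt drops you into the IPC> prompt, where Q recalculates quorum. Exit IPC with Ctrl/Z.)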

Or you could use DECamds / Availability Manager and the Fix Quorum function.

The above is only necessary if you FORGOT the REMOVE_NODE option during shutdown, or if one of the systems suddenly breaks down.

Besides the disks (and files on them), the lock manager database is also shared between the 2 nodes.
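If you want to see the disk side of that from DCL, something like this lists the MSCP-served devices and the shadow set as seen from either node:

$ SHOW DEVICES/SERVED
$ SHOW DEVICE/FULL DSA100: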

Volker.
CO_5
Advisor

Re: One node shutdown on an OpenVMS clustered system

Dear Volker,

First, thank you very much for your valuable advice. I appreciate it.

If I would like to rejoin the secondary node (which was brought down using the REMOVE_NODE option) back to the cluster, does it mean I have to do the below on the secondary machine? With the below execution, it will bring CL_VOTES back to 4, which will be greater than the quorum value (3), right?

Press HALT to get to the console prompt

>>> D SIRR C
>>> C
IPC> Q
IPC>
CO_5
Advisor

Re: One node shutdown on an OpenVMS clustered system

Sorry, what I meant was: if I would like to rejoin the secondary node (which was shut down using REMOVE_NODE) back to the cluster, I just need to boot the secondary machine up without any special options, just like bringing a standalone machine up. Right?
John Abbott_2
Esteemed Contributor
Solution

Re: One node shutdown on an OpenVMS clustered system

Yes, just [re]boot the shut-down system using the same system disk/same root. The cluster should spring back to life.

... assuming you haven't made any SYSGEN parameter changes that might affect the cluster or the system just shut down.

Kind Regards
John.
Don't do what Donny Dont does
Volker Halle
Honored Contributor

Re: One node shutdown on an OpenVMS clustered system

CO,

As John already said, just boot the secondary node normally; it will join the cluster, and the primary will then continue, because there are now enough votes to satisfy quorum.
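Once the secondary is back up, a quick $ SHOW CLUSTER on either node should again list both nodes as MEMBER, with CL_VOTES back at 4 against CL_QUORUM = 3.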

The IPC interrupt could have been used to revive the primary node, once it hung after shutting down the secondary node (and forgetting to specify the REMOVE_NODE shutdown option).

Volker.