Operating System - OpenVMS

Temporarily removing a cluster member

 
Art Wiens
Respected Contributor

Temporarily removing a cluster member

I need to shut down one VAX in our cluster for ~4 or 5 hours in order to replace the internal system disk (and the internal backup disk). It's actually the backup disk that has logged "many" errors, but I want to get rid of both internal disks and put a SW shelf on it. I also want to add a second backup disk so that the backups can be staggered on alternate days.

Is there a limitation (SYSGEN setting?) as to how long a cluster member can be removed from a cluster if the shutdown option REMOVE_NODE is used?

As a workaround if there is an issue, I thought I might be able to boot minimum off the backup disk (assuming it can still boot) to rejoin the cluster and do the image backup of the system disk to the "new" SW disk.

Cheers,
Art
17 REPLIES
Wim Van den Wyngaert
Honored Contributor
Solution

Re: Temporarily removing a cluster member

There is no limitation.

But if the remaining node reboots while your node is down, whether it can come back up on its own depends on the configuration (EXPECTED_VOTES and quorum disk settings).
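
To make the dependency concrete, here is a minimal sketch in Python (not VMS code). It assumes the commonly documented rule that a booting node derives quorum as (EXPECTED_VOTES + 2) / 2, truncated, and it ignores everything else the connection manager does. The function name `can_form_cluster` and the example vote counts are made up for illustration.

```python
def can_form_cluster(expected_votes: int, votes_present: int) -> bool:
    """Simplified check: can a rebooting node regain quorum by itself?

    Assumption: quorum is derived as (EXPECTED_VOTES + 2) // 2 and the
    node proceeds only if the votes it can see reach that value.
    """
    quorum = (expected_votes + 2) // 2
    return votes_present >= quorum


# Hypothetical 2-node cluster, VOTES=1 each, EXPECTED_VOTES=2, no quorum disk:
# the lone rebooting node sees only its own vote.
print(can_form_cluster(expected_votes=2, votes_present=1))

# Same nodes plus a quorum disk with 1 vote, EXPECTED_VOTES=3:
# the lone node sees its own vote and the quorum disk's vote.
print(can_form_cluster(expected_votes=3, votes_present=2))
```

In this toy model the first configuration would wait for the other member, while the second could come up alone thanks to the quorum disk.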

Wim
Jan van den Ende
Honored Contributor

Re: Temporarily removing a cluster member

Art,


Is there a limitation (SYSGEN setting?) as to how long a cluster member can be removed from a cluster if the shutdown option REMOVE_NODE is used?


For all practical purposes, no.

One minor issue:

If you reboot another member during your member-down period, then EXPECTED_VOTES is reset, effectively undoing the REMOVE_NODE.

This might, or might not, lead to a situation where a system crash leads to a hang, depending on your total quorum scheme.

Success.

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.
Uwe Zessin
Honored Contributor

Re: Temporarily removing a cluster member

Starting with OpenVMS V6.2, the REMOVE_NODE option works (finally!) and will adjust quorum properly. You can safely shut down a multi-node cluster one node at a time, without any hangs or quorum losses, until you shut down the last one (we tried it in 1997 on a multi-node, two-datacenter cluster).
Volker Halle
Honored Contributor

Re: Temporarily removing a cluster member

Art,

there is no such limit. You can shut down this node with or without using the REMOVE_NODE option. REMOVE_NODE is used to reduce quorum (if the local node has VOTES > 0) and to use DISMOUNT/CLUSTER when dismounting locally connected served disks.

If there are locally connected disks on that node being MSCP-served cluster-wide and you don't dismount them from the other nodes, you risk the disks going into mount verification on the other nodes (and timing out after MVTIMEOUT seconds).

Volker.
Uwe Zessin
Honored Contributor

Re: Temporarily removing a cluster member

Jan,
I might misunderstand your words, but if a new system with VOTES joins the cluster, then the QUORUM is increased. EXPECTED_VOTES is used during boot to calculate the required QUORUM [*]. A value that is too high cannot cause a running cluster to lose quorum by increasing the quorum value.

I know this because I accidentally tried to boot a system into a running cluster with EXPECTED_VOTES set too high. The cluster software did not let the system join the cluster. There was no helpful error message, so it took me some time to figure it out.


[*] The system manager was originally required to calculate the QUORUM value on his/her own, but VMS engineering made it easier by introducing the EXPECTED_VOTES parameter and asking the system manager to sum up the VOTES.
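
For readers who have not seen the calculation the footnote refers to, the rule is usually given as quorum = (EXPECTED_VOTES + 2) / 2 with the fraction dropped. The Python sketch below only illustrates that arithmetic; the function name is made up, and the real connection manager also takes the votes and quorum values of the running members into account.

```python
def quorum_from_expected_votes(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


# Tabulate the result for a few small EXPECTED_VOTES values.
for expected_votes in range(1, 6):
    print(expected_votes, quorum_from_expected_votes(expected_votes))
```

Because of the truncation, an odd number of expected votes needs a bare majority, while an even number needs half plus one.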
Art Wiens
Respected Contributor

Re: Temporarily removing a cluster member

Thanks. The local system disk is not mounted on any other nodes. I'm not anticipating any other nodes rebooting while I'm doing this (they haven't for ~300 days, why would they now ;-), and I have no problems with votes and quorum; I have rebooted individual nodes and everything carries on. I just can't recall having one node out for such a long duration.

Thanks again...

Art
Ian Miller.
Honored Contributor

Re: Temporarily removing a cluster member

If the disks on the system being shut down are mounted on other systems in the cluster, then it is best to dismount them cluster-wide.
(In the past I have forgotten to do this.)
____________________
Purely Personal Opinion
Dirk Bogaerts
Frequent Advisor

Re: Temporarily removing a cluster member

Art,

I have done this several times on my 2-node Alpha cluster for hardware maintenance (and even a VMS upgrade), as the whole cluster cannot be stopped for more than 30 minutes. However, during shutdown of 1 node, I always have short hangs of about 1 to 2 minutes on the whole cluster, probably during quorum re-adjustment. It's still somewhere on my 'to do' list to check if something can be done about this (maybe reduce the hang period). It should normally not be a big issue, unless you have some real-time stuff running on the surviving node. I have also learned to disconnect the shared HSZ SCSI bus, as INIT-ing the node in maintenance resets the SCSI bus, which briefly 'interrupts' normal activity on the active node and also produces some benign device errors on the HSZ disks.
Also check whether you have 'connections' which don't fail over automatically; I still have a modem connected to the serial port of one node, which needs to be re-plugged by hand. Finally, this might be the occasion on which you discover some small config differences (I once had a terminal server connection which failed if a specific node was the only active one; a small typo in the LAT config).

Dirk
Jan van den Ende
Honored Contributor

Re: Temporarily removing a cluster member

Dirk,


However, during shutdown of 1 node, I always have short hangs of about 1 to 2 minutes on the whole cluster, probably during quorum re-adjustment. It's still somewhere on my 'to do' list, to check if something can be done about this (maybe reduce the hang period).


Reduce your SYSGEN parameter RECNXINTERVAL.
(It is DYNAMIC.)
Please set it to the same value on all nodes!

Success

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Uwe Zessin
Honored Contributor

Re: Temporarily removing a cluster member

Most likely that comes from multiple SCSI resets, which make the servers lose access to the quorum disk. I'd check the value of QDSKINTERVAL. The default used to be 10 seconds, but has been lowered to 3 for some time now. Still, the cluster software waits multiple QDSKINTERVALs to make sure that the contents of the quorum file are stable.
Wim Van den Wyngaert
Honored Contributor

Re: Temporarily removing a cluster member

Dirk,

Are you sure you specified REMOVE_NODE ?
I just tested it on a 4000 (6.2) and there was no delay (I did a dir every 5 seconds).

RECNX is used only for a crash case or a shutdown without remove_node.

Wim
Jan van den Ende
Honored Contributor

Re: Temporarily removing a cluster member

Wim,

You are right.

Dirk, ignore my previous posting.

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Uwe Zessin
Honored Contributor

Re: Temporarily removing a cluster member

I'd be very surprised if a cluster member that is shut down without REMOVE_NODE does _not_ send a 'last gasp' datagram, but lets the other nodes run into a timeout (I have never seen that).

The primary purpose is:
-----------------------
Shutdown options (enter as a comma-separated list):
REMOVE_NODE ___ Remaining nodes in the cluster should adjust quorum
Wim Van den Wyngaert
Honored Contributor

Re: Temporarily removing a cluster member

Jan,

I was wrong too. Uwe is right.

It's only used when crashing. I always use remove_node but I now tested it without it.

Dirk: if it takes minutes, I would say that something is wrong. Check the operator log file and verify that it really took that long.

Wim
comarow
Trusted Contributor

Re: Temporarily removing a cluster member

Hopefully you have enough votes to stay up without the troublesome node. If not, I would recommend reconfiguring your cluster so you can survive the loss of a node.

Whatever you do, don't boot the node with VAXCLUSTER=0 and mount any shared disks, or you will corrupt them.

Assuming the rest of your cluster has enough votes, it can survive with the node removed until it rejoins the cluster.
Uwe Zessin
Honored Contributor

Re: Temporarily removing a cluster member

What's your point, Bob?

Even if the cluster has the same number of votes as the quorum value, all it takes is V6.2+ and the REMOVE_NODE option on shutdown: it will properly adjust the quorum, so that the cluster keeps working without that node.

If the cluster doesn't have 'enough votes', it is hanging and you can't properly shut down ANY system anyway! You first need to tell the cluster SW to recalculate the quorum value, e.g. via IPC>
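
The difference REMOVE_NODE makes can be shown with a toy model in Python. It assumes that with REMOVE_NODE the survivors recompute quorum from the remaining votes using (votes + 2) / 2, truncated, and that without it the quorum value simply stays where it was. The function names and the four-node example are hypothetical, and the model leaves out quorum disks and everything else the cluster software does.

```python
def quorum_for(votes: int) -> int:
    """Quorum for a given vote total: (votes + 2) / 2, truncated."""
    return (votes + 2) // 2


def shut_down_one_by_one(node_votes, remove_node):
    """Shut nodes down in order, stopping before the last one.

    Returns a list of (remaining_votes, quorum, still_running) tuples,
    one per shutdown. Simplified model, not actual VMS behaviour.
    """
    remaining = sum(node_votes)
    quorum = quorum_for(remaining)
    history = []
    for votes in node_votes[:-1]:
        remaining -= votes
        if remove_node:
            # Assumption: survivors lower quorum to match the votes left.
            quorum = quorum_for(remaining)
        history.append((remaining, quorum, remaining >= quorum))
    return history


# Hypothetical four-node cluster with one vote per node.
print(shut_down_one_by_one([1, 1, 1, 1], remove_node=True))
print(shut_down_one_by_one([1, 1, 1, 1], remove_node=False))
```

In this model the cluster keeps quorum down to the last node when REMOVE_NODE is used, and drops below quorum after the second shutdown when it is not.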
Art Wiens
Respected Contributor

Re: Temporarily removing a cluster member

I had the node down for ~1.5 hours while I did the image backup and removed the old disks ... no issue.

Thanks All,
Art