
Changing votes in a cluster without a cluster reboot

 
Hoff
Honored Contributor

Re: Changing votes in a cluster without a cluster reboot

> Hence, the use of a quorum disk is to ensure that even if multiple nodes go down (worst case: only one node may be up), the cluster would still be up. This would not be possible if there were no quorum disk.

To clarify Murali's statement: a quorum disk trades slightly longer cluster transition times for avoiding the (usually much longer) operator-initiated quorum transitions. A quorum disk transition takes roughly three times your quorum disk polling interval; with polling set to, say, 1 to 3 seconds, you can transition in roughly 3 to 9 seconds, given suitably stable networks and suitably tuned quorum-related system parameters.
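As a sketch of where that polling interval lives: the quorum disk is configured through SYSGEN parameters. The device name and values below are purely illustrative, not a recommendation for any particular cluster:

```
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SET DISK_QUORUM "$1$DGA100"   ! quorum disk device (example name)
SYSGEN> SET QDSKVOTES 1               ! votes contributed by the quorum disk
SYSGEN> SET QDSKINTERVAL 1            ! quorum disk polling interval, seconds
SYSGEN> WRITE CURRENT
SYSGEN> EXIT
```

These should be set consistently on all members that can access the quorum disk, and changes of this sort generally take effect at the next reboot.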

An experienced and savvy operations staff can degrade a cluster without the use of a quorum disk, but will need to adjust votes on the way down and must manually remove the blade-guards that avoid partitioning when dropping into or below a quorum=2 configuration.
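Adjusting votes "on the way down" is typically a matter of lowering the cluster-wide expected votes after a voting member leaves, so the survivors retain quorum. A minimal sketch (the value here is an example, not a recipe):

```
$ ! After taking a voting member out of service, lower the
$ ! cluster-wide expected votes so the survivors keep quorum.
$ SET CLUSTER/EXPECTED_VOTES=3
$ SHOW CLUSTER          ! verify votes, expected votes and quorum afterward
```

The orderly-shutdown path does some of this for you: the standard shutdown procedure (@SYS$SYSTEM:SHUTDOWN) offers a REMOVE_NODE option that tells the remaining members the departing node's votes should no longer be expected.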

And the manual transitions are for naught if you can't sequence orderly host shutdowns for whatever reason.

Both the quorum hangs and these manual quorum adjustments can adversely affect applications. During the transition and prior to the manual input, processes that need quorum (which is most of them) will be blocked by the OpenVMS scheduler. (In the more ancient of times, IPL was used to block activity during the quorum hang. That changed at V5.2.)
Andy Bustamante
Honored Contributor

Re: Changing votes in a cluster without a cluster reboot

>>> An experienced and savvy operations staff can degrade a cluster without the use of a quorum disk . . .

One utility which allows you to force quorum to be recalculated on the fly is Availability Manager. See http://h71000.www7.hp.com/openvms/products/availman/index.html. AM won't help force quorum when rebooting a cluster.
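For completeness: where console access exists, the IPC facility can also force a quorum recalculation on a node blocked in a quorum hang. Roughly, and from memory (check the platform's documentation before relying on this):

```
$ ! From the console of a node blocked in a quorum hang:
$ ! halt to the console (e.g. Ctrl/P), then enter the IPC facility.
IPC> Q          ! recalculate quorum against the current voting members
                ! then exit IPC and let the system continue
```

In a lights-out site neither AM nor IPC helps, of course, unless the consoles are reachable remotely.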



If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
John McL
Trusted Contributor

Re: Changing votes in a cluster without a cluster reboot

Okay, you are saying that multiple node failures, down to just one remaining node, should not be enough to stop the cluster from staying up.

Personally I think the minimum acceptable number of nodes is an issue defined by business requirements. If the same applications were available on all nodes of a 4-node cluster, then there would be a lot of redundancy if provision were made for 3 nodes to fail, leaving 1 node to run all tasks for that application.

Maybe applications are only available on certain nodes, in which case the failure of some nodes would disable certain applications. That's fine if you can live with it, but maybe some of those applications are critical.

It all comes down to a business question of which services should continue after a partial failure, bearing in mind that the failing nodes could be any of them. A totally homogeneous cluster, with all applications available on all nodes, has advantages in the case of node failure. Setting expected votes to just over half the total machines is a reasonable general policy for a homogeneous environment, because unless you have a lot of spare capacity the remaining nodes will likely become overloaded once they are carrying double the workload.
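The "just over half" guidance follows from the quorum arithmetic: OpenVMS derives quorum as (EXPECTED_VOTES + 2) / 2, with the result truncated. A quick DCL illustration (the symbol names are just for the example):

```
$ ! Quorum is derived as (EXPECTED_VOTES + 2) / 2, truncated.
$ EXPECTED_VOTES = 4                  ! e.g. four one-vote nodes
$ QUORUM = (EXPECTED_VOTES + 2) / 2   ! DCL integer division truncates
$ SHOW SYMBOL QUORUM                  ! displays 3
```

So a 4-node, 4-vote cluster has a quorum of 3 and survives only a single node loss; keeping more members running after multiple failures requires extra votes (such as a quorum disk) or vote adjustments, as discussed above.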

The important issue is to configure the votes and expected votes in a way that corresponds to the requirements of the business.
Carleen Nutter
Advisor

Re: Changing votes in a cluster without a cluster reboot

Thanks to all who responded. I am grateful for the wealth of knowledge provided by forum members.

This cluster should, indeed, remain up with only 1 node (and the QDSK votes). As John indicated, the business requirements dictate this setup.

The cluster resides in a lights-out data center, so immediate operator intervention to force a recalculation of quorum is not an option. We have no operations staff at all, savvy or otherwise. Any response to a cluster problem is via a page to an on-call person, and off-hours the response time is at best 10-15 minutes, which is too long to rely on manual intervention.