Operating System - HP-UX

Howto reestablish cluster lock on running cluster?

SOLVED
Ralph Grothe
Honored Contributor

Howto reestablish cluster lock on running cluster?

Hello,

Yesterday I reconfigured this three-node cluster.

I added a cluster lock disk to the new configuration because an HP whitepaper on optimizing failover advises setting one up (or a quorum server) even for three- and four-node clusters, although it isn't strictly required there to break a quorum tie.

Due to pressing time constraints I focused on failover tests after the reconfiguration and obviously didn't pay careful enough attention to the entries from cmclconfd.
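
For reference, a lock disk is declared in the cluster ASCII configuration with entries roughly like the following (excerpt only, quoted from memory; the other nodes get analogous FIRST_CLUSTER_LOCK_PV lines in their NODE_NAME sections):

FIRST_CLUSTER_LOCK_VG     /dev/vgdat1

NODE_NAME                 jupiter
  FIRST_CLUSTER_LOCK_PV   /dev/dsk/c7t0d0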

So this morning I discovered these disturbing entries in the cluster master's syslog.log:


# grep cmcld /var/adm/syslog/syslog.log|tail -3
Jun 27 09:36:40 jupiter cmcld: WARNING: Cluster lock on disk /dev/dsk/c7t0d0 is missing!
Jun 27 09:36:40 jupiter cmcld: Until it is fixed, a single failure could
Jun 27 09:36:40 jupiter cmcld: cause all nodes in the cluster to crash


Whereas yesterday these entries had already appeared, which I missed:


Jun 26 11:00:50 jupiter cmclconfd[3970]: Failed to release volume group /dev/vgdat3
Jun 26 11:00:54 jupiter cmclconfd[3970]: Failed to release volume group /dev/vgdat4
Jun 26 11:00:54 jupiter cmclconfd[3970]: Failed to release volume group /dev/vgdat5
Jun 26 11:00:55 jupiter cmclconfd[3970]: Failed to release volume group /dev/vgbz
Jun 26 11:00:55 jupiter cmclconfd[3970]: Failed to release volume group /dev/vgzlb
Jun 26 11:01:28 jupiter cmclconfd[3997]: Initializing cluster lock device /dev/dsk/c7t0d0 for node jupiter.srz.lit.verwalt-berlin.de
Jun 26 11:01:29 jupiter cmclconfd[3997]: Unable to initialize cluster lock on /dev/dsk/c7t0d0, Volume Group /dev/vgdat1 is not activated
Jun 26 11:04:11 jupiter cmclconfd[4051]: Failed to release volume group /dev/vgdat3
Jun 26 11:04:14 jupiter cmclconfd[4051]: Failed to release volume group /dev/vgdat4
Jun 26 11:04:15 jupiter cmclconfd[4051]: Failed to release volume group /dev/vgbz
Jun 26 11:04:15 jupiter cmclconfd[4051]: Failed to release volume group /dev/vgzlb
Jun 26 11:04:51 jupiter cmclconfd[4055]: Initializing cluster lock device /dev/dsk/c7t0d0 for node jupiter.srz.lit.verwalt-berlin.de

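If I read the "Volume Group /dev/vgdat1 is not activated" message correctly, the lock VG apparently has to be active on the node where the new configuration is applied so that the lock structure can be written; I assume the sequence should roughly have been (the ASCII file name is just an example):

# vgchange -a y vgdat1
# cmapplyconf -C /etc/cmcluster/cluster.ascii
# vgchange -a n vgdat1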

As far as the lock disk is concerned, the cluster binary has this content:


# cmviewconf|grep -i -e lock -e node\ name
flags: 12 (single cluster lock)
first lock vg name: /dev/vgdat1
second lock vg name: (not configured)
Node name: jupiter
first lock pv name: /dev/dsk/c7t0d0
first lock disk interface type: fcparray
Node name: neptun
first lock pv name: /dev/dsk/c7t0d0
first lock disk interface type: fcparray
Node name: saturn
first lock pv name: /dev/dsk/c7t0d0
first lock disk interface type: fcparray
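
For cross-checking, the same information can be dumped from the running cluster's binary configuration into an ASCII file with cmgetconf, e.g. (cluster name and output file are placeholders):

# cmgetconf -c <clustername> /tmp/cluster.ascii
# grep -i lock /tmp/cluster.ascii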


Via ioinit reboots on one node I had already established a cluster-wide consistent instance numbering scheme, so that the instance numbers for driver fcparray (whose HW paths connect the cluster-shared PVs), and thus the controller numbers of the lock disk PVs as they appear in the cmviewconf output above, are the same on all nodes.
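
The mapping can be double-checked on each node with something along these lines:

# ioscan -fn | grep fcparray     # compare instance numbers across the nodes
# lssf /dev/dsk/c7t0d0           # show the HW path behind the lock PV's device file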

On node jupiter:

[root@jupiter:/root]
# pvdisplay /dev/dsk/c7t0d0|grep PV\ Name
PV Name /dev/dsk/c7t0d0
PV Name /dev/dsk/c10t0d0 Alternate Link


On node saturn:

[root@saturn:/root]
# vgchange -a r vgdat1 && pvdisplay /dev/dsk/c7t0d0 && vgchange -a n vgdat1
Activated volume group
Volume group "vgdat1" has been successfully changed.
--- Physical volumes ---
PV Name /dev/dsk/c7t0d0
PV Name /dev/dsk/c10t0d0 Alternate Link
VG Name /dev/vgdat1
PV Status available
Allocatable yes
VGDA 2
Cur LV 1
PE Size (Mbytes) 8
Total PE 880
Free PE 0
Allocated PE 880
Stale PE 0
IO Timeout (Seconds) default
Autoswitch On

Volume group "vgdat1" has been successfully changed.

On node neptun:

[root@neptun:/root]
# vgchange -a r vgdat1 && pvdisplay /dev/dsk/c7t0d0 && vgchange -a n vgdat1
Activated volume group
Volume group "vgdat1" has been successfully changed.
--- Physical volumes ---
PV Name /dev/dsk/c7t0d0
PV Name /dev/dsk/c10t0d0 Alternate Link
VG Name /dev/vgdat1
PV Status available
Allocatable yes
VGDA 2
Cur LV 1
PE Size (Mbytes) 8
Total PE 880
Free PE 0
Allocated PE 880
Stale PE 0
IO Timeout (Seconds) default
Autoswitch On

Volume group "vgdat1" has been successfully changed.


As you can see, the cluster lock PV should be accessible from all cluster nodes.

What went wrong?

Can I re-establish a lock in a running cluster, or will this require a cluster restart?

Rgds.
Ralph

Madness, thy name is system administration
8 REPLIES
Bernhard Mueller
Honored Contributor
Solution

Re: Howto reestablish cluster lock on running cluster?

Ralph,

you may try the attached binary, which is typically used to re-initialize a failed quorum disk after replacement while the cluster remains up and running.

Regards
Bernhard
Bernhard Mueller
Honored Contributor

Re: Howto reestablish cluster lock on running cluster?

the binary should be named cminitlock
Bernhard Mueller
Honored Contributor

Re: Howto reestablish cluster lock on running cluster?

usage: cminitlock [-v] [-t] vg_name pv_name
-t Test the cluster lock only.
-v Verbose output.

This command will initialize a cluster lock disk and then query the disk to validate that the disk was initialized successfully. If the -t option is specified, the cluster lock is only queried.

HTH
Regards,
Bernhard
Rita C Workman
Honored Contributor

Re: Howto reestablish cluster lock on running cluster?

Well, I'm not certain exactly what was done to your cluster and what happened, except for one thing, which your cluster has already told you: your cluster lock disk is not being seen properly.

There are a few things that can/should only be addressed "properly" with the cluster down:
changing timing parameters
changing the maximum number of configured packages
changing IPs
...and yes, the cluster lock


Options I'd recommend: fix your lock disk, or get rid of the lock disk and set up a Quorum Server (even though you only have a 3-node cluster, you can do this). The Quorum Server is easy to install; you can download it from: http://docs.hp.com/en/ha.html#Quorum%20Server
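
If I remember correctly, switching to a quorum server essentially means replacing the lock VG/PV entries in the cluster ASCII file with parameters along these lines (names from memory; the host name and value are placeholders) and authorizing the cluster nodes on the quorum server host:

QS_HOST                  qshost.example.com
QS_POLLING_INTERVAL      300000000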

Just my thoughts,
Rita

Ralph Grothe
Honored Contributor

Re: Howto reestablish cluster lock on running cluster?

Hello Bernhard,

many thanks for supplying me with the right tool.
Before I received your reply I had also filed a SW case at HP.
They also suggested the tool you mentioned.

So I executed this

[root@jupiter:/usr/local/sbin]
# ./cminitlock -v -t /dev/vgdat1 /dev/dsk/c7t0d0
Stating /dev/dsk/c7t0d0
Opening /dev/dsk/c7t0d0
-t flag specificed. Testing the cluster lock only.
Calling inquery lock IOCTL 3
Cluster lock inquiry request succeeded
Checking Cluster lock on /dev/dsk/c7t0d0
Calling query lock IOCTL 3
QUERY Cluster lock ioctl succeeded.
Cluster lock query operation failed, errno 2: No such file or directory
Cluster lock on disk /dev/dsk/c7t0d0 is missing!:No such file or directory
Cluster lock on /dev/dsk/c7t0d0 is not initialized.


[root@jupiter:/usr/local/sbin]
# ./cminitlock -v /dev/vgdat1 /dev/dsk/c7t0d0
Stating /dev/dsk/c7t0d0
Opening /dev/dsk/c7t0d0
Initializing the cluster lock /dev/dsk/c7t0d0
Calling LVM_ASYNC_CLUSTER_LOCK
Checking Cluster lock on /dev/dsk/c7t0d0
Calling query lock IOCTL 3
QUERY Cluster lock ioctl succeeded.
Lock is not Owned.
Cluster lock is initialized.


And finally backed up lvmconf


[root@jupiter:/usr/local/sbin]
# vgcfgbackup /dev/vgdat1
Volume Group configuration for /dev/vgdat1 has been saved in /etc/lvmconf/vgdat1.conf

[root@jupiter:/usr/local/sbin]
# remsh saturn 'PATH=/usr/sbin; vgchange -a r vgdat1 && vgcfgbackup vgdat1 && vgchange -a n vgdat1'
Activated volume group
Volume group "vgdat1" has been successfully changed.
Volume Group configuration for /dev/vgdat1 has been saved in /etc/lvmconf/vgdat1.conf
Volume group "vgdat1" has been successfully changed.

[root@jupiter:/usr/local/sbin]
# remsh neptun 'PATH=/usr/sbin; vgchange -a r vgdat1 && vgcfgbackup vgdat1 && vgchange -a n vgdat1'
Activated volume group
Volume group "vgdat1" has been successfully changed.
Volume Group configuration for /dev/vgdat1 has been saved in /etc/lvmconf/vgdat1.conf
Volume group "vgdat1" has been successfully changed.


I'm not sure whether this has already done the trick.
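
To verify, I will probably re-run the tool in test mode and keep an eye on syslog for the lock warning, i.e. something like:

# ./cminitlock -v -t /dev/vgdat1 /dev/dsk/c7t0d0
# grep cmcld /var/adm/syslog/syslog.log | tail -3
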
Madness, thy name is system administration
Ralph Grothe
Honored Contributor

Re: Howto reestablish cluster lock on running cluster?

Hi Rita,

thank you for your suggestions.

About the quorum server, I'm not yet clear whether it isn't somewhat contradictory.
To me it makes little sense to use a quorum server as long as it isn't itself highly available, which would call for yet another cluster or some kind of replication just to be able to provide a tie breaker whenever one is required.
This sounds a bit like overkill unless you already have another production cluster that could take on this task.
But probably I have totally missed the notion of cluster stalemate tie-breaking and quorum servers.
Madness, thy name is system administration
Ralph Grothe
Honored Contributor

Re: Howto reestablish cluster lock on running cluster?

I forgot: Rita, you also mentioned I should settle the number of packages.
I have to admit that I left this value at 10 while the cluster currently only hosts 3 packages.
I thought that way I'd be prepared to add further packages online should the need ever arise.
On the other hand, I'm now considering the downside this may have in wasted resources.
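
For reference, the value in question is the MAX_CONFIGURED_PACKAGES parameter in the cluster configuration; it can be checked against the running binary with something like the following (cluster name and output file are placeholders):

# cmgetconf -c <clustername> /tmp/cluster.ascii
# grep MAX_CONFIGURED_PACKAGES /tmp/cluster.ascii
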
Madness, thy name is system administration
Bernhard Mueller
Honored Contributor

Re: Howto reestablish cluster lock on running cluster?

Ralph,

to me this looks like your lock disk issue is fixed. If it is not, you would get recurring messages in the syslog file (I think at least every 6 hours or so).

Leave your # of packages at ten; this does not waste resources and, as you say, you may add packages on the fly.

In most cases it is also considered safer to have a cluster lock disk. There are configurations which make a quorum server more reasonable but you need to take a very close look at network config and failure scenarios.

Regards
Bernhard