Darren Gibbs
Advisor

Node crashed

Our primary node lost connection with the secondary node in our two-node cluster because the secondary node lost a CPU, which caused that system to hang. When the primary node attempted to get the cluster lock, it failed with the messages below:

Mar 23 01:39:22 parrot cmcld: Timed out node quail. It may have failed.
Mar 23 01:39:22 parrot cmcld: Attempting to adjust cluster membership
Mar 23 01:39:39 parrot cmcld: Obtaining Cluster Lock
Mar 23 01:39:40 parrot cmcld: WARNING: Cluster lock on disk /dev/dsk/c13t3d3 is missing!
Mar 23 01:39:40 parrot cmcld: Until it is fixed, a single failure could
Mar 23 01:39:40 parrot cmcld: cause all nodes in the cluster to crash
Mar 23 01:39:40 parrot cmcld: Failed to obtain Cluster Lock: I/O error

After 2 1/2 minutes of attempting to obtain a cluster lock, the primary node crashed. What caused this?

There are two things to know about this situation. First, the WARNING message about the cluster lock has been appearing for some time but has never caused a crash. I've got the cminitlock utility but haven't had a chance to test it yet.

Second, the VG that holds the assigned cluster lock disk was not marked as MCSG cluster aware (i.e. with vgchange -c y vgxx) and was activated in read-only mode instead of exclusive mode.
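For reference, a minimal sketch of how the lock VG would normally be set up (vgxx is a placeholder name; adjust to your configuration):

vgchange -a n /dev/vgxx   # deactivate the VG first
vgchange -c y /dev/vgxx   # mark it cluster (MCSG) aware; the cluster must be running
vgchange -a e /dev/vgxx   # activate it in exclusive mode on the owning node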

I'm wondering which of these two issues caused the crash?

Our cluster consists of two N-class servers sharing disks from an XP256 via Fibre Channel switches.
5 REPLIES
Christopher McCray_1
Honored Contributor

Re: Node crashed

The first thing I would do is run cmscancl and review the output as it pertains to the cluster lock disk. The message is saying that the cluster lock disk MCSG expects to find isn't there. By chance, was that disk once the cluster lock disk and then moved away? You will probably end up redefining the cluster lock VG/disk and recompiling the cluster binary file (the cluster must be down), either by editing the cluster ASCII file or by running cmquerycl again. Remember to save the previous cluster ASCII file before doing this.
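A minimal sketch of that procedure, assuming the config lives in /etc/cmcluster and using the node names from the log above (paths and exact options may differ on your release):

cmscancl -o /tmp/scancl.out                       # collect node/disk data; check the lock disk entries
cd /etc/cmcluster
cp cluster.ascii cluster.ascii.bak                # save the previous ASCII file
cmquerycl -v -C cluster.ascii -n parrot -n quail  # regenerate; then set FIRST_CLUSTER_LOCK_VG/PV
cmcheckconf -v -C cluster.ascii                   # validate the edited config
cmapplyconf -v -C cluster.ascii                   # recompile the binary file (cluster must be down)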

Good luck

Chris
It wasn't me!!!!
Mark van Hassel
Respected Contributor

Re: Node crashed

Darren,

You have applied the MCSG config with a cluster lock device: you specified FIRST_CLUSTER_LOCK_VG and FIRST_CLUSTER_LOCK_PV in the cluster ASCII file, so the volume group needs to be cluster aware.
The VG needs to be cluster aware but does not need to be activated. However, when it is activated read-only, I can imagine that the cmcld daemon cannot write to the lock disk, and the node therefore cannot obtain the cluster lock.

To make the vg cluster aware, do the following (with the cluster up):

vgchange -a n vgname   # deactivate the VG first
vgchange -c y vgname   # set the cluster-aware flag

Alternatively, you can add the vg to the cluster ASCII file and re-apply the cluster config.
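A minimal sketch of that alternative route, assuming the ASCII file is /etc/cmcluster/cluster.ascii and using mycluster as a placeholder cluster name:

cmgetconf -v -c mycluster /etc/cmcluster/cluster.ascii   # dump the current config
vi /etc/cmcluster/cluster.ascii                          # add the VOLUME_GROUP entry for the vg
cmcheckconf -v -C /etc/cmcluster/cluster.ascii           # validate
cmapplyconf -v -C /etc/cmcluster/cluster.ascii           # re-apply the binary config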

HtH,

Mark
The surest sign that life exists elsewhere in the universe is that none of it has tried to contact us
Sanjay_6
Honored Contributor
Solution

Re: Node crashed

Hi Darren,

The cluster lock vg has to be an SG-aware volume group, and it should be accessible from both nodes. Since your lock vg was not SG aware, the surviving node was unable to get hold of the lock disk when the other node failed.

Hope this helps.

Regds
Darren Gibbs
Advisor

Re: Node crashed

I had stated that the VG that houses the cluster lock disk was activated in read-only mode instead of exclusive mode on the primary node. I should have stated that the VG was activated using vgchange -a y vgxx instead of vgchange -a e vgxx.
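For clarity, a minimal sketch of the difference (vgxx is a placeholder VG name):

vgchange -a y /dev/vgxx   # plain read-write activation (what actually ran)
vgchange -a e /dev/vgxx   # exclusive activation (requires the VG to be cluster aware via vgchange -c y)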

My theory is that the true failure was caused by the VG housing the cluster lock disk not being MCSG aware at the time of the failure. Could this be?

The reason for it not being MCSG aware was a change mistakenly made by someone else on the team the previous weekend.
Sanjay_6
Honored Contributor

Re: Node crashed

Hi Darren,

You are correct. Since the VG housing the cluster lock disk was made cluster unaware, when the node where this vg was activated with "vgchange -a y /dev/vg_name" went down, the other node was unable to activate the vg and could not get hold of the cluster lock disk. The cluster lock vg should never be made cluster unaware.
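A minimal sketch of restoring and verifying the lock VG afterwards (vgxx is a placeholder; output details vary by HP-UX release):

vgchange -a n /dev/vgxx   # deactivate before changing the flag
vgchange -c y /dev/vgxx   # restore the cluster-aware flag (cluster must be up)
vgchange -a e /dev/vgxx   # re-activate exclusively on the owning node
vgdisplay /dev/vgxx       # confirm the VG status
cmviewcl -v               # confirm cluster and node status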

Hope this helps.

Regds