Operating System - HP-UX

SG unable to lock VG both nodes TOC

 
SOLVED
Juan M Leon
Trusted Contributor

SG unable to lock VG both nodes TOC

We have a cluster with 2 nodes running one package. We have one vglock drive shared between both nodes; the drive has 3 alternate links (paths) on each server. Both servers can see the VG and it can be activated on both nodes. Everything runs fine and we are able to fail over manually; we have no problems at that point.
When we unplug the network cables on the running node, that node crashes. The second node then tries to enable the package, fails, and crashes as well. After both nodes are back up and running, syslog reports "Unable to stat device /dev/dsk/c#t#d#" (device missing).
The shutdownlog file reports "Reboot after panic: SafetyTimer Expired, isr.ior".
We verified the hardware path with ioscan, and the path is active and CLAIMED.
However, we are unable to run "vgdisplay vglock" or "vgchange -a y" on it.

In the cluster configuration file we have "FIRST_CLUSTER_LOCK_PV /dev/dsk/c#t#d#", which is the primary path for the vglock. Is there a way to configure the secondary path (alternate link) within the cluster configuration file?

Any suggestions?

Thank you
17 Replies
Enrico P.
Honored Contributor

Re: SG unable to lock VG both nodes TOC

Hi,
you can add the following lines to the cluster configuration file:

SECOND_CLUSTER_LOCK_VG /dev/vglock

and

SECOND_CLUSTER_LOCK_PV <alternate link device file> (one for each node)

Verify with the cmcheckconf command, and then run cmapplyconf.
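
A rough sketch of how that part of the cluster ASCII file might look (the device files, node names, and the file path are placeholders; use the paths from your own ioscan output):

FIRST_CLUSTER_LOCK_VG   /dev/vglock
SECOND_CLUSTER_LOCK_VG  /dev/vglock

NODE_NAME node-a
FIRST_CLUSTER_LOCK_PV   /dev/dsk/c10t0d4    # primary link to the lock disk
SECOND_CLUSTER_LOCK_PV  /dev/dsk/c16t0d4    # alternate link to the same disk

NODE_NAME node-b
FIRST_CLUSTER_LOCK_PV   /dev/dsk/c14t0d4
SECOND_CLUSTER_LOCK_PV  /dev/dsk/c18t0d4

cmcheckconf -v -C /etc/cmcluster/cluster.ascii
cmapplyconf -v -C /etc/cmcluster/cluster.ascii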

Enrico
Juan M Leon
Trusted Contributor

Re: SG unable to lock VG both nodes TOC

Enrico, should the SECOND_CLUSTER_LOCK_VG have the same name as the FIRST_CLUSTER_LOCK_VG, with only the LOCK_PV changing?
Do you have any idea why I am unable to run "vgdisplay" or "vgchange" on vglock, and why syslog reports that it is unable to stat c#t#d#?

Thank you
RAC_1
Honored Contributor

Re: SG unable to lock VG both nodes TOC

Can you see the lock disk from both nodes?
You can activate the VG under MC/SG in exclusive access mode.

vgchange -a e vgxx

Anil
There is no substitute to HARDWORK
Mel Burslan
Honored Contributor
Solution

Re: SG unable to lock VG both nodes TOC

I am not sure, but one thing I noticed is that you mentioned vgchange -a y. Are you sure your lock disk is still cluster-aware? Normally cluster VGs get activated with vgchange -a e, not -a y.

On the other hand, a "cannot stat disk_device" message in syslog is not good. It can usually be attributed to a bad disk (obviously not in your case) or a bad disk controller. After the reboot, can you see all three paths to your disk using ioscan?
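
If it helps, a quick check after the reboot might look something like this (the device names are placeholders; substitute your own lock-disk paths):

ioscan -fnC disk                      # all three paths should show CLAIMED
diskinfo /dev/rdsk/c#t#d#             # run once per path to the lock disk
strings /etc/lvmtab | grep -i vglock  # the device files LVM has recorded for the VG
vgdisplay -v /dev/vglock              # VG and PV status (needs the VG to be active)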
________________________________
UNIX because I majored in cryptology...
Juan M Leon
Trusted Contributor

Re: SG unable to lock VG both nodes TOC

RAC/Mel: Before the systems crash I can see and activate the VG on both nodes. After the servers TOC, I am unable to view the VG. I didn't try vgchange -a e vgXX; I forgot that activation has to be in exclusive mode.

- To be able to see the drive on both nodes, should I activate it in shared mode, e.g. vgchange -a s /dev/vgXX?
- Do you have any suggestions for forcing the cluster to look at the second path if the primary becomes unavailable?

Thank you
A. Clay Stephenson
Acclaimed Contributor

Re: SG unable to lock VG both nodes TOC

The one thing that really jumps out at me is "unplugging the net cables on the running node". This is a multiple point of failure and MC/SG is not designed to handle this scenario. Your network should be so robust and redundant that the failure of any one NIC, switch, cable, etc. should not cause a loss of heartbeat. It shouldn't even trigger a package failover but merely a network switchover. What could be expected is the total loss of any one node and that is simulated not by yanking multiple network cables but by yanking the power cords. By having two nodes powered up and yet no heartbeat, you are really in a situation that MC/SG is not designed to handle because you have MPOF's.
If it ain't broke, I can fix that.
RAC_1
Honored Contributor

Re: SG unable to lock VG both nodes TOC

Putting FIRST_CLUSTER_LOCK_PV (primary path) and SECOND_CLUSTER_LOCK_PV (alternate path) in the configuration should help.

About cluster-aware VGs: you first need to do vgchange -c y vgxx and then start the cluster.

Now check what happens.
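
Roughly the order being suggested, assuming the lock VG is /dev/vglock (note that on some ServiceGuard versions the cluster must already be configured or running before -c y is accepted, so check the documentation for your release):

vgchange -c y /dev/vglock      # mark the VG as cluster-aware
cmruncl -v                     # start the cluster on all nodes
vgchange -a e /dev/vglock      # activate in exclusive mode on the node that owns it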

Anil
There is no substitute to HARDWORK
Enrico P.
Honored Contributor

Re: SG unable to lock VG both nodes TOC

Hi,
if you have other shared VGs, you can use one of them for the second VG lock and second PV lock.

Do you still have the problem activating the VG?

Enrico
melvyn burnard
Honored Contributor

Re: SG unable to lock VG both nodes TOC

First question to ask: what disk are you using for the cluster lock?
Second question: was the cluster lock VG activated on the node where you ran cmapplyconf at the time you applied the configuration?
Third, are you seeing any messages in syslog that might indicate the cluster lock disc is unavailable?
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Devender Khatana
Honored Contributor

Re: SG unable to lock VG both nodes TOC

Hi,

Alternate links are of no use for the cluster lock. The cluster looks for only one device file, and when it tries to obtain the lock it will see only that device file, which comes through only one link (you can configure either one). That is why syslog reports that it cannot see that c#t#d#.

What you want can be achieved with a second PV lock defined for a device file that reaches the same device through a different path.

But the worry here should be why, after removing the cable from one host, even the second host cannot see that disk. What type of link is this, Fibre or SCSI? Try to eliminate the lock-disk visibility problem on the individual hosts without the cluster running; once that is done, there should be no problem for the package to switch to the second node when you remove the cable from the first.

Which cable are you removing exactly?
How are the standby LANs configured? Normally, if you remove a LAN cable, it should switch to the standby LAN.
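
A few commands run on each host might help answer these questions (the device file below is a placeholder):

lanscan                        # list the LAN interfaces and their hardware state
cmviewcl -v                    # shows which LANs ServiceGuard treats as primary and standby
ioscan -fnC disk               # confirm the lock-disk paths are CLAIMED on this host
diskinfo /dev/rdsk/c#t#d#      # check each path to the lock disk directly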

HTH,
Devender
Impossible itself mentions "I m possible"
Juan M Leon
Trusted Contributor

Re: SG unable to lock VG both nodes TOC

Regarding this issue: using a second lock PV didn't work.
It was suggested to remove the FIRST_CLUSTER_LOCK_PV and let the configuration use only the VG. Has anyone completed this task? Can you provide an example of the MC/SG conf file?

Thanks
Enrico P.
Honored Contributor

Re: SG unable to lock VG both nodes TOC

Hi,
it works in my cluster.

MC/SG Version 11.15

Enrico
Juan M Leon
Trusted Contributor

Re: SG unable to lock VG both nodes TOC

Khatana: To answer your question, we are removing all the cables to simulate a server crash. We are assuming that the other node will not be able to talk to the simulated crashed server and will take over the vglock disk.
Enrico, I saw your config file. We will try your suggestion. Are you defining 2 paths on one node and 2 paths on the other node? Do each node's paths point to different controllers? For example:
Node-a
Path1 -> CardA -> controller A
Path2 -> CardA -> controller B
Path3 -> CardB -> controller A
Path4 -> CardB -> controller B

Node-b
Path1 -> CardA -> controller A
Path2 -> CardA -> controller B
Path3 -> CardB -> controller A
Path4 -> CardB -> controller B

Your example.
VOLUME_GROUP /dev/vgoracle
VOLUME_GROUP /dev/vgovo

Node-a
FIRST_CLUSTER_LOCK_PV /dev/dsk/c10t0d4
SECOND_CLUSTER_LOCK_PV /dev/dsk/c16t0d5
Node-b
FIRST_CLUSTER_LOCK_PV /dev/dsk/c14t0d4
SECOND_CLUSTER_LOCK_PV /dev/dsk/c10t0d5
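
Once the file is edited, I assume the check, apply, and verify sequence is the same as before (the file name below is just a placeholder for our cluster ASCII file):

cmcheckconf -v -C /etc/cmcluster/cluster.ascii   # validate the edited configuration
cmapplyconf -v -C /etc/cmcluster/cluster.ascii   # distribute the binary configuration to both nodes
cmviewcl -v                                      # confirm the cluster and both nodes come up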

Thanks all for your input and help.
A. Clay Stephenson
Acclaimed Contributor

Re: SG unable to lock VG both nodes TOC

As I tried to tell you earlier, your testing procedure is fundamentally flawed. You can't possibly disconnect all the cables simultaneously to simulate a node failure, and depending upon the SCSI architecture you could have multiple open SCSI buses. MC/SG is designed to handle single points of failure (SPOF's), and you are throwing multiple points of failure (MPOF's) at it. You should remove ONE (network or SCSI) cable and test the results. Then replace the cable, remove another cable, and test, until all of the cables have been tested. You then simulate the failure of an entire node by yanking the power cable(s) from a node, and then repeat by similarly downing the other node. If you are worried about downing a node in so crude a fashion, then you haven't made your systems robust enough.
If it ain't broke, I can fix that.
Juan M Leon
Trusted Contributor

Re: SG unable to lock VG both nodes TOC

Clay: I understand your point, but even with robust systems there is a chance that a server will crash. It could be a CPU failure or a memory problem, and the server will crash. Is that correct? I am thinking that the other MC/SG server should be able to acquire the VGLOCK after finding that the primary server is unreachable. Is that correct, or is my theory wrong?

Thank you for your help.

Juan M Leon
Trusted Contributor

Re: SG unable to lock VG both nodes TOC

Clay: we are disconnecting only the network cables; the fibre and SCSI cables remain connected. Assuming that our test crashes one server, I would think the second server should remain up; however, it crashes as well.

Any ideas?

Thank you
A. Clay Stephenson
Acclaimed Contributor

Re: SG unable to lock VG both nodes TOC

No, you are not playing by the rules. You have a cluster with MPOF's. By pulling the network cables, there are now no heartbeats BUT both nodes can still access the lock disk. This is not the proper way to simulate a node failure. Your scenario should never be possible in a well-configured cluster because all the heartbeat networks should never fail simultaneously. If that can happen, you haven't configured a robust enough cluster. At most, one heartbeat network should fail --- and this is all MC/SG is designed to handle. Things like CPU failure (in a multi-cpu machine) are considered routine and may or may not trigger a package switch. To properly simulate the node failure, you need to yank the power cord. Now, the heartbeats are lost BUT only one system can access the lock disk and the cluster should be able to cope with this scenario.

One of the most interesting aspects of MC/SG is that by the time you get the systems robust enough for MC/SG to function properly, package switches very seldom occur. Things like disk failures, NIC failures, network switch failures, etc. are taken care of before MC/SG itself ever comes into action. In fact, I have never had a non user-initiated package failure in nearly 6 years of continuous MC/SG use.
If it ain't broke, I can fix that.