Operating System - HP-UX

Cluster lock was denied. Lock was obtained by another node

 
Farid Hasanov
Occasional Contributor

Cluster lock was denied. Lock was obtained by another node

Hello,
I have a two-node cluster with EVA shared storage. The cluster and package are configured for failover.
When I halt node1, the package moves to node2 and runs there without any problems.
But: if I physically disconnect all network cables from node1 (I have 4 LAN cards on each node), the package tries to start on node2, and after a couple of minutes node2 restarts. Syslog on node2 gives this error:


Jan 19 11:32:42 db2 cmcld[13896]: Timed out node db1. It may have failed.
Jan 19 11:32:42 db2 cmcld[13896]: Attempting to form a new cluster
Jan 19 11:32:42 db2 cmcld[13896]: Beginning standard election
Jan 19 11:32:43 db2 cmfileassistd[14084]: Updated file /var/adm/cmcluster/frdump.cmcld.7 (length = 251602).
Jan 19 11:32:49 db2 cmcld[13896]: Obtaining Cluster Lock
Jan 19 11:32:50 db2 cmcld[13896]: Cluster lock was denied. Lock was obtained by another node.
Jan 19 11:32:50 db2 cmcld[13896]: Attempting to form a new cluster
Jan 19 11:32:50 db2 cmcld[13896]: Beginning standard election
Jan 19 11:32:58 db2 cmcld[13896]: Cluster lock has been denied
Jan 19 11:33:43 db2 cmcld[13896]: Service cmfileassistd terminated due to an exit(0)


After I disconnect the network cables on node1, node1 can still access the shared volumes. This may be why the package cannot start on node2.
Is there any way to configure the cluster to run the halt script when all network interfaces are down?

Has anyone run into this problem?

Thanks in advance,
Farid.
4 REPLIES

Re: Cluster lock was denied. Lock was obtained by another node

Farid,

That's not a problem - that's expected behaviour in a 2-node cluster when one node can't communicate with the other. You might 'know' that node 2 still has LAN connections and could run the package, and that node 1 has none and can't - but with no LAN interconnect between the 2 nodes, neither can know the state of the other, so both have to fall back on some tie-break mechanism. In this case that's a race for the cluster lock, which unfortunately node 2 lost. To do anything else would risk the integrity of your data by causing a split-brain scenario: both nodes think they are the surviving cluster, both access the data on disk concurrently, and the data gets corrupted.
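To make that race concrete, here's a toy model in Python - the shared lock disk is just an in-memory test-and-set flag here, nothing Serviceguard-specific, and only the node names are taken from your syslog:

```python
import threading

class ClusterLockDisk:
    """Toy stand-in for the shared lock disk: the first claimant wins."""
    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_acquire(self, node):
        # Atomic test-and-set, like the reserved area on the real lock disk.
        with self._lock:
            if self.owner is None:
                self.owner = node
                return True
            return False

def reform_cluster(node, lock_disk, results):
    # Each isolated node races for the lock; the loser must halt itself
    # (a TOC/restart in real Serviceguard) to avoid split-brain.
    if lock_disk.try_acquire(node):
        results[node] = "forms the new cluster"
    else:
        results[node] = "lock denied, node restarts"  # what db2 logged

lock_disk = ClusterLockDisk()
results = {}
threads = [threading.Thread(target=reform_cluster, args=(n, lock_disk, results))
           for n in ("db1", "db2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # exactly one node wins, regardless of which one "deserved" to
```

The point of the toy: the lock itself has no idea which node still has working LANs - it only guarantees that exactly one node survives.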

Serviceguard protects against single points of failure - removing all the LAN connections is multiple points of failure, so you can't assume the cluster will recover from this automatically. In these situations Serviceguard considers protecting the integrity of your data to be more important than keeping the package up.

One alternative to get different behaviour in this scenario is to use a Serviceguard quorum server rather than a cluster lock disk. If you had used a quorum server then in this scenario I think node 2 would have remained running the cluster - of course quorum servers place more emphasis on the resilience of your network, so consider carefully.
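For reference, the tie-breaker choice lives in the cluster ASCII configuration file. A minimal excerpt - the parameter names are from the standard cmquerycl template, but the device paths and hostname below are made-up examples, so check them against the template your release generates:

```text
# Tie-breaker option 1: cluster lock disk (what you have today)
FIRST_CLUSTER_LOCK_VG     /dev/vglock

NODE_NAME                 db1
  FIRST_CLUSTER_LOCK_PV   /dev/dsk/c4t0d0    # example device path

# Tie-breaker option 2: quorum server (instead of the lock disk)
# QS_HOST                 qshost.example.com
# QS_POLLING_INTERVAL     300000000          # microseconds
# QS_TIMEOUT_EXTENSION    2000000
```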

Read this paper here to get a full understanding of this cluster behaviour:

http://docs.hp.com/en/B3936-90078/B3936-90078.pdf

HTH

Duncan

I am an HPE Employee
Farid Hasanov
Occasional Contributor

Re: Cluster lock was denied. Lock was obtained by another node

Thanks for the explanations, Duncan.

But if node1 has no network connection, is there any way to configure node1 to halt itself in this case?

Let's say, halt node1 when all interfaces, or all heartbeat interfaces, are down.
Stephen Doud
Honored Contributor

Re: Cluster lock was denied. Lock was obtained by another node

Serviceguard cannot determine where a network failure has occurred outside of the server itself, so it cannot determine which node 'should' be the first server to reach the cluster lock disk, win the race and reform the cluster... hence, in your case, the server with ALL its network cables disconnected was the first to reach the cluster lock disk.
If a quorum server were used, a complete network failure on one node would also cut that node off from the quorum server, guaranteeing that node would be ejected from the cluster.
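As an aside, Serviceguard has no supported knob for 'halt the node when all interfaces are down', but something outside the cluster software could approximate it. A rough sketch, meant to be run periodically from cron - the netstat parsing and the decision to call cmhaltnode are assumptions you would want to validate carefully before relying on anything like this:

```python
#!/usr/bin/env python
# Hypothetical watchdog sketch - NOT a built-in Serviceguard feature.
# Idea: if no LAN interface is listed at all, halt this node so the
# surviving node wins the cluster-lock race cleanly.
import re
import subprocess

def count_up_lans(netstat_output):
    """Count lanN interfaces in `netstat -in`-style output.

    Assumption: this HP-UX release lists one interface per line with
    names starting with 'lan'; adjust the pattern for your box.
    """
    return len(re.findall(r"^lan\d+", netstat_output, re.MULTILINE))

def halt_if_isolated():
    out = subprocess.run(["netstat", "-in"],
                         capture_output=True, text=True).stdout
    if count_up_lans(out) == 0:
        # cmhaltnode is the standard Serviceguard node-halt command.
        subprocess.run(["/usr/sbin/cmhaltnode", "-f", "-v"])
```

If your release supports it, an EMS network resource would be a cleaner trigger than cron, but the idea is the same: take the isolated node out before it can win the lock race.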

Re: Cluster lock was denied. Lock was obtained by another node

If you and I sit in separate soundproofed rooms communicating via an intercom, and then the intercom stops working - how do I determine that it's my intercom that's failed and not yours? In the absence of a third party who also talks to both of us (like a quorum server), there's no way of knowing.

HTH

Duncan

I am an HPE Employee