System Administration

quorum lost in cman, blocking activity

AK6
Occasional Visitor

quorum lost in cman, blocking activity

Hi,

I have a cluster with two nodes. If one node goes down, the service starts on the second node, but then it suddenly stops with the error below:

rgmanager - Quorum Dissolved.

I am attaching the log messages. Please help.

 

Nov 16 11:20:24 db02 rgmanager[13899]: [mysql] Starting Service mysql:MySQL_IVRAF
Nov 16 11:20:26 db02 rgmanager[3195]: Service service:IVRAFDB started
Nov 16 11:20:46 db02 corosync[2514]: [CMAN ] quorum lost, blocking activity
Nov 16 11:20:46 db02 corosync[2514]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 16 11:20:46 db02 corosync[2514]: [QUORUM] Members[1]: 2
Nov 16 11:20:46 db02 corosync[2514]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 16 11:20:46 db02 rgmanager[3195]: #1: Quorum Dissolved
Nov 16 11:20:46 db02 kernel: dlm: closing connection to node 1
Nov 16 11:20:46 db02 corosync[2514]: [CPG ] chosen downlist: sender r(0) ip(10.89.7.217) ; members(old:2 left:1)
Nov 16 11:20:46 db02 corosync[2514]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 16 11:20:46 db02 rgmanager[14147]: [mysql] Stopping Service mysql:MySQL_IVRAF
Nov 16 11:20:49 db02 rgmanager[14182]: [mysql] Stopping Service mysql:MySQL_IVRAF > Succeed
Nov 16 11:20:49 db02 rgmanager[14233]: [fs] unmounting /opt/MariaDB

1 REPLY
Matti_Kurkela
Honored Contributor

Re: quorum lost in cman, blocking activity

Looks like some version of Red Hat Cluster, probably RHEL 6 or older?

Your cluster is losing multicast connectivity between the cluster nodes. The most common cause is an incomplete multicast configuration in the network.

Have you verified that multicast communication between the cluster nodes works for more than 200 seconds at a time?

If not, verify that IGMP queries are being received at each cluster node (typically one query per minute) and that the nodes are successfully responding to them with IGMP reports, like this:

# tcpdump -i eth0 -s0 -Kpnvvv igmp
12:47:28.868454 IP (tos 0xc0, ttl 1, id 13997, offset 0, flags [none], proto IGMP (2), length 32, options (RA))
    192.168.24.1 > 224.0.0.1: igmp query v2
12:47:35.654118 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    192.168.24.76 > 239.255.255.250: igmp v2 report 239.255.255.250
12:47:36.800857 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    192.168.24.76 > 224.0.0.251: igmp v2 report 224.0.0.251
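It can also help to first confirm which multicast group cman/corosync is actually using, and that the kernel has joined it on the cluster interface. The interface name eth0 is just an assumption here; use your actual cluster interconnect interface:

# cman_tool status | grep -i multicast
# ip maddr show dev eth0

cman_tool should report the cluster's multicast address (by default cman picks one in the 239.192.0.0/16 range), and "ip maddr" should list that same group among the ones joined on the interface.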

Multicasting is a sometimes poorly understood, optional part of IPv4 networking. (In IPv6, multicast is mandatory.)

When a cluster starts up, or when a node receives an IGMP query packet, each cluster node sends an IGMP report to tell any IGMP-snooping switches and multicast routers that it wants to receive the multicast traffic of the cluster infrastructure. If there is no source of IGMP query packets in the network segment, that initial report will be the only one ever sent. An IGMP report typically expires in 180 seconds, which is why a multicast router will typically send an IGMP query every 60 seconds or so.

If a switch with an IGMP-snooping feature does not receive IGMP reports from a particular port, it will assume that the host connected to that port is not interested in any multicast traffic, and stop passing multicasts to it. If your cluster nodes are connected to two different switches, this means the delivery of multicasts between the switches will stop once the initial IGMP reports expire (in practice, about 188 seconds after the start of multicast traffic). This could cause exactly the behavior you're seeing.

If IGMP-snooping is enabled on the switches, you will also need a source of IGMP queries in that network segment to keep the switches aware of multicast traffic destinations. It does not have to be a real multicast router; some switches have an "IGMP querier" feature that can be used when multicasts are needed in a particular subnet, but real multicast routing between subnets is not required.

There is a command named "omping" that can be used to test multicast connectivity between hosts. It is available in the standard RHEL repositories from RHEL 6.1 onwards, and in the EPEL repository for RHEL 5.x.
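For example, running the same omping command on both nodes at roughly the same time tests multicast and unicast delivery in both directions (the hostnames are placeholders for your actual node names):

# omping db01 db02

Let it run for several minutes: if the multicast responses stop after roughly three minutes while unicast responses keep arriving, that points directly at the IGMP report/snooping expiry problem described above.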

If you find that your network stops passing multicast traffic about 180-200 seconds after the traffic has successfully begun, here is a Cisco document you can show your network admins:

https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/68131-cat-multicast-prob.html

The document is originally written for Cisco Catalyst 6500, but the same problem can occur with other hardware too.

Alternatively, if your cluster has four nodes or fewer and is not using GFS, you can configure it to use UDP unicast traffic instead of multicasts. In a two-node cluster the amount of traffic required is about the same, but with more than two nodes, switching to UDP unicast will increase the amount of cluster coordination/heartbeat traffic (everything must be sent separately to each cluster node, instead of just once by multicast). If you are using the "luci" web GUI, select your cluster, go to Configure -> Network, change "Network Transport Type" to "UDP Unicast (UDPU)" and restart your cluster.
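If you prefer editing the configuration by hand instead of using luci, the same change is a "transport" attribute on the cman element in /etc/cluster/cluster.conf. This is only a sketch: the cluster and node names are placeholders for yours, and you must increment config_version whenever you edit the file:

<?xml version="1.0"?>
<cluster name="ivrafdb" config_version="2">
  <!-- switch cluster traffic from multicast to UDP unicast -->
  <cman transport="udpu" two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="db01" nodeid="1"/>
    <clusternode name="db02" nodeid="2"/>
  </clusternodes>
</cluster>

After editing, check the file with "ccs_config_validate" and restart the cluster services on both nodes.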

MK