
hp-health and Centos 6.2 Cluster

 
stuv
New Member

hp-health and Centos 6.2 Cluster

Hi

 

 

We have set up a CentOS cluster based on 6.4, using luci and two ricci nodes.

We set up ricci on the nodes, joined them to the cluster via luci, and rebooted.

 

Service groups have been set up, and failover works perfectly.

 

When we install the hp-health RPM on both nodes, it appears that communication between the nodes is affected, and the second node in the cluster cannot get the updated configuration. This breaks the cluster.

 

 


Apr 15 17:56:35 ted corosync[2556]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Apr 15 17:56:35 ted corosync[2556]:   [CMAN  ] Can't get updated config version 42: New configuration version has to be newer than current running configuration#012.
Apr 15 17:56:35 ted corosync[2556]:   [CMAN  ] Activity suspended on this node
Apr 15 17:56:35 ted corosync[2556]:   [CMAN  ] Error reloading the configuration, will retry every second
Apr 15 17:56:35 ted corosync[2556]:   [CMAN  ] Node 1 conflict, remote config version id=42, local=41

 

 

Are there any ideas on this matter?

 

Kind of ironic that hp-health is making my cluster sick....

Matti_Kurkela
Honored Contributor

Re: hp-health and Centos 6.2 Cluster

Given that none of the utilities within the hp-health RPM is even capable of network communication, your theory seems very strange indeed. I would think it more likely that there was some completely separate problem in your cluster which became apparent after you rebooted.

 

Please run 

cman_tool status | grep "Config Version:"

on both cluster nodes, to see what each node assumes the current cluster configuration version number to be. Do the versions match the version listed in the beginning of the /etc/cluster/cluster.conf file? Are the version numbers the same on all the nodes?
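
For example, to see the version recorded in the file itself (assuming the default config location), you can run

grep config_version /etc/cluster/cluster.conf

which prints the opening <cluster ...> tag containing the config_version attribute.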

 

I once saw a case where ricci initially failed to perform some task and left the task file in its queue directory, /var/lib/ricci/queue. When the system was later rebooted, ricci tried to run the task again, causing the cluster configuration to go out of sync.

 

If the cluster configuration version number is out of sync, you should first stop the ricci agents on both nodes, remove any existing ricci job files from /var/lib/ricci/queue, then restart the ricci agents.
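
On CentOS 6 that would look roughly like the following, run on each node (just a sketch; review the queue contents before removing anything):

service ricci stop
ls -A /var/lib/ricci/queue     # see what is still queued
rm -rf /var/lib/ricci/queue/*
service ricci start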

 

After that, if the nodes have different versions of /etc/cluster/cluster.conf, compare them to find the difference. Pick the one that seems most correct, and increase its version number to a value higher than

cman_tool status | grep "Config Version:"

reports on any node.
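
One way to compare them is to copy one node's file next to the other and diff the two; for example (the node name here is just a placeholder):

scp node2:/etc/cluster/cluster.conf /tmp/cluster.conf.node2
diff /etc/cluster/cluster.conf /tmp/cluster.conf.node2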

 

E.g. if one node reports version 41 and the other 42, you should edit the correct cluster.conf file to have version 43.
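
The version is the config_version attribute on the <cluster> element at the top of cluster.conf, so in this example the opening tag would end up looking something like this (the cluster name is just a placeholder):

<cluster name="mycluster" config_version="43">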

 

Then run 

cman_tool version -r

 on the node that has the updated cluster.conf with the highest version number (though I think it will work on any cluster node). This command will propagate the updated cluster.conf to all nodes through ricci automatically. This should clear the configuration version conflict in your cluster.
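
Afterwards you can confirm the conflict is cleared by re-running

cman_tool status | grep "Config Version:"

on each node and checking that they all report the same, new version.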

 

Anyway, after doing any major operations through luci, and especially if some luci operation has failed, you should check the /var/lib/ricci/queue directories on your cluster nodes. If they contain any files, it means ricci thinks some operation has not been performed to completion yet.
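
A quick way to check is

ls -A /var/lib/ricci/queue

on each node; any files listed there are pending ricci jobs.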

 

If you reboot at this point, ricci will retry the operation after the reboot (possibly after a small delay), which may cause nasty surprises and/or confusion. If ricci seems to be unable to complete some task on some node, you should stop ricci on that node, clear the queue directory, and restart ricci before doing anything else.

MK