Cluster Formation Failure

 
Lynne Dawson
Occasional Contributor


I have recently seen a scenario where after re-applying the cluster configuration (using cmcheckconf and cmapplyconf with no errors) the cluster failed to form when running cmruncl. Messages in syslog were as follows:

It appears that package applications or resources may be active on this node. Re-starting the cluster could cause data corruption. To recover from this situation reboot this system.
After ensuring that no package applications or resources are active, you can override this data integrity protection by issuing the following commands (which allow the daemon to start without rebooting):
rm /var/adm/cmcluster/.cm_start_time
touch /var/adm/cmcluster/.cm_start_time

None of the shared volume groups were activated at this time, and none of the file systems mounted. Nor was the package IP assigned to any network card, and the only daemon running was the SNMP agent, cmsnmpd.
After removing and touching the .cm_start_time file (without rebooting) the cluster came up without any problems.

Can anyone explain:
a) why the cluster wouldn't form in the first place
b) what is the purpose of .cm_start_time?

Many thanks
Lynne
John Palmer
Honored Contributor

Re: Cluster Formation Failure

Hi Lynne,

It sounds to me as though the cluster had not been shut down properly.

Had you shut the cluster down with cmhaltcl before reapplying the configuration, or was cmapplyconf run on the running cluster?

Regards,
John

Lynne Dawson
Occasional Contributor

Re: Cluster Formation Failure

John

The cluster was shut down with cmhaltcl, and then cmapplyconf was run to update the package configuration only. There were no errors in syslog for either the cmhaltcl or the cmapplyconf.

Lynne
John Palmer
Honored Contributor

Re: Cluster Formation Failure

Hi again,

I must confess that I've never seen this particular problem and can find no information about it.

Are you running the latest version of Serviceguard and do you have all the relevant patches installed?

Recent versions of Serviceguard allow you to reconfigure the cluster (there are a few changes not allowed) while it is still running. As you chose to close the cluster first, does that mean that you are running an old version?

Regards,
John
James R. Ferguson
Acclaimed Contributor

Re: Cluster Formation Failure

Lynne:

Along the same lines as John...

Are the versions of MCSVGD the same on both nodes of the cluster?

Was the binary configuration correctly ported to all nodes in the cluster?

Can you post the syslog file from both (all) nodes?

...JRF...
Lynne Dawson
Occasional Contributor

Re: Cluster Formation Failure

John / James

I can't post the syslog files, since the situation occurred during a training course, and the nodes have been rebooted numerous times since! But both nodes are running MC/SG 11.03, and (as far as I know) they are patched correctly.

MC/SG 11.03 does allow some online reconfiguration, but since the network IPs were being altered, the cluster had to be shut down before applying the new configuration.

James, your point about correctly porting out the binary could be valid (this is a training scenario and the students could have made a mistake somewhere), but I think it's unlikely since cmapplyconf does the actual distribution.

As it's a training environment, and syslog told us how to resolve the issue (remove and touch the .cm_start_time file), I'm not worried if we don't find an answer. I'm just curious to know what the purpose of .cm_start_time is, especially since one of the students asked and I didn't know the answer! But thanks for your input.

Lynne
John Palmer
Honored Contributor

Re: Cluster Formation Failure

Hi Lynne,

The first four bytes are a binary representation of the time that the cluster was started (in seconds since the epoch).

As to the last four bytes - I'm not sure but it could be a representation of microseconds.

After all, the filename does give us a bit of a clue ;-)
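
For anyone who wants to peek inside the file, here is a minimal Python sketch of decoding it along the lines described above. It rests on assumptions that are not confirmed anywhere in this thread: that the file is exactly two unsigned 32-bit big-endian fields (big-endian because HP-UX on PA-RISC is), and that the second field really is microseconds, which is only a guess. The function name `decode_start_time` is made up for illustration.

```python
import struct
from datetime import datetime, timezone


def decode_start_time(raw: bytes):
    """Decode the raw contents of a .cm_start_time file.

    Assumed layout (not confirmed): two unsigned 32-bit big-endian fields,
    seconds since the epoch followed by a second value that may be
    microseconds. Returns None for an empty (zero-length) file.
    """
    if len(raw) == 0:
        return None  # zero-size file: nothing recorded
    if len(raw) < 8:
        raise ValueError("expected at least 8 bytes, got %d" % len(raw))
    seconds, second_field = struct.unpack(">II", raw[:8])
    started = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return started, second_field
```

On a node you would pass it the bytes read from /var/adm/cmcluster/.cm_start_time opened in binary mode. If the decoded date looks nonsensical, the layout assumption is probably wrong.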

Regards,
John

melvyn burnard
Honored Contributor
Solution

Re: Cluster Formation Failure

.cm_start_time is used by cmcld when it tries to start cluster activity on a node.
It checks for the file. If it finds it, is it 0 bytes in size? If yes, the cluster was halted gracefully and cmcld can start.
If no, is the boot time LATER than the cluster start time recorded in .cm_start_time?
If yes, cmcld updates .cm_start_time and then starts cluster activity.
If no, cmcld does not start, and logs the reason why. Hence the instruction to remove the file and then touch it, creating a 0-size file.
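
The check described above can be sketched in Python roughly as follows. This is only an illustration of the decision order, not cmcld's actual code: the names `Decision` and `startup_decision` are hypothetical, times are plain integers in seconds since the epoch, and the missing-file case is inferred from the "failed to open" log messages quoted elsewhere in this thread.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    START = "start cluster activity"
    UPDATE_AND_START = "update .cm_start_time, then start cluster activity"
    REFUSE = "do not start; log the reason"


def startup_decision(file_size: Optional[int],
                     boot_time: int,
                     recorded_start: int = 0) -> Decision:
    """Illustrative model of the .cm_start_time check.

    file_size      -- size of the file in bytes, or None if it is missing
    boot_time      -- when this node last booted
    recorded_start -- cluster start time stored in the file (ignored if empty)
    """
    if file_size is None:
        # Assumption: a missing file makes the daemon exit with an error.
        return Decision.REFUSE
    if file_size == 0:
        # Empty file: the cluster was halted gracefully.
        return Decision.START
    if boot_time > recorded_start:
        # The node has rebooted since the cluster last started, so nothing
        # from the old cluster run can still be active.
        return Decision.UPDATE_AND_START
    # Non-empty file and no reboot since: packages may still be active.
    return Decision.REFUSE
```

In the situation from the original question, the file was non-empty and the node had not been rebooted since the cluster last started, which is the final `REFUSE` branch. Removing the file and touching it moves you to the `file_size == 0` branch.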

11.03 has been superseded by 11.04, then 11.05, then 11.07, then 11.08, and the CURRENT release is 11.09.
I would advise running courses on the newest possible release, with the appropriate patches.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Stephen Doud
Honored Contributor

Re: Cluster Formation Failure

/var/adm/cmcluster/.cm_start_time is zeroed when the cluster is halted (normally).
When the cluster is started, and that file is non-zero, the "uptime" is checked to see if the server was rebooted more recently. If not, those messages you saw in syslog are reported. So, the cmhaltcl was unable to zero the .cm_start_time file for some reason.

If the file were missing, you would have seen messages like these:

cmcld: failed to open /var/adm/cmcluster/.cm_start_time
cmclconfd: the ServiceGuard daemon, /usr/lbin/cmcld[847], exited with a status of 1
cmclconfd: lost connection to cluster daemon
cmclconfd: Unable to lookup any node information in CDB: No such file or directory