topic Re: Cluster problem in Operating System - HP-UX

Cluster problem

M. Tariq Ayub — Wed, 21 Jan 2004 21:50:19 GMT

Hi,

I have a two-node cluster. There are 3 packages (billing_pkg, ratig_pkg and ob2cm_pkg) running on these two nodes. Switching was disabled for the rating package. I got the following error messages in syslog, but there was no package interruption (halt or start) during that time.

Billing syslog

Jan 21 17:22:46 billing cmcld: Timed out node rating. It may have failed.
Jan 21 17:22:46 billing cmcld: Attempting to adjust cluster membership
Jan 21 17:22:55 billing cmcld: Obtaining Cluster Lock
Jan 21 17:22:56 billing cmcld: Turning off safety time protection since the cluster
Jan 21 17:22:56 billing cmcld: may now consist of a single node. If ServiceGuard
Jan 21 17:22:56 billing cmcld: fails, this node will not automatically halt
Jan 21 17:22:56 billing cmcld: This will not affect the behavior of Package Failfast
Jan 21 17:22:56 billing cmcld: or Service Failfast. If such a package or service fail,
Jan 21 17:22:56 billing cmcld: this node will automatically halt.
Jan 21 17:23:04 billing cmcld: Enabling safety time protection
Jan 21 17:23:04 billing cmcld: Attempting to adjust cluster membership
Jan 21 17:23:04 billing cmcld: Clearing Cluster Lock
Jan 21 17:23:04 billing cmcld: Resumed updating safety time
Jan 21 17:23:05 billing cmcld: 2 nodes have formed a new cluster, sequence #3
Jan 21 17:23:05 billing cmcld: The new active cluster membership is: billing(id=1), rating(id=2)

Rating syslog

Jan 21 17:23:02 rating cmcld: Warning: cmcld process was unable to run for the last 23 seconds,
Jan 21 17:23:02 rating cmcld: which is longer than the node timeout (8 seconds)
Jan 21 17:23:02 rating cmcld: Communication to node billing has been interrupted
Jan 21 17:23:02 rating cmcld: Node billing may have died
Jan 21 17:23:02 rating cmcld: Attempting to form a new cluster
Jan 21 17:23:04 rating cmcld: Attempting to adjust cluster membership
Jan 21 17:23:05 rating cmcld: Resumed updating safety time
Jan 21 17:23:02 rating cmcld: Communication to node billing has been interrupted
Jan 21 17:23:05 rating cmcld: 2 nodes have formed a new cluster, sequence #3
Jan 21 17:23:02 rating cmcld: Attempting to form a new cluster
Jan 21 17:23:05 rating cmcld: The new active cluster membership is: billing(id=1), rating(id=2)

What might be the reason?

Re: Cluster problem

Geoff Wild — Wed, 21 Jan 2004 22:02:49 GMT

Looks like maybe your heartbeat LAN is timing out... Do you have a dedicated heartbeat LAN?

What is your HEARTBEAT_INTERVAL?

Do you have the HEARTBEAT set across all available networks?
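
For anyone checking these values: the timing parameters live in the cluster ASCII configuration file. Below is a minimal Python sketch that pulls HEARTBEAT_INTERVAL and NODE_TIMEOUT out of such a file and converts them to seconds. The sample text and cluster name are made up for illustration (they are not the poster's configuration), and the sketch assumes the values are stored in microseconds, so verify that against your Serviceguard version before relying on it.

```python
import re

# Illustrative sample in the style of a cluster ASCII file.
# These values are assumptions, not taken from the poster's cluster.
SAMPLE_CONFIG = """\
CLUSTER_NAME        example_cluster
HEARTBEAT_INTERVAL  1000000   # assumed microseconds
NODE_TIMEOUT        8000000   # assumed microseconds
"""

PARAM = re.compile(r"^(HEARTBEAT_INTERVAL|NODE_TIMEOUT)\s+(\d+)$")


def read_timing_params(text):
    """Return {parameter name: value in seconds} for the timing parameters."""
    params = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        match = PARAM.match(line)
        if match:
            # Assumption: the file stores these values in microseconds.
            params[match.group(1)] = int(match.group(2)) / 1_000_000
    return params


timing = read_timing_params(SAMPLE_CONFIG)
for name, seconds in timing.items():
    print(f"{name}: {seconds} s")
```

With a real file, read its contents and pass the text to read_timing_params() in place of SAMPLE_CONFIG.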

Rgds...Geoff

Re: Cluster problem

M. Tariq Ayub — Wed, 21 Jan 2004 22:06:41 GMT

We have a dedicated heartbeat. But my question is: if there was a problem with the heartbeat, then a new cluster would form on the billing node, as it holds the cluster lock disk. Yet there was no problem with the packages.

Re: Cluster problem

Sridhar Bhaskarla — Thu, 22 Jan 2004 00:41:00 GMT

Hi,

I would first look at the rating server. It said the cmcld process was unable to run for 23 seconds, which means the communication from the rating server to the billing server was interrupted for longer than the node_timeout value.

When this happens, the cluster will try to reform and a notice will be sent to all the nodes. Any node that fails to respond to that notice will TOC itself if it doesn't have the cluster lock.

The timestamps of the cmcld logs in your syslog.log indicate the above.

I would pull some stats from the rating server for 17:21 - 17:24 and see if there was any abnormal activity, such as high system load. Even buffer flushes may cause the system to hang temporarily if your buffer cache is too large.
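
To see how often these stalls happen, one option is to scan syslog for the cmcld warning and compare the stall length with the node timeout it reports. This is only a rough Python sketch: the regular expressions are written against the two message lines quoted in this thread, so the exact wording on other Serviceguard versions is an assumption, and find_stalls() is a hypothetical helper name.

```python
import re

# Lines copied from the rating node's syslog as quoted in this thread.
SAMPLE_LOG = """\
Jan 21 17:23:02 rating cmcld: Warning: cmcld process was unable to run for the last 23 seconds,
Jan 21 17:23:02 rating cmcld: which is longer than the node timeout (8 seconds)
Jan 21 17:23:02 rating cmcld: Communication to node billing has been interrupted
"""

STALL = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+cmcld: Warning: "
    r"cmcld process was unable to run for the last (\d+) seconds"
)
TIMEOUT = re.compile(r"node timeout \((\d+) seconds\)")


def find_stalls(text):
    """Collect cmcld stall warnings and the node timeout reported after each."""
    stalls = []
    for line in text.splitlines():
        stall = STALL.match(line)
        if stall:
            stalls.append({
                "when": stall.group(1),
                "node": stall.group(2),
                "stalled_s": int(stall.group(3)),
                "timeout_s": None,
            })
            continue
        timeout = TIMEOUT.search(line)
        # The timeout is printed on the continuation line after the warning.
        if timeout and stalls and stalls[-1]["timeout_s"] is None:
            stalls[-1]["timeout_s"] = int(timeout.group(1))
    return stalls


stalls = find_stalls(SAMPLE_LOG)
for s in stalls:
    print(f'{s["when"]} {s["node"]}: stalled {s["stalled_s"]} s, '
          f'node timeout {s["timeout_s"]} s')
```

Each timestamp it reports marks the end of a stall, so the period to check for load, paging or buffer cache activity is the preceding "stalled" seconds.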

-Sri

Re: Cluster problem

Sridhar Bhaskarla — Thu, 22 Jan 2004 00:51:52 GMT

Hi (Again),

To answer your second question: during the reformation, both nodes responded, so the cluster reformed just in time, without package interruptions. This is common during temporary hangs. However, if this symptom is not treated, it may cause extended timeouts later and may cause the nodes to fail (depending on your configuration).

-Sri

Re: Cluster problem

melvyn burnard — Thu, 22 Jan 2004 03:03:49 GMT

You have had what is often referred to as a mini-hang on the second node, resulting in a loss of heartbeat communications between the nodes. The first node then attempted to reform as a single-node cluster, obtaining the cluster lock disc in order to do this.
Luckily for you, the heartbeat communications were restored just before the second node would have TOC'ed, and the cluster then reformed as a 2-node cluster.
I would suggest you look at the cluster settings, but more importantly investigate why the node was unable to run cmcld; maybe patches need to be updated.