System hang detected via timer popping on cluster nodes

SOLVED
Michele (Mike) Alberton
Regular Advisor

Hi Folks,

this is so strange.

We have two rp7400 servers in a ServiceGuard cluster (HP-UX 11.00), sharing a dual-controller va7400.

Unfortunately, one of the controllers in the va7400 is DOWN, so we reconfigured the array to use a single controller.

The cmclconfig.ascii has FIRST_CLUSTER_LOCK_PV configured on the device file that targets the controller which is down. The cluster runs fine, but each time one of the two nodes (let's say the standby) gets rebooted, the main node follows soon after with "System hang detected via timer popping", after dumping memory.

Is there any explanation for that? When I cmhalted the standby node before rebooting it, the problem did not occur. I'll change the lock to point to the controller which is up, but this looks to me like a SPOF in such a redundant environment. Could you give me some feedback?

Thanks !

Mike
Francis_12
Trusted Contributor

Re: System hang detected via timer popping on cluster nodes

Hello,

You should definitely log a call with your local HP support service.

The 'timer popping' issue can have several root causes.

You will need to provide:

- the GSP logs
- the kernel dump, if there is one
- the syslog.log, OLDsyslog.log and dmesg outputs

Hope this helps, Bye.

Francis DERDEYN - HP-UX ASCE.
Michele (Mike) Alberton
Regular Advisor

Re: System hang detected via timer popping on cluster nodes

Hi Francis,

I'd be fine with that if only one of the two nodes were involved, but when both nodes are part of the cluster, whichever one reboots, the other will crash.

If I cmhaltnode the standby first, everything is OK. It looks to me like an issue with the way the cluster lock (which is unobtainable because the disk controller is down) is handled in this situation.

Thanks,

Mike
Rajeev  Shukla
Honored Contributor

Re: System hang detected via timer popping on cluster nodes

Hi Mike,
Your cluster will run fine without FIRST_CLUSTER_LOCK_PV being accessible as long as there is no cluster re-formation. When a re-formation happens, i.e. a node leaves or joins the cluster, all the online nodes will try to access FIRST_CLUSTER_LOCK_PV, and failing to do so results in unexpected behaviour such as hanging.
The same applies in your case: your cluster is fine as long as the other node is not rebooted. As soon as you reboot it, the remaining online server tries to access FIRST_CLUSTER_LOCK_PV, and since it can't, the node hangs. The cmcld daemon is in a hung state.

The first thing you might do is make sure the FIRST_CLUSTER_LOCK_PV disk is seen by both nodes.

Francis_12
Trusted Contributor

Re: System hang detected via timer popping on cluster nodes

Hello again,

If both nodes are showing the same message, I would strongly recommend that you log a call.

In my view, this could well be related to firmware issues: PDC, GSP.

Hope this helps, Bye.

Francis DERDEYN - HP-UX ASCE.
Rajeev  Shukla
Honored Contributor
Solution

Re: System hang detected via timer popping on cluster nodes

Hi Mike,
Just realised I meant "crash" when I said "hanging".
This is because when one of your nodes leaves the cluster, a cluster re-formation happens, but since the node which is still up cannot access the cluster lock disk to become the cluster manager, a TOC happens on that node to halt it. That is the normal behaviour of MC/SG.
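
To illustrate the arbitration described here, a minimal Python sketch follows. It assumes the commonly documented ServiceGuard quorum rule: a group holding more than half of the previous members re-forms on its own, a group holding exactly half needs the cluster lock as a tie-breaker, and a group holding fewer than half halts. The function name `reformation_outcome` and its parameters are made up for illustration; this is not ServiceGuard code.

```python
def reformation_outcome(previous_members: int,
                        surviving_members: int,
                        lock_acquired: bool) -> str:
    """Return 'reform' or 'toc' for a surviving group of nodes.

    Hypothetical model of the quorum rule, not the real cmcld logic.
    """
    if previous_members <= 0 or not 0 <= surviving_members <= previous_members:
        raise ValueError("invalid member counts")

    # Compare 2 * survivors against the previous size to avoid fractions.
    votes = 2 * surviving_members
    if votes > previous_members:
        return "reform"  # strict majority: no lock needed
    if votes == previous_members and lock_acquired:
        return "reform"  # exactly half: the lock breaks the tie
    return "toc"         # no quorum: the node is halted


# Two-node cluster, one node rebooted, lock disk unreachable:
print(reformation_outcome(2, 1, lock_acquired=False))
# Same event with a reachable lock disk:
print(reformation_outcome(2, 1, lock_acquired=True))
```

Under these assumptions the first call yields "toc" and the second "reform", which matches what Mike is seeing: the survivor of a two-node cluster always holds exactly half the votes, so it depends on the lock.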

Another thing you might consider:
If you just have a 2-node cluster, why do you want a cluster lock disk?
Michele (Mike) Alberton
Regular Advisor

Re: System hang detected via timer popping on cluster nodes

Thanks, folks!

Regarding the need for a lock: this is an old configuration that has been in place for a long time, so we could take this chance to think about changing it :-)

Thanks again for your quick participation!

Mike

Re: System hang detected via timer popping on cluster nodes

I'm not sure what Rajeev is suggesting, but you MUST have some form of cluster lock in a two-node cluster. This can be either a cluster lock disk, like you currently have, or (if you are on the latest version of ServiceGuard) a separate quorum server.

Running without a cluster lock is not an option in a 2-node cluster.

HTH

Duncan

I am an HPE Employee
Steven E. Protter
Exalted Contributor

Re: System hang detected via timer popping on cluster nodes

I have had that error. I had the Core I/O card on my rp5450 replaced and it stopped happening.

Took some time to convince hardware support they had to do anything though.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com