Operating System - Tru64 Unix

One of the cluster node down

 
admin1979
Super Advisor


Hello,

We have a 2-node cluster running Tru64.
Today we found one of the cluster nodes sitting at the boot prompt. We started the node by entering "b" and it is back online.
When we checked the binary logs, they showed that the system had panicked on CPU 1 on Jan 9 09:43:38 2010. See the log below.

Has a serious problem occurred? How can we analyse this further?

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 302. PANIC
SEQUENCE NUMBER 39869.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Sat Jan 9 09:43:38 2010
OCCURRED ON SYSTEM bwgc559
SYSTEM ID x000B0022
SYSTYPE x00000000
PROCESSOR COUNT 2.
PROCESSOR WHO LOGGED x00000001
MESSAGE panic (cpu 1): _ics_unable_to_make_progress:
_heartbeat checking blocked



Additionally, the system is now very slow.

The top output shows this:

load averages: 7.85, 7.44, 7.42 11:06:58
88 processes: 3 running, 33 waiting, 28 sleeping, 24 idle

CPU states: 0.0% user, 0.0% nice, 99.3% system, 0.5% idle

Memory: Real: 2681M/4007M act/tot Virtual: 1479M use/tot Free: 1208M

PID USERNAME PRI NICE SIZE RES STATE TIME CPU COMMAND
524288 root 0 0 4559M 76M run 45:30 192.60% kernel idle
528506 root 42 0 0K 0K run 0:33 1.20% icssvr_daemon_


We can see that the system is heavily loaded, with almost all CPU time (99.3%) spent in system mode.

Can someone please help?
Martin Moore
HPE Pro

Re: One of the cluster node down

This is a somewhat generic panic message. It means that the system couldn't communicate across the cluster interconnect for a specified period (longer than cluster_rebuild_delay, which is 240 seconds by default), so it panicked to take itself out of the cluster. There are a few problems that are known to cause this, with fixes in the latest patch kit for V5.1B.
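
To make the timeout rule above concrete, here is a toy sketch in Python. It is not the TruCluster kernel logic; the function name `should_panic` and the idea of tracking a single "last progress" timestamp are assumptions made purely for illustration. The only value taken from this thread is the 240-second default for cluster_rebuild_delay.

```python
# Toy model only: illustrates "panic if blocked longer than the delay".
# Nothing here is real Tru64/TruCluster code; names are hypothetical.

CLUSTER_REBUILD_DELAY = 240.0  # seconds; default value quoted in this thread


def should_panic(last_progress, now, delay=CLUSTER_REBUILD_DELAY):
    """Return True if no interconnect progress has been seen for longer than delay.

    last_progress and now are timestamps in seconds on the same clock.
    """
    return (now - last_progress) > delay


if __name__ == "__main__":
    # Blocked for 200 s: still within the delay, node stays in the cluster.
    print(should_panic(last_progress=0.0, now=200.0))
    # Blocked for 241 s: longer than the delay, node would take itself out.
    print(should_panic(last_progress=0.0, now=241.0))
```

In this model a node blocked for exactly 240 seconds does not panic yet, since the rule is "longer than" the delay. Whether the real check is strict at the boundary is not something this thread confirms.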

Determining the specific cause of a particular crash requires analyzing the crash dump. If you have a support contract with HP, you could log a case to have this done. If you can do crash analysis yourself, you could at least determine whether the cause is something already fixed in a patch kit newer than the one you are running. If neither of those is an option, all I can suggest is to install the latest kit and hope for the best.

Martin

admin1979
Super Advisor

Re: One of the cluster node down

As mentioned above.