One of the cluster nodes is down

admin1979 · ‎01-11-2010

Hello,

We have a two-node cluster running Tru64.
Today we found one of the cluster nodes sitting at the boot prompt. We booted it by entering "b" and the node is back online.
When we checked the binary error log, it showed that the system had panicked on CPU 1 at Jan 9 09:43:38 2010. See the log below.

We would like to know whether this indicates a serious problem. How can we analyse it further?

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 302. PANIC
SEQUENCE NUMBER 39869.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Sat Jan 9 09:43:38 2010
OCCURRED ON SYSTEM bwgc559
SYSTEM ID x000B0022
SYSTYPE x00000000
PROCESSOR COUNT 2.
PROCESSOR WHO LOGGED x00000001
MESSAGE panic (cpu 1): _ics_unable_to_make_progress:
_heartbeat checking blocked

Additionally, we are finding the system to be very slow now.

The top output shows this:

load averages: 7.85, 7.44, 7.42 11:06:58
88 processes: 3 running, 33 waiting, 28 sleeping, 24 idle

CPU states: 0.0% user, 0.0% nice, 99.3% system, 0.5% idle

Memory: Real: 2681M/4007M act/tot Virtual: 1479M use/tot Free: 1208M

PID USERNAME PRI NICE SIZE RES STATE TIME CPU COMMAND
524288 root 0 0 4559M 76M run 45:30 192.60% kernel idle
528506 root 42 0 0K 0K run 0:33 1.20% icssvr_daemon_

We can see that the CPUs are almost fully occupied, nearly all of it in system time.
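
For reference, the header lines of the top output above can be reduced to a couple of numbers with a short script. This is only a minimal sketch in Python: the function name `parse_top_header` is hypothetical, the input text is copied from the output above, and the CPU count of 2 is taken from the "PROCESSOR COUNT 2." line in the event log.

```python
import re

# Header lines copied from the top output in this post.
TOP_HEADER = """\
load averages: 7.85, 7.44, 7.42 11:06:58
CPU states: 0.0% user, 0.0% nice, 99.3% system, 0.5% idle
"""

CPU_COUNT = 2  # assumption: taken from "PROCESSOR COUNT 2." in the event log


def parse_top_header(text):
    """Return (load_averages, cpu_states) parsed from top header text."""
    load_match = re.search(
        r"load averages:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", text
    )
    loads = [float(value) for value in load_match.groups()]
    # Each CPU state looks like "99.3% system"; map state name -> percentage.
    states = {
        name: float(pct) for pct, name in re.findall(r"([\d.]+)%\s+(\w+)", text)
    }
    return loads, states


loads, states = parse_top_header(TOP_HEADER)
load_per_cpu = loads[0] / CPU_COUNT

print(f"1-minute load per CPU: {load_per_cpu:.2f}")
print(f"system time: {states['system']}%")
```

A one-minute load average well above the number of CPUs, combined with almost all CPU time in the "system" state, suggests the time is going to kernel work rather than to user processes.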

Can someone please help?

Martin Moore · ‎01-11-2010

This is a somewhat generic panic message. It means that the node couldn't communicate across the cluster interconnect for a specified period (longer than cluster_rebuild_delay, which is 240 seconds by default), so it panicked to take itself out of the cluster. Several known problems can cause this, and fixes for them are in the latest patch kit for V5.1B.
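
The timeout behaviour described above can be pictured with a small model. This is a hedged sketch in Python and not Tru64 kernel code: the class `InterconnectWatchdog` and its methods are hypothetical names, and the only value taken from this thread is the 240-second default for cluster_rebuild_delay.

```python
import time

REBUILD_DELAY_SECONDS = 240  # default quoted in the post above


class InterconnectWatchdog:
    """Hypothetical model of a node watching its own interconnect progress."""

    def __init__(self, timeout=REBUILD_DELAY_SECONDS, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_progress = clock()

    def record_progress(self):
        """Call whenever interconnect communication succeeds."""
        self.last_progress = self.clock()

    def must_leave_cluster(self):
        """True once no progress has been made for longer than the timeout."""
        return self.clock() - self.last_progress > self.timeout


# Usage with a fake clock so the example runs instantly.
now = [0.0]
watchdog = InterconnectWatchdog(clock=lambda: now[0])

now[0] = 100.0
stalled_at_100 = watchdog.must_leave_cluster()  # still within the delay

now[0] = 241.0
stalled_at_241 = watchdog.must_leave_cluster()  # delay exceeded

print(stalled_at_100, stalled_at_241)
```

In this model the node decides on its own to leave once the delay is exceeded, which corresponds to the panic in the event log.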

Determining the specific cause of a particular crash requires analyzing the crash dump. If you have a support contract with HP, you could log a case to have this done. Or, if you can do crash analysis yourself, you could at least determine whether it's something already fixed in a newer patch kit than the one you are running. If neither of those is an option, all I can suggest is to put on the latest kit and hope for the best.

Martin

I work for HPE

admin1979 · ‎09-01-2010

As mentioned above.
