Re: Serviceguard restart or reset ?

Lee seung-jun · ‎04-27-2008

I have two rp7420 servers (OS version 11.11).
One hostname is mercury,
the other is earth.
Both servers have Serviceguard A.11.16.00 installed and are running services.

Two days ago I installed patches.
The patch list is:
HWEnable 0612
GOLDQPK 0712
Serviceguard Patch PHSS_36898
And an rp7420 firmware upgrade to version 4.10

Yesterday there was no problem.
This morning I checked syslog.log and saw the following error messages:

---------------------------------------------
mercury syslog.log

Apr 28 00:03:19 mercury cmcld: Warning: cmcld process was unable to run for the last 19.81 seconds,
Apr 28 00:03:19 mercury cmcld: which is longer than the node timeout (5.00 seconds)
Apr 28 00:03:19 mercury cmcld: Timer_loop delayed: current state=1 pop=(0,12116943), now=(0,12118924), delta=19s(0,1981)
Apr 28 00:03:19 mercury cmcld: Timer_loop's previous check_timers started at tsb (0,12116942) and lasted 19s (0,1982) executed 2 callbacks
Apr 28 00:03:19 mercury cmcld: Timer_loop's previous sigwait started at tsb (0,12116923) and lasted 0s (0,19)
Apr 28 00:03:19 mercury cmcld: Timer_loop's previous cm_lock started at tsb (0,12116942) and lasted 0s (0,0)
Apr 28 00:03:19 mercury cmcld: Timer_loop's last timer callback (type=8,id=-1) started at tsb (0,12116942) and lasted 19s (0,1982)
Apr 28 00:03:19 mercury cmcld: Timer_loop's last greater than 1s timer callback (type=8,id=-1) started at tsb (0,12116942) and lasted 19s (0,1982)
Apr 28 00:03:19 mercury cmcld: Could not send Heartbeat message to earth
Apr 28 00:03:19 mercury cmcld: Node earth may have died
Apr 28 00:03:19 mercury cmcld: Attempting to form a new cluster
Apr 28 00:03:19 mercury cmcld: Beginning standard election
Apr 28 00:03:19 mercury cmcld: timers delayed 16.20 seconds
Apr 28 00:03:19 mercury cmcld: Timer_loop delayed: current state=3 pop=(0,12117304), now=(0,12118924), delta=16s(0,1620)
Apr 28 00:03:19 mercury cmcld: Timer_loop has been executing timerp(type=41, id=0, poptime=(0,0))since tsb (0,12118924) for 0s(0,0)
Apr 28 00:03:19 mercury cmcld: Timer_loop has been executing check_timer since tsb (0,12118924) for 0s (0,0)
Apr 28 00:03:19 mercury cmcld: Timer_loop's previous check_timers started at tsb (0,12116942) and lasted 19s (0,1982) executed 2 callbacks
Apr 28 00:03:19 mercury cmcld: Timer_loop's previous sigwait started at tsb (0,12116923) and lasted 0s (0,19)
Apr 28 00:03:19 mercury cmcld: Timer_loop's previous cm_lock started at tsb (0,12116942) and lasted 0s (0,0)
Apr 28 00:03:19 mercury cmcld: Timer_loop's last timer callback (type=6,id=1) started at tsb (0,12118924) and lasted 0s (0,0)
Apr 28 00:03:19 mercury cmcld: Timer_loop's last greater than 1s timer callback (type=8,id=-1) started at tsb (0,12116942) and lasted 19s (0,1982)
Apr 28 00:03:19 mercury cmcld: Communication to node earth has been interrupted
Apr 28 00:03:19 mercury cmcld: Attempting to form a new cluster
Apr 28 00:03:19 mercury cmcld: Beginning standard election
Apr 28 00:03:21 mercury cmclconfd[5420]: Updated file /var/adm/cmcluster/frdump.cmcld.6 for node mercury (length = 512096).
Apr 28 00:03:21 mercury cmcld: Attempting to adjust cluster membership
Apr 28 00:03:21 mercury cmcld: Beginning standard partial election
Apr 28 00:03:22 mercury cmcld: Resumed updating safety time
Apr 28 00:03:22 mercury cmcld: 2 nodes have formed a new cluster, sequence #3
Apr 28 00:03:22 mercury cmcld: The new active cluster membership is: earth(id=1), mercury(id=2)
Apr 28 07:55:46 mercury cmclconfd[11882]: ERROR: The identd authenticated user name () did not match with the sender user name (root) while querying for node earth. Exiting.

----------------------------------------------
earth syslog.log

Apr 28 00:03:04 earth cmcld: Timed out node mercury. It may have failed.
Apr 28 00:03:04 earth cmcld: Attempting to adjust cluster membership
Apr 28 00:03:04 earth cmcld: Beginning standard partial election
Apr 28 00:02:54 earth vmunix: NFS fsstat failed for server mercury: RPC: Timed out
Apr 28 00:03:10 earth cmcld: Obtaining Cluster Lock
Apr 28 00:03:11 earth cmcld: Successfully obtained the Cluster Lock
Apr 28 00:03:11 earth cmcld: Turning off safety time protection since the cluster
Apr 28 00:03:11 earth cmcld: may now consist of a single node. If Serviceguard
Apr 28 00:03:11 earth cmcld: fails, this node will not automatically halt
Apr 28 00:03:21 earth cmcld: Enabling safety time protection
Apr 28 00:03:21 earth cmcld: Attempting to adjust cluster membership
Apr 28 00:03:21 earth cmcld: Beginning standard partial election
Apr 28 00:03:21 earth cmclconfd[4179]: Updated file /var/adm/cmcluster/frdump.cmcld.0 for node earth (length = 512096).
Apr 28 00:03:21 earth cmcld: Resumed updating safety time
Apr 28 00:03:21 earth cmclconfd[4179]: Updated file /var/adm/cmcluster/frdump.cmcld.1 for node earth (length = 12444).
Apr 28 00:03:22 earth cmcld: Clearing Cluster Lock
Apr 28 00:03:22 earth cmcld: 2 nodes have formed a new cluster, sequence #3
Apr 28 00:03:22 earth cmcld: The new active cluster membership is: earth(id=1), mercury(id=2)
Apr 28 00:03:23 earth cmcld: Successfully cleared Cluster Lock

-------------------------------------------------

What can I do about this condition?
Why did this happen?

whiteknight · ‎04-27-2008

Hi,

I suspect a possible hardware failure. I would suggest doing a proper shutdown and reboot, then checking whether you get any further errors.

If this symptom still shows up, it is better to log a case with HP.

WK

Problem never ends, you must know how to fix it

Bill Hassell · ‎04-27-2008

Since both systems worked for a day, it is not likely there was a patch problem. First, check /var/adm/syslog/syslog.log for possible hardware issues (disconnected LAN cable, hardware errors). Then check that the two machines can ping each other through the heartbeat LAN. All the errors on both systems relate to network communication between the two systems. Possibly someone has disconnected a LAN cable or changed the configuration of your network switches or routers.

Bill Hassell, sysadmin
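
When comparing the two syslog.log files, it can help to merge them into one timeline so you can see which node noticed the problem first. Below is a minimal Python sketch of that idea, using a few lines copied from the logs earlier in this thread. The function names `parse_entry` and `merged_timeline` are made up for illustration, and the year 2008 is an assumption taken from the post dates, since syslog timestamps carry no year.

```python
from datetime import datetime

def parse_entry(line, year=2008):
    """Split a syslog line into (timestamp, host, message).

    syslog omits the year; 2008 is an assumption based on the post dates.
    """
    month, day, clock, host, message = line.split(None, 4)
    stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return stamp, host, message

def merged_timeline(*logs):
    """Merge syslog excerpts from several nodes into one time-ordered list."""
    entries = [parse_entry(line) for log in logs for line in log]
    # sorted() is stable, so lines with equal timestamps keep their input order.
    return sorted(entries, key=lambda entry: entry[0])

# A few lines copied from the mercury and earth logs in this thread.
mercury = [
    "Apr 28 00:03:19 mercury cmcld: Could not send Heartbeat message to earth",
    "Apr 28 00:03:22 mercury cmcld: 2 nodes have formed a new cluster, sequence #3",
]
earth = [
    "Apr 28 00:03:04 earth cmcld: Timed out node mercury. It may have failed.",
    "Apr 28 00:02:54 earth vmunix: NFS fsstat failed for server mercury: RPC: Timed out",
    "Apr 28 00:03:10 earth cmcld: Obtaining Cluster Lock",
]

for stamp, host, message in merged_timeline(mercury, earth):
    print(stamp.strftime("%H:%M:%S"), host, message)
```

With these sample lines, the entries from earth sort ahead of the first entry from mercury, which only starts logging at 00:03:19. The sketch assumes both nodes' clocks are reasonably in sync; if they are not, the merged order is only approximate.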

Lee seung-jun · ‎04-27-2008

Thanks for your help!

I discussed this case with an HP engineer.
He said that a scheduled backup was running at that time, so network I/O was busy.

As a result, the CPU failed to service the Serviceguard daemon.
He also said I have to update the igelan driver.

Thanks once more!
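
For anyone who wants to check their own syslog.log for the same kind of stall, here is a minimal Python sketch. It pairs the "unable to run" warning with the node timeout line that follows it, using the message wording quoted in the mercury log above. The function name `find_cmcld_stalls` and the regular expressions are my own assumptions for illustration, not part of Serviceguard, and the message wording may differ in other Serviceguard versions.

```python
import re

# Patterns based on the cmcld messages quoted in this thread (an assumption
# for other Serviceguard versions, where the wording may differ).
STALL_RE = re.compile(r"cmcld process was unable to run for the last ([\d.]+) seconds")
TIMEOUT_RE = re.compile(r"node timeout \(([\d.]+) seconds\)")

def find_cmcld_stalls(lines):
    """Return (stall_seconds, node_timeout_seconds) pairs found in syslog lines."""
    stalls = []
    pending = None  # stall duration waiting for its matching timeout line
    for line in lines:
        if "cmcld" not in line:
            continue
        match = STALL_RE.search(line)
        if match:
            pending = float(match.group(1))
            continue
        match = TIMEOUT_RE.search(line)
        if match and pending is not None:
            stalls.append((pending, float(match.group(1))))
            pending = None
    return stalls

# Lines copied from the mercury log, with the wrapped first line rejoined.
sample = [
    "Apr 28 00:03:19 mercury cmcld: Warning: cmcld process was unable to run for the last 19.81 seconds,",
    "Apr 28 00:03:19 mercury cmcld: which is longer than the node timeout (5.00 seconds)",
    "Apr 28 00:03:19 mercury cmcld: Could not send Heartbeat message to earth",
]

for stall, timeout in find_cmcld_stalls(sample):
    print(f"cmcld stalled {stall:.2f}s against a node timeout of {timeout:.2f}s")
```

In the log from mercury, the stall (19.81 seconds) is roughly four times the node timeout (5.00 seconds), which fits with earth timing out mercury before mercury itself logged anything.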

Lee seung-jun · ‎04-27-2008

.
