Server Rebooted Automatically

Fergus Brophy · ‎10-28-2009

Hello Guys,
I have a server that rebooted automatically late at night. I am trying to get to the bottom of it. I can see the following entry in the /etc/shutdownlog file.
04:24 Tue Oct 27 2009. Reboot after panic: SafetyTimer expired, INIT, IIP:0xe0000000013fe3d0 IFA:0xe000000180f5a300

Would anyone know anything about the above message? Thank you.

Fergus Brophy · ‎10-28-2009

Sorry, people, a bit more information.
It is an Itanium box running HP-UX 11.23. The server is part of a cluster running SG (Serviceguard).

Duncan Edmonstone · ‎10-28-2009

The message indicates that the system was TOC'd because the kernel safety timer popped, which means some sort of cluster-related issue. Can you post the end of OLDsyslog.log for this system and the same time period for the other node(s) in the cluster?

HTH

Duncan

I am an HPE Employee

Fergus Brophy · ‎10-28-2009

Hello Duncan,

Below are the extracts from the syslogs. There was no relevant entry in the OLDsyslog.log of the node that rebooted; it just had the usual syslog messages and that was it. I don't see any reference to any errors on NODE1.

NODE1

END Of OLDsyslog.log

Oct 27 03:30:03 NODE1 su: + tty?? root-root
Oct 27 03:31:11 NODE1 su: + tty?? root-root
Oct 27 03:31:52 NODE1 above message repeats 15 times

Start of syslog.log

Oct 27 04:24:23 NODE1 syslogd: restart
Oct 27 04:24:23 NODE1 vmunix: Found adjacent data tr. Growing size. 0x348c000 -> 0x748c000.
Oct 27 04:24:23 NODE1 vmunix: Pinned PDK malloc pool: base: 0xe000000100b74000 size=119344K
Oct 27 04:24:23 NODE1 vmunix: Loaded ACPI revision 2.0 tables.
Oct 27 04:24:23 NODE1 vmunix: MMIO on this platform supports Write Coalescing.
Oct 27 04:24:23 NODE1 vmunix:
Oct 27 04:24:23 NODE1 vmunix: MFS is defined: base= 0xe000000100b74000 size= 1368 KB
Oct 27 04:24:23 NODE1 vmunix: Unpinned PDK malloc pool: base: 0xe000000108000000 size=131072K
Oct 27 04:24:23 NODE1 vmunix: NOTICE: cachefs_link(): File system was registered at index 5.
Oct 27 04:24:23 NODE1 vmunix: NOTICE: nfs3_link(): File system was registered at index 8.

NODE 2

Syslog.log

Oct 27 04:15:22 NODE2 cmcld: Timed out node NODE1. It may have failed.
Oct 27 04:15:22 NODE2 cmcld: Attempting to adjust cluster membership
Oct 27 04:15:22 NODE2 cmcld: Beginning standard partial election
Oct 27 04:15:23 NODE2 cmclconfd[1853]: Updated file /var/adm/cmcluster/frdump.cmcld.4 for node NODE2 (length = 512096).
Oct 27 04:15:28 NODE2 cmcld: Clearing Cluster Lock
Oct 27 04:15:28 NODE2 cmcld: Resumed updating safety time
Oct 27 04:15:32 NODE2 cmcld: Heartbeat connection attempt to node NODE1 timed out
Oct 27 04:15:33 NODE2 cmclconfd[1853]: Updated file /var/adm/cmcluster/frdump.cmcld.5 for node NODE2 (length = 10124).
Oct 27 04:16:50 NODE2 cmcld: 2 nodes have formed a new cluster, sequence #26
Oct 27 04:16:50 NODE2 cmcld: The new active cluster membership is: NODE2(id=2), NODE3(id=3)
Oct 27 04:16:50 NODE2 cmcld: One of the nodes is down.
Oct 27 04:17:30 NODE2 cmcld: (NODE3) Started package temip1_TeMIP on node NODE3.
Oct 27 04:25:16 NODE2 cmcld: New node NODE1 is joining the cluster
Oct 27 04:25:16 NODE2 cmcld: Attempting to adjust cluster membership
Oct 27 04:25:16 NODE2 cmcld: Beginning standard partial election
Oct 27 04:25:16 NODE2 cmcld: Clearing Cluster Lock
Oct 27 04:25:19 NODE2 cmcld: 3 nodes have formed a new cluster, sequence #27
Oct 27 04:25:19 NODE2 cmcld: The new active cluster membership is: NODE2(id=2), NODE3(id=3), NODE1(id=1)
Oct 27 04:25:20 NODE2 cmcld: Resumed updating safety time

NODE 3
syslog.log

Oct 27 04:15:22 NODE3 cmcld: Attempting to adjust cluster membership
Oct 27 04:15:22 NODE3 cmcld: Beginning standard partial election
Oct 27 04:15:28 NODE3 cmcld: Resumed updating safety time
Oct 27 04:16:50 NODE3 cmcld: 2 nodes have formed a new cluster, sequence #26
Oct 27 04:16:50 NODE3 cmcld: The new active cluster membership is: NODE2(id=2), NODE3(id=3)
Oct 27 04:16:50 NODE3 cmcld: One of the nodes is down.
Oct 27 04:16:50 NODE3 cmcld: Request from node NODE2 to start package temip1_TeMIP on node NODE3.
Oct 27 04:16:50 NODE3 cmcld: Executing '/etc/cmcluster/temip1_TeMIP/temip1_TeMIP.cntl start' for package temip1_TeMIP, as service PKG*29698.
Oct 27 04:17:30 NODE3 cmcld: Service PKG*29698 terminated due to an exit(0).
Oct 27 04:17:30 NODE3 cmcld: Started package temip1_TeMIP on node NODE3.
Oct 27 04:25:16 NODE3 cmcld: Attempting to adjust cluster membership
Oct 27 04:25:16 NODE3 cmcld: Beginning standard partial election
Oct 27 04:25:17 NODE3 cmcld: Resumed updating safety time
Oct 27 04:25:19 NODE3 cmcld: 3 nodes have formed a new cluster, sequence #27
Oct 27 04:25:19 NODE3 cmcld: The new active cluster membership is: NODE2(id=2), NODE3(id=3), NODE1(id=1)

Fabian Briseño · ‎10-28-2009

Hello Fergus.
Was a crash dump created in /var/adm/crash?

If so, you can get HP to analyze the dump, provided you have a support contract.

Knowledge is power.

Raj D. · ‎10-28-2009

Fergus,

- 3:31 to 4:24: It looks like NODE1 froze or was down. Reason unknown.
- 4:24: NODE1 is back up after the reboot.

- 4:15: NODE2 detects that NODE1 may have failed.
- 4:25: Cluster re-formed. Everything is up.

You need to check that everything is OK on NODE1, and also check the LAN connections on NODE1. It looks like it stopped responding in the cluster and crashed because the safety timer expired:
- Check /var/adm/crash/ for a crash dump.
- Check /var/adm/tombstones/ for any files.
- Check the LAN connections and nettl.LOG00.
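
A minimal sketch, in Python, of how the timeline above can be rebuilt from the posted syslog lines. The four lines are copied from the extracts earlier in the thread. The year (2009) is an assumption taken from the /etc/shutdownlog entry, because syslog timestamps do not carry one, and the helper names are hypothetical.

```python
from datetime import datetime

# Lines copied from the syslog extracts posted earlier in this thread.
LAST_NODE1 = "Oct 27 03:31:52 NODE1 above message repeats 15 times"
TIMED_OUT = "Oct 27 04:15:22 NODE2 cmcld: Timed out node NODE1. It may have failed."
RESTART = "Oct 27 04:24:23 NODE1 syslogd: restart"
REJOINED = "Oct 27 04:25:19 NODE2 cmcld: 3 nodes have formed a new cluster, sequence #27"


def parse(line, year=2009):
    """Split a syslog line into (timestamp, host, message).

    The year is an assumption: syslog lines do not include it.
    """
    month, day, clock, host, message = line.split(None, 4)
    stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return stamp, host, message


def gap_seconds(earlier, later):
    """Whole seconds elapsed between two syslog lines."""
    return int((parse(later)[0] - parse(earlier)[0]).total_seconds())


# Merge the lines from both nodes into one time-ordered list.
events = sorted(parse(line) for line in (REJOINED, RESTART, TIMED_OUT, LAST_NODE1))
for stamp, host, message in events:
    print(stamp.strftime("%H:%M:%S"), host, message)

print("last NODE1 entry -> NODE2 timeout:", gap_seconds(LAST_NODE1, TIMED_OUT), "s")
print("NODE2 timeout -> NODE1 syslogd restart:", gap_seconds(TIMED_OUT, RESTART), "s")
```

The first gap only shows when NODE1 last wrote to syslog, not when it actually stopped responding; the second is the window between NODE2 timing NODE1 out and NODE1 coming back up.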

Hth,
Raj.

" If u think u can , If u think u cannot , - You are always Right . "

Viveki · ‎10-28-2009

Hi

The "SafetyTimer expired" error occurs when the heartbeat LAN has some issues. You can check the nettl logs for any possible clue.

Fergus Brophy · ‎10-28-2009

Hi guys,
I have checked the nettl logs and I don't see any issue in there. There are a few entries, but they occur at the time of the system shutdown, so I assume they were generated by the shutdown. I have attached them for you anyway. Is there any other way I can check what caused this? There don't seem to be any errors relating to the heartbeat LAN in any of the logs.

-------------------100BT/Gigabit Ethernet LAN/9000 Networking---------------@#%
Timestamp : Tue Oct 27 GMT 2009 04:24:35.948419
Process ID : [ICS] Subsystem : IETHER
User ID ( UID ) : -1 Log Class : ERROR
Device ID : 0 Path ID : 0
Connection ID : 0 Log Instance : 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

<2000> 1000Base-T in path 0/1/2/0
Detected a faulty or disconnected cable.

-------------------100BT/Gigabit Ethernet LAN/9000 Networking---------------@#%
Timestamp : Tue Oct 27 GMT 2009 04:24:36.025734
Process ID : [ICS] Subsystem : IETHER
User ID ( UID ) : -1 Log Class : ERROR
Device ID : 6 Path ID : 0
Connection ID : 0 Log Instance : 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

<2000> 1000Base-T in path 0/4/1/0/4/0
Detected a faulty or disconnected cable.

-------------------100BT/Gigabit Ethernet LAN/9000 Networking---------------@#%
Timestamp : Tue Oct 27 GMT 2009 04:24:37.261042
Process ID : [ICS] Subsystem : IETHER
User ID ( UID ) : -1 Log Class : ERROR
Device ID : 6 Path ID : 0
Connection ID : 0 Log Instance : 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

<2000> 1000Base-T in path 0/4/1/0/4/0
Detected a faulty or disconnected cable.

Raj D. · ‎10-28-2009

Fergus,

<2000> 1000Base-T in path 0/1/2/0
Detected a faulty or disconnected cable.

-------------------100BT/Gigabit Ethernet LAN/9000 Networking---------------@#%
Timestamp : Tue Oct 27 GMT 2009 04:24:36.025734
Process ID : [ICS] Subsystem : IETHER
User ID ( UID ) : -1 Log Class : ERROR
Device ID : 6 Path ID : 0
Connection ID : 0 Log Instance : 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~

These entries may have been generated during shutdown, so this information may not be very accurate.

- Check whether there is any CPU timestamp in the /var/adm/tombstones/ts99 file (the system may have been brought down by a bad CPU or a bad PCI card).
- Check the GSP logs to see if anything can be found. You can filter with code 7 (Fatal).

However, regarding "SafetyTimer expired, INIT":
- "INIT" suggests that the safety timer expiration is most likely due to a software issue and may not be a hardware issue.

- Finally, you can engage HP to analyse the dump in /var/adm/crash/crash.x/

Hth,
Raj.

" If u think u can , If u think u cannot , - You are always Right . "

Viveki · ‎10-28-2009

Hi Fergus,

It is the other way around. The entries are not there because of the system shutdown; the system shut down because of what the entries report. Put simply, there was a network disconnection and the heartbeat interface failed.

There won't be any entries in the nettl log because of a shutdown. You can compare any previous entries with your old shutdown logs.

Michael Steele_2 · ‎10-28-2009

Hi

Looks like you have stepped upon an already known problem with a patch fix.

http://www13.itrc.hp.com/service/cki/docDisplay.do?docLocale=en&docId=emr_na-c01878548-2

http://www13.itrc.hp.com/service/cki/docDisplay.do?docLocale=en&docId=emr_na-c00905839-9

The patches are embedded in the kmine docs.

Support Fatherhood - Stop Family Law

Fergus Brophy · ‎10-29-2009

Hello Michael,

I don't have access to those links. Would you be able to explain what was in them?

Thanks very much.

Fergus Brophy · ‎10-29-2009

Hello Viveki,
Are you sure about that? I have looked at other systems and I can see the same errors in their nettl.log00 files. Comparing the timestamps of the "faulty or disconnected cable" entries with the uptime of each server, it seems that this error occurs every time the servers are shut down.

Thanks.
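
A small sketch of that timestamp comparison, applied to NODE1's own logs from this thread. It assumes that the syslog and nettl timestamps are in the same time zone (nettl prints GMT; the syslog zone is not stated anywhere in the thread) and that the syslog year is 2009. The function names are hypothetical.

```python
from datetime import datetime

# Copied from the posts above: the syslogd restart on NODE1 and the three
# nettl "faulty or disconnected cable" timestamps.
SYSLOG_RESTART = "Oct 27 04:24:23"
NETTL_STAMPS = [
    "Tue Oct 27 GMT 2009 04:24:35.948419",
    "Tue Oct 27 GMT 2009 04:24:36.025734",
    "Tue Oct 27 GMT 2009 04:24:37.261042",
]


def parse_syslog(stamp, year=2009):
    """Parse a syslog timestamp; the year is an assumption (syslog omits it)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def parse_nettl(stamp):
    """Parse a nettl 'Timestamp' field as formatted in the posted log."""
    return datetime.strptime(stamp, "%a %b %d GMT %Y %H:%M:%S.%f")


def offsets_from_restart(restart, nettl_stamps):
    """Seconds between the syslogd restart and each nettl entry.

    Positive means the nettl entry was logged after syslogd restarted.
    """
    base = parse_syslog(restart)
    return [(parse_nettl(s) - base).total_seconds() for s in nettl_stamps]


for offset in offsets_from_restart(SYSLOG_RESTART, NETTL_STAMPS):
    print(f"{offset:+.3f} s relative to syslogd restart")
```

Under those assumptions, the three cable entries fall roughly 13 to 14 seconds after syslogd restarted on NODE1. If the two clocks are in different zones the offsets shift by whole hours, so the zone is worth confirming before drawing any conclusion.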

Michael Steele_2 · ‎10-29-2009

PHSS_40145: 11.31 Serviceguard A.11.19.00
ABORT PANIC If cmcld receives unexpected data, cmcld may hang, resulting in a node TOC. The following message will be logged in the flight recorder log:
SEC:01: Event - Unknown message version

See Attached

Support Fatherhood - Stop Family Law

Michael Steele_2 · ‎10-29-2009

Well, this is a personal question for you. Did you shut down the server before halting the node?

IMPROPER SHUTDOWN

Another reason for a Serviceguard TOC may be performing a shutdown or
reboot command before taking the node out of the Serviceguard cluster.

A symptom of this is often recorded in the /etc/shutdownlog:

21:22 Tue Sep 06 2005. Reboot after panic: SafetyTimer expired, INIT,
IIP:0xe000000000643680 IFA:0xe0000001f8fd8056

The shutdown command initiates the "/sbin/init.d/cmcluster stop"
script, which performs a "cmhaltnode". Normally, cmhaltnode signals all
packages to shutdown, terminates all Serviceguard processes and terminates
the kernel safety timer which is used to detect a kernel hang.

If a package fails to halt properly however, cmhaltnode will not terminate
cmcld and the safety-timer process is left running. Consequently the
shutdown command will eventually perform a 'reboot' which will kill cmcld,
leaving the safety timer counting down. If the timer reaches zero before the
O/S shuts down, a TOC occurs.

Owing to the fact that "fuser -ku" is not designed to find and kill
all processes keeping files open, the most common cause of package halt
failure is the inability to umount a file system by the control script. (See
the packages' control log)

The recommended shutdown procedure is to perform cmhaltnode manually prior to
performing the shutdown command.
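
The sequence described above can be illustrated with a toy model in Python. This is only a sketch of the general watchdog idea, not Serviceguard's implementation: the class, the function, and the timeout and shutdown durations are all hypothetical values chosen for illustration.

```python
class SafetyTimer:
    """Toy watchdog: it must be refreshed or disarmed, or it expires."""

    def __init__(self, timeout):
        self.timeout = timeout      # hypothetical value, in seconds
        self.remaining = timeout
        self.armed = True

    def refresh(self):
        """What a healthy cluster daemon would do periodically."""
        self.remaining = self.timeout

    def disarm(self):
        """What a clean node halt would do before the OS shuts down."""
        self.armed = False

    def tick(self, seconds=1):
        """Advance time; return True once the timer has expired."""
        if not self.armed:
            return False
        self.remaining -= seconds
        return self.remaining <= 0


def simulate_shutdown(halt_succeeds, shutdown_seconds, timeout=10):
    """Model an OS shutdown that takes shutdown_seconds to complete."""
    timer = SafetyTimer(timeout)
    if halt_succeeds:
        timer.disarm()  # clean node halt: the timer is stopped
    # Otherwise the daemon has been killed and nothing refreshes the timer.
    for _ in range(shutdown_seconds):
        if timer.tick():
            return "TOC"
    return "clean shutdown"


print(simulate_shutdown(halt_succeeds=True, shutdown_seconds=60))
print(simulate_shutdown(halt_succeeds=False, shutdown_seconds=60))
print(simulate_shutdown(halt_succeeds=False, shutdown_seconds=5))
```

The third call shows why the problem can be intermittent: if the OS happens to finish shutting down before the timer runs out, a failed halt goes unnoticed.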

Support Fatherhood - Stop Family Law

Fergus Brophy · ‎10-29-2009

Thanks Michael,
I think I have the sequence of events clear now. Putting all the responses together with yours, it looks as if NODE1 lost connectivity with the cluster and the package tried to halt but failed to come down cleanly because a user was in the mounted file system. Hence the safety timer was not stopped and the node rebooted as a result. All that is left is to figure out why the server lost its connection to the heartbeat LAN.
Thanks.

Michael Steele_2 · ‎10-29-2009

Hi

Well, I have to disagree with this comment: "... due to a user in the mounted file system..."

In that case, the error in syslog would be that the VG was unable to deactivate. It's a fairly common occurrence that shows up here in the forum pretty regularly, unlike your problem, which is hard to find search hits on.

Support Fatherhood - Stop Family Law

purushottamaher · ‎11-22-2014

Hi,

I also had the same issue, and I found the solution here:

http://expertisenpuru.com/reboot-after-panic-server-rebooted-automatically-in-hp-ux/
