Re: Server Rebooted Automatically

Fergus Brophy · ‎10-28-2009

Hello Guys,
I have a server that rebooted automatically late at night. I am trying to get to the bottom of it. I can see the following entry in the /etc/shutdownlog file.
04:24 Tue Oct 27 2009. Reboot after panic: SafetyTimer expired, INIT, IIP:0xe0000000013fe3d0 IFA:0xe000000180f5a300

Would anyone know anything about the above message? Thank you.

Fergus Brophy · ‎10-28-2009

Sorry people, a bit more information.
It is an Itanium box running HP-UX 11.23. The server is part of a multi-node cluster running Serviceguard (SG).

Duncan Edmonstone · ‎10-28-2009

The message indicates that the system was TOC'd because the kernel safety timer popped, which means some sort of cluster-related issue. Can you post the end of OLDsyslog.log for this system and the similar time period for the other node(s) in the cluster?

HTH

Duncan

I am an HPE Employee

Fergus Brophy · ‎10-28-2009

Hello Duncan,

Below are the extracts of the syslog. There was no relevant entry in the OLDsyslog.log of the node that rebooted. It just had the usual syslog messages and that was it. I don't see any reference to any errors on NODE1.

NODE1

END Of OLDsyslog.log

Oct 27 03:30:03 NODE1 su: + tty?? root-root
Oct 27 03:31:11 NODE1 su: + tty?? root-root
Oct 27 03:31:52 NODE1 above message repeats 15 times

Start of syslog.log

Oct 27 04:24:23 NODE1 syslogd: restart
Oct 27 04:24:23 NODE1 vmunix: Found adjacent data tr. Growing size. 0x348c000 -> 0x748c000.
Oct 27 04:24:23 NODE1 vmunix: Pinned PDK malloc pool: base: 0xe000000100b74000 size=119344K
Oct 27 04:24:23 NODE1 vmunix: Loaded ACPI revision 2.0 tables.
Oct 27 04:24:23 NODE1 vmunix: MMIO on this platform supports Write Coalescing.
Oct 27 04:24:23 NODE1 vmunix:
Oct 27 04:24:23 NODE1 vmunix: MFS is defined: base= 0xe000000100b74000 size= 1368 KB
Oct 27 04:24:23 NODE1 vmunix: Unpinned PDK malloc pool: base: 0xe000000108000000 size=131072K
Oct 27 04:24:23 NODE1 vmunix: NOTICE: cachefs_link(): File system was registered at index 5.
Oct 27 04:24:23 NODE1 vmunix: NOTICE: nfs3_link(): File system was registered at index 8.

NODE 2

Syslog.log

Oct 27 04:15:22 NODE2 cmcld: Timed out node NODE1. It may have failed.
Oct 27 04:15:22 NODE2 cmcld: Attempting to adjust cluster membership
Oct 27 04:15:22 NODE2 cmcld: Beginning standard partial election
Oct 27 04:15:23 NODE2 cmclconfd[1853]: Updated file /var/adm/cmcluster/frdump.cmcld.4 for node NODE2 (length = 512096).
Oct 27 04:15:28 NODE2 cmcld: Clearing Cluster Lock
Oct 27 04:15:28 NODE2 cmcld: Resumed updating safety time
Oct 27 04:15:32 NODE2 cmcld: Heartbeat connection attempt to node NODE1 timed out
Oct 27 04:15:33 NODE2 cmclconfd[1853]: Updated file /var/adm/cmcluster/frdump.cmcld.5 for node NODE2 (length = 10124).
Oct 27 04:16:50 NODE2 cmcld: 2 nodes have formed a new cluster, sequence #26
Oct 27 04:16:50 NODE2 cmcld: The new active cluster membership is: NODE2(id=2), NODE3(id=3)
Oct 27 04:16:50 NODE2 cmcld: One of the nodes is down.
Oct 27 04:17:30 NODE2 cmcld: (NODE3) Started package temip1_TeMIP on node NODE3.
Oct 27 04:25:16 NODE2 cmcld: New node NODE1 is joining the cluster
Oct 27 04:25:16 NODE2 cmcld: Attempting to adjust cluster membership
Oct 27 04:25:16 NODE2 cmcld: Beginning standard partial election
Oct 27 04:25:16 NODE2 cmcld: Clearing Cluster Lock
Oct 27 04:25:19 NODE2 cmcld: 3 nodes have formed a new cluster, sequence #27
Oct 27 04:25:19 NODE2 cmcld: The new active cluster membership is: NODE2(id=2), NODE3(id=3), NODE1(id=1)
Oct 27 04:25:20 NODE2 cmcld: Resumed updating safety time

NODE 3
syslog.log

Oct 27 04:15:22 NODE3 cmcld: Attempting to adjust cluster membership
Oct 27 04:15:22 NODE3 cmcld: Beginning standard partial election
Oct 27 04:15:28 NODE3 cmcld: Resumed updating safety time
Oct 27 04:16:50 NODE3 cmcld: 2 nodes have formed a new cluster, sequence #26
Oct 27 04:16:50 NODE3 cmcld: The new active cluster membership is: NODE2(id=2), NODE3(id=3)
Oct 27 04:16:50 NODE3 cmcld: One of the nodes is down.
Oct 27 04:16:50 NODE3 cmcld: Request from node NODE2 to start package temip1_TeMIP on node NODE3.
Oct 27 04:16:50 NODE3 cmcld: Executing '/etc/cmcluster/temip1_TeMIP/temip1_TeMIP.cntl start' for package temip1_TeMIP, as service PKG*29698.
Oct 27 04:17:30 NODE3 cmcld: Service PKG*29698 terminated due to an exit(0).
Oct 27 04:17:30 NODE3 cmcld: Started package temip1_TeMIP on node NODE3.
Oct 27 04:25:16 NODE3 cmcld: Attempting to adjust cluster membership
Oct 27 04:25:16 NODE3 cmcld: Beginning standard partial election
Oct 27 04:25:17 NODE3 cmcld: Resumed updating safety time
Oct 27 04:25:19 NODE3 cmcld: 3 nodes have formed a new cluster, sequence #27
Oct 27 04:25:19 NODE3 cmcld: The new active cluster membership is: NODE2(id=2), NODE3(id=3), NODE1(id=1)

Fabian Briseño · ‎10-28-2009

Hello Fergus.
Was a crash dump created in /var/adm/crash?

If so, you can get HP to analyze the dump, provided you have a support contract.

Knowledge is power.

Raj D. · ‎10-28-2009

Fergus,

- 3:31 to 4:24: It looks like NODE1 froze or was down. Reason unknown.
- 4:24: NODE1 is back up after the reboot.

- 4:15: NODE2 detects that NODE1 may have failed.
- 4:25: Cluster re-formed with all three nodes. Everything is up.
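
The timeline above can be reproduced from the posted syslog extracts. The following is only a rough Python sketch: the event list is copied by hand from the extracts in this thread, the year 2009 is taken from the shutdownlog entry (syslog lines carry no year), and the helper names `parse` and `timeline` are made up for illustration.

```python
from datetime import datetime

# Key events copied by hand from the syslog extracts posted in this thread.
# Assumption: all three nodes keep roughly synchronized clocks.
EVENTS = [
    ("Oct 27 03:31:52", "NODE1", "last entry in OLDsyslog.log"),
    ("Oct 27 04:15:22", "NODE2", "cmcld: Timed out node NODE1"),
    ("Oct 27 04:16:50", "NODE2", "2 nodes have formed a new cluster"),
    ("Oct 27 04:24:23", "NODE1", "syslogd: restart"),
    ("Oct 27 04:25:19", "NODE2", "3 nodes have formed a new cluster"),
]


def parse(stamp, year=2009):
    """Parse a syslog-style timestamp; the year is supplied by the caller."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def timeline(events):
    """Return events sorted by time, each with the gap since the previous one."""
    rows = sorted((parse(stamp), node, msg) for stamp, node, msg in events)
    result = []
    previous = None
    for when, node, msg in rows:
        gap = (when - previous) if previous is not None else None
        result.append((when, node, msg, gap))
        previous = when
    return result


for when, node, msg, gap in timeline(EVENTS):
    gap_text = f"(+{gap})" if gap is not None else ""
    print(f"{when:%H:%M:%S}  {node:<6} {msg} {gap_text}")
```

Read this way, NODE1 logs nothing for about 43 minutes before NODE2 times it out, and it is back in syslog about 9 minutes after that. Whether NODE1 was hung for the whole silent period or simply had nothing to log cannot be decided from these lines alone.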

You need to check that everything is OK on NODE1, and also check the LAN connections on NODE1. It looks like it stopped responding in the cluster and crashed because the safety timer expired:
- Check /var/adm/crash/ for a crash dump.
- Check /var/adm/tombstones/ for any files.
- Check the LAN connections and nettl.LOG00.

Hth,
Raj.

" If u think u can , If u think u cannot , - You are always Right . "

Viveki · ‎10-28-2009

Hi

The safety timer expired error occurs when the heartbeat LAN has some issues. You can check the nettl logs for any possible clue.

Fergus Brophy · ‎10-28-2009

Hi guys,
I have checked the nettl logs. I don't see any issue in there. There are a few entries, but they occur at the time of the system shutdown, so I assume they were generated by the shutdown. I have attached them for you anyway. Is there any other way I can check to see what caused this? There don't seem to be any errors relating to the heartbeat LAN in any of the logs.

-------------------100BT/Gigabit Ethernet LAN/9000 Networking---------------@#%
Timestamp : Tue Oct 27 GMT 2009 04:24:35.948419
Process ID : [ICS] Subsystem : IETHER
User ID ( UID ) : -1 Log Class : ERROR
Device ID : 0 Path ID : 0
Connection ID : 0 Log Instance : 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

<2000> 1000Base-T in path 0/1/2/0
Detected a faulty or disconnected cable.

-------------------100BT/Gigabit Ethernet LAN/9000 Networking---------------@#%
Timestamp : Tue Oct 27 GMT 2009 04:24:36.025734
Process ID : [ICS] Subsystem : IETHER
User ID ( UID ) : -1 Log Class : ERROR
Device ID : 6 Path ID : 0
Connection ID : 0 Log Instance : 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

<2000> 1000Base-T in path 0/4/1/0/4/0
Detected a faulty or disconnected cable.

-------------------100BT/Gigabit Ethernet LAN/9000 Networking---------------@#%
Timestamp : Tue Oct 27 GMT 2009 04:24:37.261042
Process ID : [ICS] Subsystem : IETHER
User ID ( UID ) : -1 Log Class : ERROR
Device ID : 6 Path ID : 0
Connection ID : 0 Log Instance : 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

<2000> 1000Base-T in path 0/4/1/0/4/0
Detected a faulty or disconnected cable.
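
One way to judge whether these entries came before or after the crash is to line their timestamps up against the syslog extracts posted earlier in the thread. The Python sketch below does only that comparison. It assumes the nettl clock (labelled GMT) and the syslog clock are in the same time zone, which the thread does not confirm; the names `parse_nettl`, `parse_syslog` and `classify` are hypothetical helpers, not part of any HP tool.

```python
from datetime import datetime

# Timestamps copied from the nettl entries above.
NETTL_STAMPS = [
    "Tue Oct 27 GMT 2009 04:24:35.948419",
    "Tue Oct 27 GMT 2009 04:24:36.025734",
    "Tue Oct 27 GMT 2009 04:24:37.261042",
]


def parse_nettl(stamp):
    """Parse the timestamp format shown in the formatted nettl output."""
    return datetime.strptime(stamp, "%a %b %d GMT %Y %H:%M:%S.%f")


def parse_syslog(stamp, year=2009):
    """Parse a syslog timestamp; syslog lines carry no year, so one is supplied."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# Reference points taken from the syslog extracts in this thread.
TIMEOUT = parse_syslog("Oct 27 04:15:22")  # NODE2: "Timed out node NODE1"
RESTART = parse_syslog("Oct 27 04:24:23")  # NODE1: "syslogd: restart"


def classify(nettl_stamp, timeout, restart):
    """Place a nettl entry relative to the two syslog reference points.

    Assumption: both logs use the same time zone.
    """
    when = parse_nettl(nettl_stamp)
    if when < timeout:
        return "before NODE1 was timed out"
    if when < restart:
        return "between timeout and syslogd restart"
    return "after syslogd restart"


for stamp in NETTL_STAMPS:
    offset = (parse_nettl(stamp) - RESTART).total_seconds()
    print(f"{stamp}: {classify(stamp, TIMEOUT, RESTART)} ({offset:+.1f}s)")
```

Under that same-time-zone assumption, all three entries fall a few seconds after the syslogd restart on NODE1, which would place them during boot rather than before the failure. If the syslog clock is offset from GMT, the ordering could be different, so this is a consistency check rather than a conclusion.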

Raj D. · ‎10-28-2009

Fergus,

<2000> 1000Base-T in path 0/1/2/0
Detected a faulty or disconnected cable.

-------------------100BT/Gigabit Ethernet LAN/9000 Networking---------------@#%
Timestamp : Tue Oct 27 GMT 2009 04:24:36.025734
Process ID : [ICS] Subsystem : IETHER
User ID ( UID ) : -1 Log Class : ERROR
Device ID : 6 Path ID : 0
Connection ID : 0 Log Instance : 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~

These entries may have been generated during shutdown, so this information may not be very accurate.

Check if there is any CPU timestamp in the /var/adm/tombstones/ts99 file
(maybe the system was brought down due to a bad CPU or a bad PCI card).
- Check the GSP logs to see if anything can be found. You can filter on code 7 (Fatal).

However, regarding "SafetyTimer expired, INIT":
- The "INIT" suggests that the safety timer expiration is most likely due to a software issue and may not be a hardware issue.

- Finally: you can engage HP to analyse the dump in /var/adm/crash/crash.x/

Hth,
Raj.

" If u think u can , If u think u cannot , - You are always Right . "

Viveki · ‎10-28-2009

Hi Fergus,

It is the other way around. The entries are not there because of the system shutdown. The system shut down because of what the entries report: put simply, there was a network disconnection and the heartbeat interface failed.

There won't be any entries in the nettl log because of a shutdown. You can compare against the nettl entries around your previous shutdowns in the old shutdown logs.
