Why did the system go down?

Jerry_109 · 07-19-2006

Hello HP,

A user called to say they could not access the system. I could not ping it, and when I accessed the GSP, the "ctrl Ecf" sequence did not let me in. I then physically checked the server and noticed the power light was off, so I pressed the power button, and the server came up fine. Can you help me troubleshoot what happened? I have gathered as much info as I can. I'm not sure whether someone simply powered off the system. Is there a way to catch a rat if someone is doing this on purpose?
=============================================

root@hohp230[/root]
#uname -a ; model
HP-UX hohp230 B.11.11 U 9000/800 100434656 unlimited-user license
9000/800/L3000-7x

/etc/shutdownlog : ( shows "16:52 Tue May 2, 2006" timestamp )

16:51 Fri Jan 20, 2006. Halt: (by hohp230!root)
13:38 Mon Jan 23, 2006. Halt: (by hohp230!root)
15:31 Tue Jan 24, 2006. Halt: (by hohp230!root)
16:52 Tue May 2, 2006. Halt: (by hohp230!root)

#tail -30 OLDsyslog.log : ( shows "Jul 16 20:03:08" as last entry )

Jul 14 19:03:07 hohp230 sshd[20008]: Accepted publickey for root from 10.254.110.2 port 62051 ssh2
Jul 14 19:03:07 hohp230 sshd[20008]: Connection closed by 10.254.110.2
Jul 14 19:03:07 hohp230 sshd[20008]: Closing connection to 10.254.110.2
Jul 15 19:09:47 hohp230 tftpd[15840]: Timeout (no requests in 10 minutes)
Jul 15 20:12:31 hohp230 tftpd[16909]: Timeout (no requests in 10 minutes)
Jul 15 22:00:09 hohp230 LVM[19184]: lvlnboot -v
Jul 15 22:00:11 hohp230 LVM[19321]: lvlnboot -v /dev/vg00
Jul 15 22:00:15 hohp230 LVM[19721]: lvlnboot -v /dev/vg00
Jul 15 22:00:16 hohp230 LVM[19793]: Volume Group configuration for /dev/vg00 has been saved in /etc/lvmconf/vg00.conf
Jul 15 22:00:16 hohp230 LVM[19793]: /usr/sbin/vgcfgbackup /dev/vg00
Jul 15 22:00:16 hohp230 LVM[19794]: /usr/sbin/vgexport -s -p -m /etc/lvmconf/vg00.mapfile /dev/vg00
Jul 15 22:00:17 hohp230 LVM[19795]: Volume Group configuration for /dev/vg02 has been saved in /etc/lvmconf/vg02.conf
Jul 15 22:00:17 hohp230 LVM[19795]: /usr/sbin/vgcfgbackup /dev/vg02
Jul 15 22:00:17 hohp230 LVM[19802]: /usr/sbin/vgexport -s -p -m /etc/lvmconf/vg02.mapfile /dev/vg02
Jul 15 22:00:19 hohp230 LVM[19803]: Volume Group configuration for /dev/vg01 has been saved in /etc/lvmconf/vg01.conf
Jul 15 22:00:19 hohp230 LVM[19803]: /usr/sbin/vgcfgbackup /dev/vg01
Jul 15 22:00:19 hohp230 LVM[19804]: /usr/sbin/vgexport -s -p -m /etc/lvmconf/vg01.mapfile /dev/vg01
Jul 15 22:00:20 hohp230 LVM[19805]: Volume Group configuration for /dev/vg03 has been saved in /etc/lvmconf/vg03.conf
Jul 15 22:00:20 hohp230 LVM[19805]: /usr/sbin/vgcfgbackup /dev/vg03
Jul 15 22:00:20 hohp230 LVM[19806]: /usr/sbin/vgexport -s -p -m /etc/lvmconf/vg03.mapfile /dev/vg03
Jul 16 00:12:44 hohp230 tftpd[22324]: Timeout (no requests in 10 minutes)
Jul 16 03:09:56 hohp230 tftpd[25485]: Timeout (no requests in 10 minutes)
Jul 16 15:29:00 hohp230 tftpd[9082]: Timeout (no requests in 10 minutes)
Jul 16 20:03:08 hohp230 sshd[14178]: Connection from 10.254.110.2 port 49821
Jul 16 20:03:08 hohp230 sshd[14178]: Failed none for root from 10.254.110.2 port 49821 ssh2
Jul 16 20:03:08 hohp230 sshd[14178]: Found matching RSA key: 52:bb:da:8b:ca:9f:73:34:60:60:b0:d8:8d:bc:b1:ae
Jul 16 20:03:08 hohp230 sshd[14178]: Found matching RSA key: 52:bb:da:8b:ca:9f:73:34:60:60:b0:d8:8d:bc:b1:ae
Jul 16 20:03:08 hohp230 sshd[14178]: Accepted publickey for root from 10.254.110.2 port 49821 ssh2
Jul 16 20:03:08 hohp230 sshd[14178]: Connection closed by 10.254.110.2
hohp230 sshd[14178]: Closing connection to 10.254.110.2

last -R :
root pts/0 airlock.scif.com Wed Jul 19 10:44 still logged in
reboot system boot Wed Jul 19 10:36 still logged in
root pts/0 airlock.scif.com Thu Jul 13 12:27 - 17:20 (04:53)
root pts/2 airlock.scif.com Wed Jul 12 16:07 - 16:13 (00:06)
root pts/1 airlock.scif.com Wed Jul 12 14:49 - 16:22 (01:33)
root pts/0 airlock.scif.com Sat Jul 8 18:38 - 16:48 (3+22:10)
root pts/1 airlock.scif.com Sat Jul 8 17:54 - 18:16 (00:21)

root@hohp230[/etc]
#tail -20 /etc/rc.log.old ( shows timestamp of "Tue May 2 17:18:34 PDT 2006" )
Database "estlab" warm started.
logout

Start CDE login server
Output from "/sbin/rc3.d/S990dtlogin.rc start":
----------------------------
/sbin/rc3.d/S990dtlogin.rc[98]: HP: not found.

Start Galaxy services
Galaxy - Script for starting/stopping Galaxy resident services
USAGE:
Galaxy [-vm ] [-force] start
    Brings up Galaxy services on all configured virtual machines. The optional "-vm" switch can be used to start services on a particular machine only. Galaxy will refuse to start if it detects partially installed patched. In such cases you can either install the latest service pack, or start Galaxy with "-force" option and use QiNetix Update to push patches from the CommServe.
Galaxy [-vm ] stop
    Stops Galaxy services on all configured virtual machines. The optional "-vm" switch can be used to stop services on a particular machine only.
Galaxy [-vm ] restart
    This is the same as "Galaxy stop" followed by "Galaxy start"
Galaxy [-vm ] list
    Lists all running Galaxy services for virtual machine .
Galaxy help
    Displays this help message.
Output from "/sbin/rc3.d/S99Galaxy start":
----------------------------
Cleaning up /opt/galaxy/Base/Temp ...
Starting Galaxy services on hohp230.scif.com ...

**************************************************
HP-UX run-level transition completed
Tue May 2 17:18:34 PDT 2006
**************************************************
mwm: I/O error on display:: :10.0

root@hohp230[/var/adm/sa] ( shows system up during these times )

-rw-r--r-- 1 root sys 7681672 Jul 14 23:55 sa14
-rw-r--r-- 1 root sys 7936032 Jul 15 23:55 sa15
-rw-r--r-- 1 root sys 7936032 Jul 16 23:55 sa16
-rw-r--r-- 1 root sys 7681672 Jul 17 23:55 sa17
-rw-r--r-- 1 root sys 7681672 Jul 18 23:55 sa18
-rw-r--r-- 1 root sys 2950576 Jul 19 11:05 sa19
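One way to narrow the outage window from these files is to look for the largest gap between consecutive sar samples. The following is only a minimal sketch: it assumes that `sar -u -f` prints one sample per line, starting with an HH:MM:SS timestamp followed by numeric columns, and the function name `largest_sample_gap` is made up for illustration.

```shell
# Read sar text output on stdin and report the largest gap between samples.
# Assumption: sample lines look like "HH:MM:SS <number> <number> ...".
# Header and "Average" lines are skipped because their second field is not numeric.
largest_sample_gap() {
    awk '
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $2 ~ /^[0-9.]+$/ {
            split($1, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
            if (seen && now - prev > gap) {
                gap = now - prev
                from = prevstamp
                to = $1
            }
            prev = now
            prevstamp = $1
            seen = 1
        }
        # Prints: <last sample before gap> <first sample after gap> <gap in seconds>
        END { if (gap > 0) print from, to, gap }
    '
}
```

For example, `sar -u -f /var/adm/sa/sa19 | largest_sample_gap` would print the last sample before the gap, the first sample after it, and the gap length in seconds. Samples that span midnight are not handled, so it is meant to be run on one daily file at a time.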

GSP ( SL info for errors ) : time stamp is incorrect ?

Log Entry # 0 :
SYSTEM NAME: hohp230c
DATE: 07/19/2006 TIME: 15:25:59
ALERT LEVEL: 13 = System hang detected via timer popping

SOURCE: 1 = processor
SOURCE DETAIL: 1 = processor general SOURCE ID: 0
PROBLEM DETAIL: 4 = timeout

CALLER ACTIVITY: F = display_activity() update STATUS: 0
CALLER SUBACTIVITY: 00 = implementation dependent
REPORTING ENTITY TYPE: E = HP-UX REPORTING ENTITY ID: 00

0x78E000D41100F000 00000003 00000000 type 15 = Activity Level/Timeout
0x58E008D41100F000 00006A06 130F193B type 11 = Timestamp 07/19/2006 15:25:59
Type CR for next entry, Q CR to quit.

Log Entry # 1 :
SYSTEM NAME: hohp230c
DATE: 03/27/2006 TIME: 23:05:51
ALERT LEVEL: 2 = Non-Urgent operator attention required

SOURCE: 3 = PDH
SOURCE DETAIL: 6 = interconnect medium SOURCE ID: 0
PROBLEM DETAIL: 3 = non-responding, may need GSP reset.

CALLER ACTIVITY: 2 = operation STATUS: 0
CALLER SUBACTIVITY: 02 = platform internal interconnect
REPORTING ENTITY TYPE: 1 = service processor REPORTING ENTITY ID: 00

0x5810082336002020 00006A02 1B170533 type 11 = Timestamp 03/27/2006 23:05:51
Type CR for next entry, - CR for previous entry, Q CR to quit.

Log Entry # 2 :
SYSTEM NAME: hohp230c
DATE: 02/28/2006 TIME: 05:56:52
ALERT LEVEL: 2 = Non-Urgent operator attention required

SOURCE: 3 = PDH
SOURCE DETAIL: 6 = interconnect medium SOURCE ID: 0
PROBLEM DETAIL: 3 = non-responding, may need GSP reset.

CALLER ACTIVITY: 2 = operation STATUS: 0
CALLER SUBACTIVITY: 02 = platform internal interconnect
REPORTING ENTITY TYPE: 1 = service processor REPORTING ENTITY ID: 00

0x5810082336002020 00006A01 1C053834 type 11 = Timestamp 02/28/2006 05:56:52
Type CR for next entry, - CR for previous entry, Q CR to quit.
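
On the timestamp question: the two raw words on each "type 11" line appear to hold the date as hex bytes (years since 1900, zero-based month, then day, hour, minute, second). That layout is only inferred from the three entries above, not taken from HP documentation, so treat this sketch as a cross-check rather than a reference decoder; `decode_gsp_timestamp` is a hypothetical helper name.

```shell
# Decode the two 32-bit words of a GSP "type 11 = Timestamp" log line.
# Assumed layout (inferred from the entries in this post only):
#   word 1: 0000 YY MM   YY = years since 1900, MM = zero-based month
#   word 2: DD hh mm ss  day, hour, minute, second
decode_gsp_timestamp() {
    w1=$1   # e.g. 00006A06
    w2=$2   # e.g. 130F193B
    yy=$(printf '%s' "$w1" | cut -c5-6)
    mo=$(printf '%s' "$w1" | cut -c7-8)
    dd=$(printf '%s' "$w2" | cut -c1-2)
    hh=$(printf '%s' "$w2" | cut -c3-4)
    mi=$(printf '%s' "$w2" | cut -c5-6)
    ss=$(printf '%s' "$w2" | cut -c7-8)
    # Print in the same MM/DD/YYYY hh:mm:ss form the GSP log uses.
    printf '%02d/%02d/%04d %02d:%02d:%02d\n' \
        $((0x$mo + 1)) $((0x$dd)) $((0x$yy + 1900)) \
        $((0x$hh)) $((0x$mi)) $((0x$ss))
}
```

Running `decode_gsp_timestamp 00006A06 130F193B` reproduces the date shown in Log Entry # 0. The helper only decodes what the GSP recorded; it cannot tell whether the GSP clock is set correctly or which time zone it uses.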

Steven E. Protter · 07-19-2006

Shalom,

Check for a system crash dump in /var/adm/crash

If you did not get one, see if savecrash is configured in

/etc/rc.config.d/savecrash

The SAVECRASH variable must be set to 1.
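
A quick way to test for that setting is sketched below. It only checks whether an uncommented `SAVECRASH=1` line is present; the function name `savecrash_explicitly_enabled` is hypothetical, and what the startup script does when the line is commented out is not covered here.

```shell
# Succeed (exit 0) only if the file has an uncommented SAVECRASH=1 line.
# $1: config file to check; defaults to the HP-UX path named in this thread.
savecrash_explicitly_enabled() {
    cfg=${1:-/etc/rc.config.d/savecrash}
    # A leading "#" makes the line a comment, so it must not match.
    grep -q '^[[:space:]]*SAVECRASH=1' "$cfg"
}
```

Typical use would be `savecrash_explicitly_enabled && echo "SAVECRASH is set to 1"`.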

GSP output doesn't tell me much.

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Tim Nelson · 07-19-2006

You are certainly looking at the right places.

Any current info in /var/tombstones ?

I am interested in the " System hang detected via timer popping "

Jerry_109 · 07-19-2006

SAVECRASH was commented out, as shown below. I have now uncommented it:
root@hohp230[/etc/rc.config.d]
#grep SAVECRASH savecrash
# SAVECRASH: Set to 0 to disable saving system crash dumps.
# SAVECRASH=1
# SAVECRASH_DIR:Directory name for system crash dumps. Note: the filesystem
# SAVECRASH_DIR=/var/adm/crash
root@hohp230[/etc/rc.config.d]

#ls -la /var/adm/crash
total 26
drwxr-xr-x 7 root root 1024 Jul 13 13:19 .
drwxr-xr-x 16 adm adm 8192 Jul 19 10:39 ..
drwx------ 4 root sys 1024 Jul 8 16:39 CV
-rwxr-xr-x 1 root root 1 Jul 22 2004 bounds
drwxr-xr-x 2 root root 1024 Jul 8 16:39 crash.0
drwx------ 2 root sys 96 Feb 21 10:56 emc
drwxr-xr-x 2 root root 96 Dec 2 2003 lost+found
dr-xr-xr-x 31 bin bin 1024 Jul 13 13:27 var
root@hohp230[/etc/rc.config.d]
#
root@hohp230[/etc/rc.config.d]
#ls -la /var/adm/crash/var
total 46
dr-xr-xr-x 31 bin bin 1024 Jul 13 13:27 .
drwxr-xr-x 7 root root 1024 Jul 13 13:19 ..
drwxr-xr-x 3 bin bin 96 Jul 13 13:26 X11
drwxr-xr-x 16 adm adm 1024 Jul 13 13:23 adm
drwxr-xr-x 4 bin bin 96 Jul 13 13:26 asx
drwxrwxrwx 3 bin bin 2048 Jul 13 13:27 depot
dr-xr-xr-x 9 bin bin 1024 Jul 13 13:26 dmi
drwxr-xr-x 4 root sys 96 Jul 13 13:26 dt
drwx------ 2 root sys 96 Jul 13 13:26 emc
drwxrwxrwx 5 root other 1024 Jul 13 13:26 emcgrab
drwxr-xr-x 2 root sys 96 Jul 8 2004 empty
drwxrwxrwt 2 bin bin 96 Mar 14 14:35 home
lrwxr-xr-x 1 root sys 13 Jul 13 13:26 ifor -> /var/opt/ifor
drwxrwxrwx 3 root sys 96 Jul 13 13:26 log
drwxr-xr-x 2 root root 96 Jul 13 13:20 lost+found
drwxrwxr-x 2 bin mail 96 Jul 13 13:26 mail
drwxrwxrwx 2 bin bin 96 May 20 2003 news
dr-xr-xr-x 4 bin bin 96 Jul 13 13:26 obam
dr-xr-xr-x 22 bin bin 1024 Jul 13 13:26 opt
dr-xr-xr-x 2 bin bin 96 May 20 2003 parmgr
drwxrwxrwx 2 bin bin 96 May 20 2003 preserve
drwxrwxrwx 2 bin bin 1024 Jul 13 13:26 rbootd
dr-xr-xr-x 2 bin bin 1024 Jul 13 13:26 run
dr-xr-xr-x 9 bin bin 1024 Jul 13 13:26 sam
drwxr-xr-x 13 root sys 1024 Jul 13 13:26 spool
drwxr-xr-x 4 root root 96 Jul 13 13:26 statmon
drwxrwxrwx 5 root other 96 Jul 13 13:26 stm
lrwx------ 1 root sys 19 Jul 13 13:26 symapi -> /usr/emc/API/symapi
drwxrwxrwx 2 bin bin 96 Jul 13 13:26 test
drwxrwxrwx 8 bin bin 8192 Jul 13 13:26 tmp
drwxr-xr-x 2 root root 2048 Jul 13 13:26 tombstones
dr-xr-xr-x 6 bin bin 96 Jul 13 13:26 uucp
drwxr-xr-x 2 bin bin 1024 Jul 13 13:26 yp
root@hohp230[/etc/rc.config.d]

+++++++++++++++++++++++++++++++++++

root@hohp230[/var/tombstones]
#grep -i hang ts99
root@hohp230[/var/tombstones]

Jerry_109 · 07-19-2006

root@hohp230[/var/tombstones]
#egrep -i "hang|timer|popping" ts99
root@hohp230[/var/tombstones]
#grep -i hpmc ts99
------- Processor 0 HPMC Information - PDC Version: 44.12 ------
No HPMC chassis codes logged
----------------- DEW 0 HPMC Information - ------
------- Processor 1 HPMC Information - PDC Version: 44.12 ------
No HPMC chassis codes logged
----------------- DEW 1 HPMC Information - ------
------- Processor 2 HPMC Information - PDC Version: 44.12 ------
No HPMC chassis codes logged
----------------- DEW 2 HPMC Information - ------
------- Processor 3 HPMC Information - PDC Version: 44.12 ------
No HPMC chassis codes logged
----------------- DEW 3 HPMC Information - ------
root@hohp230[/var/tombstones]

Patrick Wallek · 07-19-2006

There is not a whole lot to go on here.

If the system panicked or was shut down with the shutdown command, then there should be a line in /etc/shutdownlog.

The items you have in /var/adm/crash/var are NOT the normal output from a system crash. I'm not sure what those are. It does appear that you had a system crash on Jul 8, though.

Is this system connected to a UPS? If so, check the UPS and see if there were any power interruptions and if the UPS ran its battery down such that your server just lost power.

Mel Burslan · 07-19-2006

This definitely does not look like a panic-induced crash, as there is no panic-related entry in /etc/shutdownlog.

It sounds like either someone did a reset from the console and is not owning up to it, or there was a power interruption.
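
The shutdownlog check can be repeated with a couple of lines of shell. This is a rough sketch: `shutdownlog_summary` is a made-up name, and it assumes that a panic would leave a line containing the word "panic" in the log.

```shell
# Summarize a shutdownlog: show its last non-blank entry and count panic lines.
# $1: log file; defaults to /etc/shutdownlog as discussed in this thread.
shutdownlog_summary() {
    log=${1:-/etc/shutdownlog}
    # The last non-blank line is the most recent logged halt, reboot, or panic.
    echo "last entry: $(grep -v '^[[:space:]]*$' "$log" | tail -n 1)"
    # Case-insensitive count of lines mentioning "panic" (assumed wording).
    echo "panic entries: $(grep -ci 'panic' "$log")"
}
```

If the last entry is much older than the boot time reported by `last -R` and the panic count is 0, the system went down without logging anything, which fits a reset or a loss of power.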

________________________________
UNIX because I majored in cryptology...

Jerry_109 · 07-19-2006

Thanks for all your info. I'm giving up on the search.

sathish kannan · 07-20-2006

Hello Jerry,
If you can't see any errors in /var/adm/syslog/syslog.log, /var/tombstones, or any crash files, it is virtually impossible to find the reason.

Why can't you log a call with HP and let them find out if there are any issues with your hardware?

Regards
Sathish

Don't Think too much

Kent Ostby · 07-20-2006

Check the INTERNAL timestamp of the file /var/tombstones/ts99.

The external timestamp will just show you when the machine rebooted.

If the internal timestamp inside the ts99 file has a July date, then you had a hardware error of some sort, and you should open a call with HP hardware support to troubleshoot it.
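
A simple way to pull out the internal timestamps is sketched below. It assumes the tombstone records its event times on lines containing the word "Timestamp", which is an assumption about the file format rather than something shown in this thread; `tombstone_timestamps` is a hypothetical name.

```shell
# List the distinct lines of a tombstone file that mention a timestamp.
# $1: tombstone file; defaults to the ts99 file discussed above.
tombstone_timestamps() {
    ts_file=${1:-/var/tombstones/ts99}
    # Case-insensitive match, duplicates collapsed (one line per distinct stamp).
    grep -i 'timestamp' "$ts_file" | sort -u
}
```

Compare its output with `ls -l /var/tombstones/ts99`: the `ls` date is the external timestamp, and the lines printed by the function are the internal ones.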

"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"

Steven E. Protter · 07-20-2006

Don't give up.

You have a crash file.

Perform q4 dump analysis on it.

This thread says how to do it:
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1028989

I noticed that this procedure, which I translated from HPUX-ese, is not on my web site. I'll see about getting it up there.

SEP

