
What To Check When Sys Crashes

 
Alex Ferreira
Frequent Advisor

G'day,

I have a little D370 running HP-UX 11.
For some reason the beast decided to crash yesterday. The only thing I could do was Ctrl-B from the terminal and RS.
My question is: when something like this happens, what do you wonderful people check?
I checked the syslogs, rc.logs, messages and crash logs. Unfortunately I have not been able to find out why the system crashed, and I have management asking why.

Thanking anyone for any info...
Geoff Wild
Honored Contributor
Solution

Re: What To Check When Sys Crashes

Anything in /var/adm/crash?

If yes, then you can decode that with q4.

cd /var/adm/crash/crash.0

# /usr/contrib/Q4/bin/q4 -p .

(note the "dot" at the end of the command)

At the q4> prompt, type:


q4> run Analyze AU > ana.out


q4> run WhatHappened -HANG > what.out

NOTE: ctrl-c can interrupt these two commands, which may take several minutes to process.


To exit q4:


q4> exit



Rgds...Geoff
Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.
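
For anyone following along, the first check in the reply above (is there anything in /var/adm/crash?) could be scripted roughly as follows. This is only a sketch: the function name list_crash_dumps is made up for illustration, and it assumes dumps are saved as crash.N directories, as in the crash.0 example.

```shell
# List crash dump directories (crash.0, crash.1, ...) under a crash area.
# Assumption: each saved dump lives in its own crash.N directory.
# list_crash_dumps is a hypothetical helper name, not an HP-UX command.
list_crash_dumps() {
    crash_root="${1:-/var/adm/crash}"
    if [ ! -d "$crash_root" ]; then
        echo "no crash area at $crash_root"
        return 0
    fi
    found=0
    for d in "$crash_root"/crash.*; do
        # Skip plain files and the unexpanded pattern when nothing matches.
        [ -d "$d" ] || continue
        found=1
        echo "$d"
    done
    [ "$found" -eq 1 ] || echo "no crash dumps under $crash_root"
    return 0
}

# Example: list_crash_dumps /var/adm/crash
```

If it prints a crash.N directory, that is the directory to cd into before starting q4 as described above.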
Denver Osborn
Honored Contributor

Re: What To Check When Sys Crashes

If this happens again, rather than RS you need to TC the box. Using TC will create a system dump that could be analyzed using q4 and help determine what caused the box to hang.

Other than the logs you mentioned already, you could try looking at /var/opt/resmon/log/event.log. It's hard to say if you'll find anything.

-denver
Rick Garland
Honored Contributor

Re: What To Check When Sys Crashes

In addition to the q4, maybe some other places to look...

/var/tombstones
/etc/shutdownlog
/var/adm/syslog/syslog.log

The q4 output will provide the majority of the info, but sometimes you can get crash info from some of these other files. If you have no support contract, this could provide other options.
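
A rough way to walk through the locations mentioned in this thread in one pass is sketched below. The function name check_crash_sources is hypothetical; the paths in the example comment are the ones named in the replies, and whether each one exists depends on the system.

```shell
# For each path given: list it if it is a directory, show the tail if it
# is a file, or note that it is missing.
# check_crash_sources is a hypothetical helper name.
check_crash_sources() {
    for p in "$@"; do
        if [ -d "$p" ]; then
            echo "== $p (directory)"
            ls -l "$p"
        elif [ -f "$p" ]; then
            echo "== $p (last 20 lines)"
            tail -n 20 "$p"
        else
            echo "== $p (not present)"
        fi
    done
}

# Example:
# check_crash_sources /var/tombstones /etc/shutdownlog \
#     /var/adm/syslog/syslog.log /var/opt/resmon/log/event.log
```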
Bill Hassell
Honored Contributor

Re: What To Check When Sys Crashes

A 'normal' crash occurs when the kernel discovers corrupted addresses or data elements within itself. The corruption is most commonly due to missing patches or occasionally due to a hardware failure. But in a true kernel crash (called a panic condition) the OS is halted and memory is copied to the dump area, followed by an automatic reboot.

It sounds like this did not happen so the problem was more likely a hang where the system appeared to be dead and unresponsive. Hangs cannot be diagnosed from logs because the OS stopped running or ended up in an endless loop. The only way to diagnose this condition is to use CTRL-B and then TC rather than RS. That will create a memory dump which can be analyzed as to the reason for the hang.

Understanding the memory dump and diagnosing a fix is almost impossible without a lot of OS internals training, so you'll need to hand the dump over to HP for analysis. The alternative is to bring the system up to date on patches.


Bill Hassell, sysadmin
Alex Ferreira
Frequent Advisor

Re: What To Check When Sys Crashes

Gents,

thanks heaps for your responses. I will keep the TC in mind the next time one of my babies misbehaves...

Thumbs up to all that replied..

Alex
Raj D.
Honored Contributor

Re: What To Check When Sys Crashes

Alex,

You also need to check:

1. The /var/tombstones/ts99 file, for valid timestamps.
2. /etc/shutdownlog, for detailed errors such as "panic: ...".
3. If you have it, you can run the HP collector script (collector.sh) with the crashinfo command and send the output file to HP, so they can debug the error and work out exactly what caused the crash.

There are three reasons for a crash: PANIC (caused by the OS), TOC (a Transfer Of Control) and HPMC (hardware related). A q4 analysis can also be useful for checking the cause of the problem, but it is quite lengthy, so the preferred method is to send the dump to HP.
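
As a quick first pass over /etc/shutdownlog, something like the sketch below pulls out any lines that mention a panic. The function name panic_entries is hypothetical, and the exact wording of shutdownlog entries is an assumption here; the match is simply a case-insensitive search for "panic".

```shell
# Print the lines of a shutdown log that mention a panic.
# Assumption: panic reboots are recorded with the word "panic" in the entry.
# panic_entries is a hypothetical helper name.
panic_entries() {
    log="${1:-/etc/shutdownlog}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # grep exits non-zero when nothing matches, so report that case instead.
    grep -i 'panic' "$log" || echo "no panic entries in $log"
}

# Example: panic_entries /etc/shutdownlog
```

No panic entries around the time of the outage would fit with the hang scenario described earlier in the thread rather than a true panic.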

Hth,
Raj.
" If u think u can , If u think u cannot , - You are always Right . "
Alex Ferreira
Frequent Advisor

Re: What To Check When Sys Crashes

Raj,

could you please explain a little more about the collector script, i.e. where can I get it, as we don't have it? Also, what command is crashinfo? I had a look at the man pages but did not find anything on crashinfo.

Or, are you saying to send the crash information over to HP to get it investigated?

Alex
Raj D.
Honored Contributor

Re: What To Check When Sys Crashes

Alex,


" are you saying to send the crash information over to Hp to get it investigated? "

Well, not really. The dump files are big (gigabytes in size), so HP usually sends you the script to run yourself, and you send the small output back to them. That's it.

I don't think I have the script with me right now. It was given to me by HP when I logged a call after one system crashed, leaving the dumps in /var/adm/crash/crash.0/...

They will send you two files: 1. COLLECTION.sh (a big shell script) and 2. crashinfo.zip (a zipped binary file, 876KB), attached herewith.

Procedure: copy the crashinfo binary into /var/adm/crash and run the collection.sh script. It will generate a report in .tz format, containing all the system info and other details including ts99 and ts99.tracefile.txt (the important one). Once you send this report, they will tell you the cause of the crash. Many times it turns out to be a hardware failure as well. If I remember correctly, for a few of my crashes HP replaced a CPU once and a SCSI card once.



Cheers,
Raj.
" If u think u can , If u think u cannot , - You are always Right . "
Alex Ferreira
Frequent Advisor

Re: What To Check When Sys Crashes

Thank you Raj for the reply.

Alex
Raj D.
Honored Contributor

Re: What To Check When Sys Crashes

Alex,

Again, if you run the crashinfo command it will generate a result file that contains the cause; see below. (You need as much free space on /var as the size of the dump file, or it will give an error.)


# cd /var/adm/crash/crash.0
# ./crashinfo -c > crash.0_crashinfo.txt 2>&1

And the result file is here: crash.0_crashinfo.txt
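
Since the space requirement is easy to trip over, a small pre-check along these lines may help before running crashinfo. It is a sketch only: has_room_for_dump is a hypothetical name, and it assumes du -sk and df -Pk behave as specified by POSIX on the system in question.

```shell
# Succeed if the target filesystem has at least as much free space as the
# dump directory occupies.
# Assumption: POSIX-style du -sk and df -Pk output.
# has_room_for_dump is a hypothetical helper name.
has_room_for_dump() {
    dump_dir="$1"
    target="${2:-/var}"
    # Size of the dump directory in KB.
    need_kb=$(du -sk "$dump_dir" | awk '{print $1}')
    # Fourth column of the second df -P line is the available space in KB.
    free_kb=$(df -Pk "$target" | awk 'NR==2 {print $4}')
    echo "dump: ${need_kb} KB, free on $target: ${free_kb} KB"
    [ "$free_kb" -ge "$need_kb" ]
}

# Example:
# has_room_for_dump /var/adm/crash/crash.0 /var || echo "free up space first"
```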


Just pasting a few lines from the crash.0_crashinfo.txt file to make it easy:




=======================================

crashinfo (3.9)
libp4 (8.47): Opening ./vmunix ./INDEX

Loading symbols from ./vmunix
Kernel TEXT pages not requested in crashconf
Will use an artificial mapping from a.out TEXT pages

crashinfo (3.9) output

=====================
= Table Of Contents =
=====================


* General Information
* Crash Events
* Message Buffer
* Memory Globals
* Buffer Cache Globals
* Swap Information
* Global Error Counters / kmem_writes
* Network Interfaces
* IOVA Usage Check
* Crash Event / Processor Information
* Processor Clock Info
* Syswait Array
* Load Averages
* Thread Information
* Kernel Patches

=======================
= General Information =
=======================

Dump time Thu Jul 28 13:13:31 2005 UTC4
System has been up 184 days, 23 hours, 50 minutes.

System Name : HP-UX
Node Name : server20
Model : 9000/800/SD32000
HP-UX version : B.11.11 (64-bit Kernel)
Number of CPU's : 8
Disabled CPU's : 0
CPU type : PCXW+ (875 Mhz)
CPU Architecture : PA-RISC 2.0
Load average : 0.34 0.27 0.26

================
= Crash Events =
================


Note: Crash event 0 was a PANIC !

Panic string :


Note: In the case of a PANIC, normally crash event 0 is the crash
event you should concentrate on. There may well be other secondary
panics (for example spinlock panics) that have happened as a
consequence of the original panic.


Stack Trace for Crash event 0
=============================

============== EVENT ============================
= Event #0 is PANIC on CPU #6
= p crash_event_t 0xabf000
= p rpb_t 0xab8100
= Using pc from pim.wide.rp_rp_hi = 0x233724
============== EVENT ============================
SR5=0x01cdb400
SP RP Return Name
0x400003ffffff1338 0x00233724 panic+0x6c
0x400003ffffff1298 0x00038400 fdc_target_miss_PCXU
0x400003ffffff1248 0x000378b4 fdcache_conditionally+0x90
0x400003ffffff11e8 0x001079c8 checkaccess+0x6d0
0x400003ffffff1098 0x00107be0 hdl_pfault+0x158
0x400003ffffff0f48 0x001808d8 pfault+0x120
0x400003ffffff0e68 0x001678cc trap+0x68c
0x400003ffffff0c78 0x0016a444 thandler+0xd20
+------------- TRAP ----------------------------
| Trap type 7 in USER mode at 0xd973c00.0x800003ffbfe9b8f3 (???)
| p struct save_state 0x1cdb400.0x400003ffffff07a8
+------------- TRAP ----------------------------

Stack Trace for Crash event 0 with all args
===========================================

============== EVENT ============================
= Event #0 is PANIC on CPU #6
= p crash_event_t 0xabf000
= p rpb_t 0xab8100
= Using pc from pim.wide.rp_rp_hi = 0x233724
============== EVENT ============================
SR5=0x01cdb400
SP RP Return Name
0x400003ffffff1338 0x00233724 panic+0x6c
arg0: 0x0000000000b2e1c0
0x400003ffffff1298 0x00038400 fdc_target_miss_PCXU
0x400003ffffff1248 0x000378b4 fdcache_conditionally+0x90
0x400003ffffff11e8 0x001079c8 checkaccess+0x6d0
arg0: 0x0000000087638200
arg1: 0x0000000000000000
arg2: 0x800003ffbfe9b000
arg3: 0x000000000022e5a7
arg4: 0x400003ffffff0fb8
0x400003ffffff1098 0x00107be0 hdl_pfault+0x158
arg0: 0x0000000087638200
arg1: 0x0000000000000000
arg2: 0x000000000d973c00
arg3: 0x800003ffbfe9b000
arg4: 0x400003ffffff0ec0
0x400003ffffff0f48 0x001808d8 pfault+0x120
arg0: 0x0000000000000000
arg1: 0x0000000000000000
arg2: 0x000000000d973c00
arg3: 0x800003ffbfe9b8f3
0x400003ffffff0e68 0x001678cc trap+0x68c
.... --------n/a-------
arg1: 0x400003ffffff07a8
0x400003ffffff0c78 0x0016a444 thandler+0xd20
+------------- TRAP ----------------------------
| Trap type 7 in USER mode at 0xd973c00.0x800003ffbfe9b8f3 (???)
| p struct save_state 0x1cdb400.0x400003ffffff07a8
+------------- TRAP ----------------------------

Stack Traces for all other Crash events
=======================================

============== EVENT ============================
= Event #1 is TOC on CPU #7
= p crash_event_t 0xabf030
= p rpb_t 0x1fcacb0
= Using pc from pim.wide.rp_pcoq_head_hi = 0x288ca0
============== EVENT ============================
SR5=0x0cb75c00
SP RP Return Name
0x400003ffffff0e68 0x00288ca0 check_panic_loop+0x20
0x400003ffffff0e68 0x00167da4 trap+0xb64
0x400003ffffff0c78 0x0016a444 thandler+0xd20
+------------- TRAP ----------------------------
| Trap type 31 in USER mode at 0x7e1ec00.0xc004e20b (???)
| p struct save_state 0xcb75c00.0x400003ffffff07a8
+------------- TRAP ----------------------------


============== EVENT ============================
= Event #2 is TOC on CPU #4
= p crash_event_t 0xabf060
= p rpb_t 0x1fc9810
= Using pc from pim.wide.rp_pcoq_head_hi = 0x288ca0
============== EVENT ============================
SR5=0x0c2f8c00
SP RP Return Name
0x400003ffffff0e68 0x00288ca0 check_panic_loop+0x20
0x400003ffffff0e68 0x00167da4 trap+0xb64
0x400003ffffff0c78 0x0016a444 thandler+0xd20
+------------- TRAP ----------------------------
| Trap type 31 in USER mode at 0x29a400.0xe3941543 (???)
| p struct save_state 0xc2f8c00.0x400003ffffff07a8
+------------- TRAP ----------------------------


============== EVENT ============================
= Event #3 is TOC on CPU #5
= p crash_event_t 0xabf090
= p rpb_t 0x1fc9ef0
= Using pc from pim.wide.rp_pcoq_head_hi = 0x288ca0
============== EVENT ============================
SR5=0x0e307400
SP RP Return Name
0x400003ffffff0e68 0x00288ca0 check_panic_loop+0x20
0x400003ffffff0e68 0x00167da4 trap+0xb64
0x400003ffffff0c78 0x0016a444 thandler+0xd20
+------------- TRAP ----------------------------
| Trap type 31 in USER mode at 0xf623c00.0xeea6fa13 (???)
| p struct save_state 0xe307400.0x400003ffffff07a8
+------------- TRAP ----------------------------


============== EVENT ============================
= Event #4 is TOC on CPU #0
= p crash_event_t 0xabf0c0
= p rpb_t 0xab8e50
= Using pc from pim.wide.rp_pcoq_head_hi = 0x288cb0
============== EVENT ============================
SR5=0x04d97800
SP RP Return Name
0x400003ffffff1588 0x00288cb0 check_panic_loop+0x30
0x400003ffffff1588 0x00167da4 trap+0xb64
0x400003ffffff1398 0x0016a444 thandler+0xd20
+------------- TRAP ----------------------------
| Trap type 31 in KERNEL mode at 0x14ea84 (spluser+0x14)
| p struct save_state 0x4d97800.0x400003ffffff0ec8
+------------- TRAP ----------------------------
SR5=0x04d97800
SP RP Return Name
0x400003ffffff0ec8 0x0014ea84 spluser+0x14
0x400003ffffff0e58 0x0014e25c syscall+0x48c
0x400003ffffff0c78 0x00033f64 syscallinit+0x55c


============== EVENT ============================
= Event #5 is TOC on CPU #1
= p crash_event_t 0xabf0f0
= p rpb_t 0x1fc8370
= Using pc from pim.wide.rp_pcoq_head_hi = 0x288cb0
============== EVENT ============================
SR5=0x033b8000
SP RP Return Name
0x400003ffffff1588 0x00288cb0 check_panic_loop+0x30
0x400003ffffff1588 0x00167da4 trap+0xb64
0x400003ffffff1398 0x0016a444 thandler+0xd20
+------------- TRAP ----------------------------
| Trap type 31 in KERNEL mode at 0x14ea84 (spluser+0x14)
| p struct save_state 0x33b8000.0x400003ffffff0ec8
+------------- TRAP ----------------------------
SR5=0x033b8000
SP RP Return Name
0x400003ffffff0ec8 0x0014ea84 spluser+0x14
0x400003ffffff0e58 0x0014e25c syscall+0x48c
0x400003ffffff0c78 0x00033f64 syscallinit+0x55c


============== EVENT ============================
= Event #6 is TOC on CPU #2
= p crash_event_t 0xabf120
= p rpb_t 0x1fc8a50
= Using pc from pim.wide.rp_pcoq_head_hi = 0x288ca8
============== EVENT ============================
SR5=0x01b60c00
SP RP Return Name
0x400003ffffff0e68 0x00288ca8 check_panic_loop+0x28
0x400003ffffff0e68 0x00167da4 trap+0xb64
0x400003ffffff0c78 0x0016a444 thandler+0xd20
+------------- TRAP ----------------------------
| Trap type 31 in USER mode at 0xab69800.0x800003ffbfe99493 (???)
| p struct save_state 0x1b60c00.0x400003ffffff07a8
+------------- TRAP ----------------------------
......
=========================================
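
The output above is long, so a one-line-per-event summary can make it easier to see which event was the PANIC and which were TOCs. The sketch below assumes the "= Event #N is TYPE on CPU #M" line format shown in this paste; crash_event_summary is a hypothetical helper name, not part of crashinfo.

```shell
# Print each distinct "Event #N is TYPE on CPU #M" line once, in the order
# it first appears in a saved crashinfo report.
# Assumption: event headers look like "= Event #0 is PANIC on CPU #6".
# crash_event_summary is a hypothetical helper name.
crash_event_summary() {
    awk '/^= Event #/ && !seen[$0]++ { sub(/^= /, ""); print }' "$1"
}

# Example: crash_event_summary crash.0_crashinfo.txt
```

On the report pasted above this would show event 0 as the PANIC on CPU #6, followed by the TOC events on the other CPUs.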




Cheers,
Raj.
" If u think u can , If u think u cannot , - You are always Right . "
Stan Sieler
Respected Contributor

Re: What To Check When Sys Crashes

This is odd...I posted a note on this
thread yesterday (hoping one of the
original posters would be notified :)

Today, the ITRC software sent me a note saying
that a reply was posted ... so I restart
MSIE (and FireFox) to display the thread.

Not only do I not see the reply, I don't
see my post either!

(I waited about 4 more hours and viewed the
thread again ... still don't see my post or
the reply!)

So...if the replier doesn't mind, I can be
reached at sieler@allegro.com

thanks!
Stan