Re: help reading crashinfo output

Denver Osborn · 09-19-2007

Can anyone help me analyze this output from crashinfo?

Thanks!
-denver

Tim Nelson · 09-19-2007

This references a Data Page Fault. What does /var/tombstones/ts99 say? It might help to see what happened.

Has this system been up and running continuously (I see only 21 minutes of uptime), or does it keep crashing? Any more details?

Denver Osborn · 09-19-2007

It's crashed 4 times today: data page faults and one spinlock. We only managed to get one dump saved, by booting to LVM maintenance mode and running savecrash. For some reason savecrash said there was no dump to save after the previous three panics, even though the console log confirmed it was written... and yes, savecrash=1 and savecrash_dir=/var/adm/crash are set. :) That's a problem to solve later. Nothing in ts99 either.

I was hoping the crashinfo output would help us understand what triggered the panic. Is it the wrong tool to be using?

-denver

Tim Nelson · 09-19-2007

The tool is OK. We are just waiting for a crash analysis expert to reply.

What type of hardware is this?

Check the MP logs: Ctrl-B, then cm, then sl.

If there are no hardware references in the logs, then this is probably a software panic. Boot to single user and let it sit, boot to init 1 and let it sit, boot to init 2 and let it sit... If you do not get more from the forum, then open a call with HP and have someone look at the crash info. I am sure they can isolate the issue much faster.

Don Morris_1 · 09-19-2007

Important parts:
Type 15: Data TLB Miss Fault/Data Page Fault

Interruption Instruction Register:
IIR = 0x50590000

Interruption Space and Offset Registers:
ISR.IOR = 0x0.0x800040020000

Interruption Instruction Address Queue:
PCSQ.PCOQ = 0x0.0x161994 = kfree+0xd4

Interrupt Instruction at kfree+0xd4:
ldd 0(rp),arg1

Virtual address information:

VA 0x0.0x800040020000 does not have a translation.

+------------- TRAP ----------------------------
| Trap type 15 in KERNEL mode at 0x161994 (kfree+0xd4)
| p struct save_state 0xe428400.0x400003ffffff12e8
+------------- TRAP ----------------------------
SR5=0x0e428400
SP RP Return Name
0x400003ffffff12e8 0x00161994 kfree+0xd4
0x400003ffffff1248 0x00168198 vm_release_structs+0x30
0x400003ffffff11c8 0x0011d318 vx_pagein+0xf8
0x400003ffffff10d8 0x0015ddd8 virtual_fault+0x158
0x400003ffffff0f28 0x00152394 vfault+0x144
0x400003ffffff0e58 0x0015291c trap+0x234
0x400003ffffff0c78 0x001558f0 thandler+0xd24

Ok... so we're in vx_pagein(), paging in a memory mapped file from VxFS. That gets done (succeed or fail) and it goes to free back a VM metadata structure [vm_release_structs()]. That gets freed to the M_PAGEIN arena... and things blow up.
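To make the reading order of that trace explicit, here is a minimal sketch that turns the frame lines into a call chain. The trace text is copied from the output above; the helper names (parse_frames, call_chain) are made up for illustration and are not part of crashinfo or q4.

```python
import re

# Frame lines copied from the crashinfo trace quoted above.
TRACE = """\
0x400003ffffff12e8 0x00161994 kfree+0xd4
0x400003ffffff1248 0x00168198 vm_release_structs+0x30
0x400003ffffff11c8 0x0011d318 vx_pagein+0xf8
0x400003ffffff10d8 0x0015ddd8 virtual_fault+0x158
0x400003ffffff0f28 0x00152394 vfault+0x144
0x400003ffffff0e58 0x0015291c trap+0x234
0x400003ffffff0c78 0x001558f0 thandler+0xd24
"""

# One frame per line: SP, RP, then symbol+offset.
FRAME = re.compile(r"^(0x[0-9a-f]+)\s+(0x[0-9a-f]+)\s+(\w+)\+(0x[0-9a-f]+)$")


def parse_frames(text):
    """Return (name, offset, rp, sp) tuples in the order printed (innermost first)."""
    frames = []
    for line in text.splitlines():
        m = FRAME.match(line.strip())
        if m:
            sp, rp, name, off = m.groups()
            frames.append((name, int(off, 16), int(rp, 16), int(sp, 16)))
    return frames


def call_chain(text):
    """Function names with the oldest caller first and the faulting function last."""
    return [name for name, _, _, _ in reversed(parse_frames(text))]


if __name__ == "__main__":
    print(" -> ".join(call_chain(TRACE)))
```

Read left to right, the chain ends in kfree, which is where the trap was taken; everything before vx_pagein is the ordinary page fault path.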

If you have kmeminfo, kmeminfo -arena M_PAGEIN might be interesting, but in short -- you need detailed dump analysis to try to determine what the corruption was within the arena [kernel memory corruption] and how it might have happened.

I'd recommend contacting Support and providing them with the dump.

Don Morris_1 · 09-19-2007

Oh, and since on re-reading you also wanted more of a "Why is this happening now?" -- I should be more specific: you can't really know just from this output. We can tell from this dump that the arena is corrupted. Usually that's because of a buffer overrun by a memory client or stale pointer usage... but as to exactly why it is happening now, or what sequence in your workload causes it, I really couldn't say from just that information. This is the kernel equivalent of a SIGSEGV in libc on a free() call -- the kernel allocator metadata used to put the object back where it belongs got trashed, so the free path blew up. The DPF alone doesn't tell you enough.
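As a rough illustration of why the fault shows up in the free path rather than at the point of the bad write, here is a toy allocator simulated over a bytearray. Everything in it (ToyArena, the 8-byte header holding a free-list index, the slot sizes) is an assumption made up for the sketch; it is not how the HP-UX arena allocator is laid out.

```python
import struct

HDR = 8        # per-object header: index of the free list the object belongs to
PAYLOAD = 24   # usable bytes per object
SLOT = HDR + PAYLOAD


class ToyArena:
    """Hypothetical allocator that keeps its metadata next to the payload."""

    def __init__(self, nslots, nbuckets=1):
        self.mem = bytearray(SLOT * nslots)
        self.buckets = [[] for _ in range(nbuckets)]
        # Hand out slot 0 first, then slot 1, so neighbours are adjacent.
        self.buckets[0] = [i * SLOT for i in reversed(range(nslots))]

    def alloc(self, bucket=0):
        base = self.buckets[bucket].pop()
        struct.pack_into("<Q", self.mem, base, bucket)
        return base + HDR

    def write(self, ptr, data):
        # No bounds check against PAYLOAD, like a raw memcpy.
        self.mem[ptr:ptr + len(data)] = data

    def free(self, ptr):
        base = ptr - HDR
        (bucket,) = struct.unpack_from("<Q", self.mem, base)
        # A trashed header yields a garbage index, so this lookup fails.
        self.buckets[bucket].append(base)


def demo():
    arena = ToyArena(nslots=4)
    a = arena.alloc()
    b = arena.alloc()
    # The buggy client: writes PAYLOAD + HDR bytes into a, clobbering b's header.
    arena.write(a, b"x" * (PAYLOAD + HDR))
    arena.free(a)          # the culprit frees cleanly
    try:
        arena.free(b)      # the victim is the one that blows up
    except IndexError:
        return "free(b) faulted"
    return "free(b) succeeded"


if __name__ == "__main__":
    print(demo())
```

The overrun through a raises nothing when it happens, and freeing a works. The failure only appears when the innocent neighbour b is freed, which is why the trace at panic time points at the free path and not at whoever did the damage.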

[For a little perspective -- this type of thing happens much more often in development, naturally -- and quite often the only way to track it down is to use an internal tool to force the equivalent of buffer under/overrun protection, etc. on a particular arena. By the time the dump happens, things are usually so mangled that you can't really say who did it].
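The general idea behind overrun protection can be sketched with guard bytes (a "red zone") placed after each object and checked later. This is only a generic illustration of the technique; GuardedArena, the guard pattern and the owner tags are hypothetical and say nothing about how the internal HP tool works.

```python
GUARD = b"\xde\xad\xbe\xef" * 2   # 8-byte red zone pattern (arbitrary choice)
PAYLOAD = 24                      # usable bytes per object


class GuardedArena:
    """Hypothetical arena that places a red zone after every payload."""

    def __init__(self, nslots):
        self.slot = PAYLOAD + len(GUARD)
        self.mem = bytearray(self.slot * nslots)
        self.next = 0
        self.owner = {}   # payload offset -> tag naming who allocated it

    def alloc(self, owner):
        if self.next + self.slot > len(self.mem):
            raise MemoryError("arena exhausted")
        ptr = self.next
        self.next += self.slot
        self.mem[ptr + PAYLOAD:ptr + self.slot] = GUARD
        self.owner[ptr] = owner
        return ptr

    def write(self, ptr, data):
        # No bounds check, like a raw memcpy.
        self.mem[ptr:ptr + len(data)] = data

    def check(self):
        """Return the owners whose trailing red zone has been overwritten."""
        return [who for ptr, who in self.owner.items()
                if self.mem[ptr + PAYLOAD:ptr + self.slot] != GUARD]


def demo():
    arena = GuardedArena(nslots=4)
    a = arena.alloc("client_a")
    arena.alloc("client_b")
    arena.write(a, b"x" * (PAYLOAD + 2))   # two bytes past the end of a
    return arena.check()


if __name__ == "__main__":
    print(demo())
```

Because the damaged red zone belongs to the object that was overrun from, the check names the client that wrote too far instead of the neighbour that later trips over the damage.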

(No points on this if any at all... I should have added this to the prior message).

Sameer_Nirmal · 09-19-2007

It looks like an indirect software panic. I think the first trap 15 in USER mode is a cause of the second trap 15 in kernel mode. Looking at the thread information, the panic seems to have occurred while processes such as 8393, 8394 and 8395 were running, which were spawned by unite_search.pl (through a web server query?).

Just to point out dbc_max_pct = 50%: it doesn't seem to be related to the panic or to be the cause of it. Is there any specific reason for this setting? I hope you know the significance of it, so I won't explain it here.

Don Morris_1 · 09-19-2007

Um, no... a trap 15 in User space causing a page in is very, very normal. Folks call that a virtual fault, after all.

The problem is (as I stated) a corruption in kernel memory causing a bad kernel pointer dereference within the arena structures on the memory free for the VM metadata affiliated with this VxFS pagein. Without the dump, what the corruption looks like or how it happened is pretty much impossible to determine.
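For anyone wanting to cross-check the "bad kernel pointer dereference" against the raw registers, here is a small sketch that pulls the opcode and register fields out of the IIR value from the crashinfo output. The field positions (6-bit major opcode, then two 5-bit register fields) and the partial register-name table are assumptions based on the PA-RISC 2.0 instruction layout and calling convention; the result should be checked against the disassembly crashinfo already printed, ldd 0(rp),arg1.

```python
# Partial register-name table (assumed PA-RISC software conventions).
REG_NAMES = {2: "rp", 25: "arg1", 26: "arg0"}

# Assumed major opcode for the long-displacement doubleword load.
OPCODE_LDD = 0x14


def decode_load(iir):
    """Split a 32-bit instruction word into (opcode, base reg, target reg)."""
    opcode = (iir >> 26) & 0x3F   # top 6 bits
    base = (iir >> 21) & 0x1F     # next 5 bits: base register
    target = (iir >> 16) & 0x1F   # next 5 bits: target register
    return opcode, base, target


def describe(iir):
    opcode, base, target = decode_load(iir)
    mnemonic = "ldd" if opcode == OPCODE_LDD else "op 0x%02x" % opcode
    return "%s via %s into %s" % (
        mnemonic,
        REG_NAMES.get(base, "r%d" % base),
        REG_NAMES.get(target, "r%d" % target),
    )


if __name__ == "__main__":
    print(describe(0x50590000))
```

With those assumptions the fields come out as a doubleword load through rp into arg1, matching the disassembly. Since the displacement is 0 and the faulting address reported in IOR is 0x800040020000, that suggests rp itself held the address with no translation.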

Denver Osborn · 09-20-2007

Unfortunately I don't have a support contract for this box, so I won't have the dump analyzed by HP.

For now, the system was swapped out with another one and it's been stable. However, I don't feel this was h/w related, and the panic could happen again.

I ran q4 to generate the ana.out, trace.out and what.out files. If any of that would be of use, please let me know.

-denver
