David Trusty
Frequent Advisor

repeated kernel panic "Trap Type 15 (Data page fault)"

 
23 REPLIES
Eugeny Brychkov
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

David,
Reboot your server and get to the PDC prompt. Check the PIM (processor internal memory) output for a valid timestamp and logged error codes, and check the MDT (memory deallocation table) for any logged memory module errors. If you find anything there, call HP.
If not, try patching your system with the latest GR bundle first and then the latest HW enablement bundle from the SAME support CD (so that they have the same release date). Also check whether the installed FC driver is the latest (you have B.11.11.06).
You may also use STM (Support Tools Manager) to check the CPUs and memory and to view its logs.
Eugeny
Bill Hassell
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

The stack trace looks like a problem with NFS. I would make sure all the NFS patches are up to date.


Bill Hassell, sysadmin
Steven E. Protter
Exalted Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Besides the NFS patches, save yourself some time and get the latest Quality Pack (or whatever they call it) installed.

There is another patch that may not be related but that you should have anyway, because its absence caused kernel panics for me when reading large files from Ultrium tape drives.

PHKL_27753

It was not in the last Certified Bundle I installed which was July or September.

If you have a support contract, they'll always be happy to analyze q4 output and give you precise answers. No matter how smart(assed) I think I am, I'd never want to run systems without a support contract. It's the best money you can ever invest. See my thread on great support stories.
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
David Trusty
Frequent Advisor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Well, I tried applying the latest Quality
and Hardware Enablement patch
bundles, as well as all the NFS
critical patches.

It still keeps happening.

Any ideas?

Eugeny Brychkov
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Check memory with STM (Support Tools Manager) to see whether any single-bit errors are logged or memory has been deallocated, and check the latest tombstones (/var/adm/tombstones/ts99).
Eugeny
Steven E. Protter
Exalted Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

You tried PHKL_27753 ?

Wow.

All the NFS stuff too?

Check dmesg; certain hardware faults can trigger a panic. I had a disk trigger what looked like I/O software panics a while back.

Also, get support's dump team to read the q4 dump. Those guys are wizards.

Steve
David Trusty
Frequent Advisor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Thanks for all the replies!!!

There are no memory errors in the
tombstones file.

I searched the individual patches for
"data page fault" and have applied
a few more.

I had previously applied PHKL_27753.

I believe I have a later version of the
cumulative streams patch.

Here is the total list of the individual
patches I have applied over the last
few days:

# BUNDLE B.11.00 Patch Bundle
BUNDLE.PHKL_23225 1.0 Fix for dqput() data page fault panic
BUNDLE.PHCO_24777 1.0 mountall cumulative patch.
BUNDLE.PHNE_28089 1.0 cumulative ARPA Transport patch
BUNDLE.PHNE_27703 1.0 Cumulative STREAMS Patch
BUNDLE.PHNE_27218 1.0 ONC/NFS General Release/Performance Patch
BUNDLE.PHNE_25388 1.0 LAN product cumulative patch
BUNDLE.PHNE_24403 1.0 HP-PB 100Base-T.
BUNDLE.PHKL_28267 1.0 thread perf, user limit, cumulative VM
BUNDLE.PHKL_28096 1.0 SCSI IO Cumulative Patch
BUNDLE.PHKL_27830 1.0 VxFS cumulative;VxFS 3-way deadlock;sendfile
BUNDLE.PHKL_27825 1.0 Cumulative VM,Psets,Preemption,PRM
BUNDLE.PHKL_27766 1.0 early boot,Psets,vPar,Xserver,T600 HPMC KRNG
BUNDLE.PHKL_27753 1.0 audit subsystem cumulative patch
BUNDLE.PHKL_27751 1.0 Fibre Channel Mass Storage Patch
BUNDLE.PHKL_27682 1.0 diag0 cumulative patch.
BUNDLE.PHKL_27317 1.0 detach; NOSTOP, Abort; Psets; slpq1 perf
BUNDLE.PHKL_27304 1.0 SCSI Tape (stape) cumulative
BUNDLE.PHKL_27179 1.0 Corrected reference to thread register state
BUNDLE.PHKL_27172 1.0 vPars panic; Syscall cumulative
BUNDLE.PHKL_27152 1.0 I/O Cumulative, PA 8700 2.2, vPar, PCI-X
BUNDLE.PHKL_27096 1.0 VxVM,EMC,Psets&vPar,slpq1,earlyKRS
BUNDLE.PHKL_27094 1.0 Psets Enablement Patch, slpq1 perf
BUNDLE.PHKL_27091 1.0 Core PM, vPar, Psets Cumulative, slpq1 perf
BUNDLE.PHKL_27025 1.0 SCSI Ultra160 Driver with OLAR support
BUNDLE.PHKL_26698 1.0 umount-mkfs panic; HFS mount/umount perf
BUNDLE.PHKL_26032 1.0 New audio h/w support + cumulative fixes
BUNDLE.PHKL_25729 1.0 signals,threads enhancement,Psets Enablement
BUNDLE.PHKL_25602 1.0 Fix panic in ccio_alloc_shared_mem
BUNDLE.PHKL_25233 1.0 select(2) and poll(2) hang
BUNDLE.PHKL_24507 1.0 fix for data page fault in pstat_getstream()
BUNDLE.PHKL_24343 1.0 Data Page Fault panic in DNLC
BUNDLE.PHKL_23957 1.0 Boot panic (w/Fiber Ch. & Gig. Ethernet) fix

I just added these:
BUNDLE.PHNE_28089 1.0 cumulative ARPA Transport patch
BUNDLE.PHKL_28267 1.0 thread perf, user limit, cumulative VM

Is there anything left to try before
sending the dump to support?
James Murtagh
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Hi David,

If you have quite a few dumps you can narrow it down a bit. Using q4, trace the crash event for about three dumps, i.e.:

q4> trace event 0

If the stack traces are all different, then you are probably looking at hardware. If you run the following q4 commands on each of the dumps, you may see a pattern on a particular processor:

q4> load mpinfo_t from mpproc_info max nmpinfo
q4> trace pile
processor 0 was running process at 0x580f00 (pid 794)
stack trace for event 1
crash event was a panic
.....
.....

processor 1 claims to be idle
stack trace for event 2
crash event was a TOC

The panic on one processor will send a TOC to the others. If you see a pattern, then most likely one of the registers on that processor is on the way out. If not, log a call and we can examine the dump in greater detail.
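
A rough sketch of that per-processor check, in Python. It assumes the "trace pile" output of each dump has been saved as text in the format shown above, and it simply counts which processor logged the panic (rather than a TOC) in each dump. The function names and the sample text are illustrative only; they are not part of q4.

```python
import re
from collections import Counter

# Matches lines such as "processor 0 was running process at ..." or
# "processor 1 claims to be idle".
PROC_RE = re.compile(r"^processor (\d+)\b")


def panicking_processor(trace_pile_output):
    """Return the number of the processor whose crash event was a panic.

    Returns None if no panic event is found in the text.
    """
    current = None
    for line in trace_pile_output.splitlines():
        line = line.strip()
        match = PROC_RE.match(line)
        if match:
            current = int(match.group(1))
        elif line == "crash event was a panic":
            return current
    return None


def tally(outputs):
    """Count how often each processor was the one that panicked."""
    return Counter(panicking_processor(text) for text in outputs)


# Sample text copied from the output format above (elided lines left out).
sample = """\
processor 0 was running process at 0x580f00 (pid 794)
stack trace for event 1
crash event was a panic
processor 1 claims to be idle
stack trace for event 2
crash event was a TOC
"""

print(tally([sample, sample]))
```

If one processor number dominates the tally across several dumps, that is the pattern to look for; an even spread points away from a single failing CPU.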

Regards,

James.
David Trusty
Frequent Advisor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

I ran q4 on two of the latest dumps.

They are always at the same point:

q4> trace event 0
stack trace for event 0
crash event was a panic
panic+0x6c
report_trap_or_int_and_panic+0x94
trap+0xedc
nokgdb+0x8
bcopy_pcxu_method+0x4
xdrmblk_getbytes+0x5c
xdr_opaque+0x78
xdr_bytes+0xd4
xdr_READ3resok+0x74
xdr_READ3res+0x38
xdr_replymsg+0xd4
clnt_clts_kcallit_addr+0x570
clnt_clts_kcallit+0x28
rfscall+0x27c
rfs3call+0x78
nfs3read+0x100
nfs3_do_bio+0x108
async_daemon+0x4ec
coerce_scall_args+0xe0
syscall+0x204
$syscallrtn+0x0

Maybe this is just a bug in NFS version 3.
Does it look that way to you?
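
The comparison done by eye above can be sketched in Python, assuming the "trace event 0" output of each dump has been saved as text. The sketch keeps only the function names (dropping the +0x offsets) and checks whether every dump walks through the same call path. The helper names are hypothetical, and the sample is a shortened copy of the trace above.

```python
import re

# A frame line looks like "symbol+0xOFFSET", e.g. "nfs3read+0x100".
FRAME_RE = re.compile(r"^([A-Za-z_$][\w$]*)\+0x([0-9a-fA-F]+)$")


def frames(trace_text):
    """Return the function names of a q4 stack trace, top of stack first."""
    names = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:  # skips headers such as "crash event was a panic"
            names.append(match.group(1))
    return names


def same_call_path(traces):
    """True if every trace contains the same functions in the same order."""
    return len({tuple(frames(text)) for text in traces}) == 1


dump_a = """\
stack trace for event 0
crash event was a panic
panic+0x6c
report_trap_or_int_and_panic+0x94
trap+0xedc
nokgdb+0x8
bcopy_pcxu_method+0x4
xdrmblk_getbytes+0x5c
"""
dump_b = dump_a  # stand-in for the saved output of a second dump

print(same_call_path([dump_a, dump_b]))
```

Identical paths across dumps suggest a repeatable software path rather than random hardware corruption, which matches the reasoning earlier in the thread.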

melvyn burnard
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

As you are having multiple panics, I would strongly recommend you get the dumps analysed properly by your local HP Response Centre.
You may have a corner case, or have found a new bug -}
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
James Murtagh
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Hi David,

I found an almost exact match for your panic. However, a successor to the patch that fixed it is already on your system, so you should probably get the dump in to confirm.

Just one final question though....what kind of server(s) are you connecting with via NFS?

regards,

James.
David Trusty
Frequent Advisor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

We are connecting to Linux servers via NFS.

What patch was the one
which looked closest to the
symptoms?


James Murtagh
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Hi David,

Patch PHNE_23502 was the one that fixed the previous issue and the latest successor is:

PHNE_27218 1.0 ONC/NFS General Release/Performance Patch

which you already have installed. I imagine the full trace was not put in the patch text because it is so large; however, it was the same stack trace. See JAGad35150 in the patch text for a description.

If you are connecting to a Linux system it will probably be a new issue so I imagine the labs will be keen to see the dump.

regards,

James.
Bill Hassell
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

As a suggestion: NFS version 3 is the new kid on the block. Set up your server to not use NFS version 3 for now and see if the problem disappears.


Bill Hassell, sysadmin
David Trusty
Frequent Advisor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

I tried going back to NFS version 2.

What is happening now is
a lockup in the processes
which read from NFS. They
are permanently sleeping.

I also saw some of this
happen with version 3, but
was more concerned with the
panics.

How can I diagnose these
processes which are hung
(apparently in NFS reads)?

Please help. These NFS
problems are real show-stoppers.

Thanks,

David
James Murtagh
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Hi David,

I see what you mean. Sorry to ask so many questions when you are trying to solve this one but....

Have you tried using UDP?
Are there any errors in the message buffer/syslog?
Does this only happen when the client/server is Linux?
You said this happened with version 3 too - were the symptoms as severe?

I suppose you could try tracing the processes in q4 too....

# ied q4 /stand/vmunix /dev/kmem
q4> load proc_t from proc_list next p_factp max nproc
q4> keep p_pid ==
q4> trace pile

Also, I'm sure you know about this document on NFS performance tuning, but here is the link anyway.

http://www.docs.hp.com/hpux/onlinedocs/1435/NFSPerformanceTuninginHP-UX11.0and11iSystems.pdf

regards,

James.
David Trusty
Frequent Advisor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Thanks for helping me with this.

Here is what I know at this time.
The problem may not be due to
NFS, but rather to TCP. The TCP
connection which appears to be
hung, is between two processes
which send backup data to tape. It
is not an NFS-TCP connection.

I've tried using UDP for the NFS
connections using version 3, but
that still took the panic, so I changed
the version back to 2. All the open
NFS files on the system now are
accessed by UDP connections.

There are no error messages in
the syslog. This is so strange.


Here is what I get from the q4
analysis of both the 'read' and the
'write' processes:

q4> keep p_pid == 3549
kept 1 of 181 proc_t's, discarded 180
q4> trace pile
stack trace for process at 0x0`4c2b5040 (pid 3549), thread at 0x0`4c2b6040 (tid 3718)
process was not running on any processor
_swtch+0xc4
_sleep+0x4e0
read_sleep+0x1b8
hpstreams_read_int+0x1e8
streams_read_uio+0x28
soreceive+0x3ec
soo_rw+0x40
read+0x10c
syscall+0x204
$syscallrtn+0x0
q4> load proc_t from proc_list next p_factp max nproc
loaded 180 proc_ts as a linked list (stopped by null pointer)
q4> keep p_pid == 3550
kept 1 of 180 proc_t's, discarded 179
q4> trace pile
stack trace for process at 0x0`5022f040 (pid 3550), thread at 0x0`4c2cb040 (tid 3719)
process was not running on any processor
_swtch+0xc4
_sleep+0x4e0
write_sleep+0x184
streams_write_uio+0x3bc
sosend+0x4d4
soo_rw+0x80
write+0x108
syscall+0x204
$syscallrtn+0x0

Is there anything else I can gather?



James Murtagh
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Hi David,

Not sure if I can be much help, not exactly my area of expertise! Suppose we should look at the streams subsystem for any errors though....

In the directory /var/adm/streams there is an error log named in the form error.dd.mm; are there any errors reported? Also see the streams binaries in /usr/bin:

# ll /usr/bin/str*
-r-xr-xr-x 1 bin bin 16384 Nov 14 2000 /usr/bin/strace
-r-xr-xr-x 1 bin bin 20480 Nov 14 2000 /usr/bin/strchg
-r-xr-xr-x 1 bin bin 16384 Nov 14 2000 /usr/bin/strclean
-r-xr-xr-x 1 bin bin 16384 Nov 14 2000 /usr/bin/strconf
-r-xr-xr-x 1 bin sys 118784 Nov 14 2000 /usr/bin/strdb
-r-xr-xr-x 1 bin bin 16384 Nov 14 2000 /usr/bin/strerr
-r-xr-xr-x 1 bin bin 20480 Nov 14 2000 /usr/bin/strvf

strvf verifies the streams installation, the others have manpages that describe their use.

Although I'm not sure this is the issue, I'm reading some streams internals notes and will see what I can find.

Regards,

James.
David Trusty
Frequent Advisor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Thanks for helping. I really appreciate
it!

There are no error messages in
the streams log directory.

With netstat I can see some "packet"
statistics occasionally increase on the
loopback interface (lo0), but I have
not found any way to trace contents
of the packets. Tcpdump complains
"no such device /dev/lo0" when I
try to trace them.

I think you are on the right track with
looking at the streams queues.
If we can also see the tcp packets
on lo0, then perhaps we can narrow
it down a bit...
James Murtagh
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Hi David,

Attached are some tools written by HP labs. Hopefully I won't be in trouble for posting these! :) Could you run each one and attach the output?

The three tools are:

strshow
tcpipstreams
crashinfo

The first two should be obvious; the third gives a general view of the system. It is normally used on dumps (you can run it on your dumps by changing directory to the dump dir and executing it). These are normally put in /usr/contrib/bin and can be run without any options.

Regards,

James.
James Murtagh
Honored Contributor

Re: repeated kernel panic "Trap Type 15 (Data page fault)"

Hi David,

Just realised you can't attach binaries.....personal email is james@carrera-blue.com if you want them.

Cheers,

James.