FYI: Newly-identified HP-UX 11i v1 kernel bug

mvpel · ‎11-20-2010

Working with the HP WTEC over the past five months or so, we've just this week gotten to the bottom of a perplexing and vexing series of data page fault panics that usually occurred at most once a month, on only a small handful of our systems.

The panics occurred in psl_search with an invalid pointer dereference.

From the initial crash dumps it was possible to determine that the list of memory regions (pregions) in a virtual address space associated with the iomappers table had been corrupted, but it wasn't possible to tell where the corruption had originated: in a third-party driver or in the kernel itself.

Things would hum along fine for a while, minutes, hours, or possibly even days, but once a freed VAS that was still linked into the iomappers list was allocated to something else and reinitialized, the next time the iomappers list was scanned and the pregion skip list for that one VAS was searched, the next link in the list went off to 0x00, resulting in a data page fault panic.

The first round of debug code captured timestamps and the stack trace of the creator of each VAS. All this looked normal and the new VAS was created properly, so it wasn't possible in that crash dump to tell where the corruption originated - only how long ago the VAS which caused the panic had been freed. In the first debugged crash, it was about 3 minutes earlier, so the culprit was long gone.

The next debug code modified the freevas() call and was set up to panic the system whenever an attempt was made to free a VAS that was still part of the iomappers list.
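To illustrate the idea behind that debug check, here is a minimal sketch in C. The structures and the names vas_on_iomappers() and debug_freevas() are hypothetical stand-ins of my own, not the HP-UX kernel's definitions, and the real debug kernel would panic where this sketch simply returns -1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the HP-UX kernel structures. */
struct vas {
    int id;
};

struct iomapper {
    struct vas *vas;          /* address space this entry belongs to */
    struct iomapper *next;
};

/* Returns true if any entry on the iomappers list still refers to vas. */
bool vas_on_iomappers(const struct iomapper *head, const struct vas *vas)
{
    assert(vas != NULL);
    for (const struct iomapper *m = head; m != NULL; m = m->next) {
        if (m->vas == vas)
            return true;
    }
    return false;
}

/*
 * Debug variant of a free routine. Returns -1 (where a debug kernel
 * would panic) if the VAS is still referenced from the list, so the
 * offending caller is caught at the moment of the bad free rather
 * than minutes or days later. Returns 0 when the free is safe.
 */
int debug_freevas(const struct iomapper *head, struct vas *vas)
{
    if (vas_on_iomappers(head, vas))
        return -1;            /* caught: freeing a VAS that is still listed */
    vas->id = -1;             /* stand-in for actually releasing the VAS */
    return 0;
}
```

The point of checking at free time is that the stack trace then belongs to the code path that left the stale entry behind, not to the innocent reader that later trips over it.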

Unfortunately, the first crash after this debug code was provided, about a month later, was on a system where I had overlooked modifying the LIF AUTO file, so during a subsequent reboot it had reverted to the old debug kernel. Needless to say I fixed that problem immediately, grumbling the whole time.

One month and two days later, the debug-induced panic finally happened, and it nailed the bug right to the wall.

WTEC was able to identify two missing lines of code in the back-out from an ENOMEM error leg during the duplication of a VAS in a fork() system call. The duplicated iomapper entry was created during the dupvas procedure, but was not removed during the backout from the error condition.
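For anyone who wants to picture this class of bug, here is a small sketch in C of an error leg that has to undo an earlier list insertion. Everything in it (dup_entry(), iomap_link(), iomap_unlink(), the simulate_enomem flag) is a hypothetical illustration of the pattern, not the actual kernel code or the actual two missing lines.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list entry; the owner field is only for illustration. */
struct iomapper {
    struct iomapper *next;
    int owner;
};

struct iomapper *iomappers;   /* global list head, empty at start */

void iomap_link(struct iomapper *m)
{
    m->next = iomappers;
    iomappers = m;
}

void iomap_unlink(struct iomapper *m)
{
    for (struct iomapper **pp = &iomappers; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == m) {
            *pp = m->next;
            return;
        }
    }
}

/*
 * Duplicate an entry: link it into the global list, then perform a
 * later step that may fail for lack of memory. simulate_enomem forces
 * that failure so the error leg can be exercised.
 */
int dup_entry(int owner, int simulate_enomem, struct iomapper **out)
{
    struct iomapper *m = malloc(sizeof *m);
    if (m == NULL)
        return ENOMEM;
    m->owner = owner;
    iomap_link(m);

    if (simulate_enomem) {
        /* Back-out: without this unlink, the list would keep a
         * pointer to memory that is freed on the next line. */
        iomap_unlink(m);
        free(m);
        return ENOMEM;
    }

    *out = m;
    return 0;
}
```

If the iomap_unlink() call in the error leg is left out, the list still points at freed memory, which matches the symptom described above: nothing goes wrong until that memory is reused and the list is walked again.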

The problem has been referred to the support lab to go through the patch development process.

So, if you have a data page fault panic which tracks back to psl_search, ask about this case.

Another moral of the story - make sure that your system is configured with enough crash dump space, and enough space in /var/adm/crash or wherever you set it in /etc/rc.config.d/savecrash, to store a full crash dump, even if you have to buy more disks. It would have been impossible to find this problem if it weren't for those steps.

Shibin_2 · ‎11-21-2010

Oh man!! We have a few HPUX 11iv1 MCCOE servers scheduled to be patched by Jan. They have been running for 800+ days and have not been patched since 2005.

I hope the patching won't create such hiccups.

Regards
Shibin

mvpel · ‎11-23-2010

I'm actually not sure how long this problem has existed. I'll be curious to find out. It doesn't exist in 11.23 because the code in question was completely rewritten.

Here's the stack trace of a typical crash dump resulting from the iomapper entry which was earlier corrupted:

panic+0xa0
report_trap_or_int_and_panic+0x94
trap+0xef8
thandler+0xd24
psl_search+0x4
findpreg+0x74
iomap_search+0x38
io_map_internal+0x1f8
kernel_iomap+0x38
iomemrw+0x180
iomem_read+0x14
spec_rdwr+0x10c
vno_rw+0x1ac
read+0x184
syscall+0x204
syscallinit+0x55c

The "psl_search" call is searching the "pregion skip list" while searching for an iomapper.
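For readers who haven't met skip lists, here is a generic textbook-style search in C. It is only a sketch under my own assumptions (the node layout, MAX_LEVEL, and the name skip_search() are all made up for illustration) and is not the kernel's psl_search.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVEL 4           /* arbitrary for this sketch */

struct node {
    unsigned long key;
    struct node *forward[MAX_LEVEL];   /* forward[0] is the full list */
};

/*
 * Search a skip list for key. head is a sentinel whose key is unused.
 * Starts at the sparsest level and drops down a level each time the
 * next node would overshoot the key.
 */
struct node *skip_search(struct node *head, unsigned long key)
{
    assert(head != NULL);
    struct node *cur = head;
    for (int lvl = MAX_LEVEL - 1; lvl >= 0; lvl--) {
        while (cur->forward[lvl] != NULL && cur->forward[lvl]->key < key)
            cur = cur->forward[lvl];
    }
    cur = cur->forward[0];
    return (cur != NULL && cur->key == key) ? cur : NULL;
}
```

Note that the loop trusts every forward pointer it follows. If the memory holding the list has been freed and reinitialized for some other use, those pointers are no longer meaningful, and a search like this dereferences whatever happens to be there.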

mvpel · ‎01-11-2011

I'm told the patch release to fix this bug has been scheduled for February 20.

mvpel · ‎01-11-2011

I forgot to mention that another manifestation of this bug is a spinlock deadlock panic on the vas_h_sl_pool, though this will be a bit less common since the VAS hash spinlock pool has 128 locks available and two VAS addresses have to hash to the same spinlock at exactly the right time for this to occur.
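To show what "hash to the same spinlock" means, here is a tiny sketch in C of mapping addresses onto a fixed pool of 128 locks. The hash itself (shift right by 6, then modulo the pool size) is purely my assumption for illustration; I don't know what function the kernel actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_SIZE 128u        /* pool size as mentioned above */

/* Hypothetical hash: drop the low 6 bits, reduce modulo the pool size. */
unsigned pool_index(uintptr_t addr)
{
    return (unsigned)((addr >> 6) % POOL_SIZE);
}

/* True when two addresses would be guarded by the same pool lock. */
bool same_pool_lock(uintptr_t a, uintptr_t b)
{
    return pool_index(a) == pool_index(b);
}
```

With this made-up hash, 0x1000 and 0x3000 both land on slot 64 while 0x1040 lands on slot 65. With 128 slots, any given pair of addresses collides only occasionally, which fits with this form of the panic being the rarer one.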

mvpel · ‎02-28-2011

The patch for this bug is released:

Patch Name: PHKL_41910
Patch Description: s700_800 11.11 IO mapping
Creation Date: 11/02/22
Post Date: 11/02/25
Hardware Platforms - OS Releases:
s700: 11.11
s800: 11.11
Products: N/A
Filesets:
ProgSupport.PAUX-ENG-A-MAN
OS-Core.CORE2-KRN
OS-Core.CORE2-KRN
Automatic Reboot?: Yes
Status: General Release
