
How does the kernel detect a bus check HPMC?

SOLVED
mvpel
Trusted Contributor

How does the kernel detect a bus check HPMC?

We've been seeing multiple HPMC panics lately, and each time the tombstone file points to a different, seemingly random IO chassis and slot, each of which holds a third-party adapter card with either a static or dynamic-module driver.

So this got me wondering how the kernel detects a bus check HPMC - what is the HPMC handler watching for?
25 REPLIES
cnb
Honored Contributor
Solution

Re: How does the kernel detect a bus check HPMC?

Not sure if this is what you're looking for, but it is detected through firmware:

http://www.dectrader.com/docs/set3/emr_na-c01037168-2.pdf

"A hardware crash event can be High Priority Machine Check (HPMC), Low Priority Machine Check (LPMC) or Transfer of Control (TOC). The machine checks are typically caused by hardware malfunctions or certain classes of bus errors. TOC on the other hand is usually initiated by the operator in response to system software being stuck in an error state.

When a hardware crash event occurs, the processor immediately branch to PDC entry point; PDCE_CHECK for HPMC and LPMC faults, and PDCE_TOC for TOC. *The implementation details of these PDC entry points are processor dependant.* Fundamentally they save the processor's state (general, control, space and interruption registers) into Processor Internal Memory (PIM). The processor then vectors back into the operating system entry points; HPMC_Vector or TOC_Vector. These entry points are defined in the IVA (Interruption Vector Table) and MEM_TOC in Page Zero respectively.
On entry into the kernel, a crash event entry is created. The operating system makes a pdc call (PDC_PIM) to read the processor's state information from PIM into a Restart Parameter Block (RPB). As such the RPB structure contains information pertinent to the understanding of the crash. For example, the Program Counter (PC) in the RPB would indicate what routine was executing at the time of HPMC/TOC event. Once the state has been saved, the operating system continues to dump physical memory to the dump device."

http://book.soundonair.ru/hall2/ch10lev1sec3.html

http://sequoia.ict.pwr.wroc.pl/~iro/RISC/sm/www.hp.com/acd-38.html

http://ftp.parisc-linux.org/docs/arch/pa11_acd.pdf

http://ftp.parisc-linux.org/docs/arch/parisc2.0.pdf


Rgds,

Matti_Kurkela
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

As indicated by the documents linked by cnb, the HPMC is a hardware-level event.

I'll try to reduce the jargon level of cnb's quote a bit:

The HPMC handler is not "watching for" anything: the actual hardware chips on the system board do the watching. They check the ECC bits on any data that is being read from RAM, and also watch for similar data transmission error detection signals on other system buses.

If an error is detected, the piece of hardware activates a signal that triggers a "Group 1 interrupt" on the CPU(s). On the PA-RISC architecture, this is the most serious interrupt signal that exists. On Itanium, the terminology may be different but the event is equally severe.

The interrupt signal makes the CPU immediately stop what it's doing and mark its place in the CPU's internal registers designed for that purpose, and then check for instructions in a memory address defined at CPU design time. (Think of it as like a pre-arranged location for storing a building's evacuation and disaster recovery plans, as might be required in large office buildings by the National Building Code.)

When the system was booted up, the firmware initialized that memory address to point to the HPMC handler routine. First, the firmware HPMC handler does whatever model-specific things are required to get the error information from the system board chips and store it in standard format in the location where the kernel expects to find it.

Then the firmware checks a table of jump vectors the kernel has prepared in advance: "In case HPMC, TOC or other major sh*t happens". In the case of HPMC, this tells the firmware to run the kernel's HPMC handler. The HPMC handler will then output a message to the system console and execute the system panic procedures.
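To make the hand-off concrete, here is a toy C model of the sequence described above: hardware raises the check, firmware saves the processor state, then firmware calls whatever handler the kernel registered at boot. Every name in it (`cpu_state`, `register_hpmc_vector`, `firmware_check_entry` and so on) is made up for illustration; none of this is a real PDC or HP-UX interface, and real firmware does far more than a `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy model only: all names are hypothetical, not real PDC or HP-UX APIs. */
typedef struct {
    unsigned long pc;       /* program counter when the check hit */
    unsigned long gr[32];   /* general registers */
} cpu_state;

typedef void (*os_vector)(const cpu_state *saved);

static cpu_state pim;           /* stands in for Processor Internal Memory */
static os_vector hpmc_vector;   /* set by the "kernel" at boot */
static int kernel_handler_ran;
static unsigned long reported_pc;

/* "Kernel" side: registers its handler in the jump-vector table. */
static void register_hpmc_vector(os_vector v)
{
    hpmc_vector = v;
}

/* "Kernel" side: reads the saved state and starts panic processing. */
static void kernel_hpmc_handler(const cpu_state *saved)
{
    kernel_handler_ran = 1;
    reported_pc = saved->pc;
    printf("HPMC: pc=0x%lx, starting panic and dump\n", saved->pc);
}

/* "Firmware" side: entered when the hardware raises the machine check. */
static void firmware_check_entry(const cpu_state *live)
{
    assert(live != NULL);
    memcpy(&pim, live, sizeof pim);   /* 1. save processor state */
    if (hpmc_vector != NULL)          /* 2. hand off to the OS vector */
        hpmc_vector(&pim);
}

/* Hardware detects the error and forces the CPU into firmware. */
static void hardware_raises_hpmc(const cpu_state *live)
{
    firmware_check_entry(live);
}
```

Registering `kernel_hpmc_handler` once and then calling `hardware_raises_hpmc()` with a populated `cpu_state` walks the whole path. The point of the model is that no software is polling for anything: the kernel handler only runs because hardware and firmware called into it.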

MK
Laurent Menase
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

Also, even though an HPMC is a hardware-triggered panic, the root cause may be hardware or software:
1) it may be a timing problem with the interfaces,
2) it can also be a driver problem, where the driver accesses an address it should not, and that access causes the HPMC.

mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

Laurent, thanks for that insight about a possible software cause for an HPMC. Would something along those lines wind up being identified as a "bus check" HPMC?

This possibility definitely lends weight to my suspicion that a driver problem is at the root of the issue we've been having, since we've actually been able to reproduce a "Bus Check" HPMC on a different system, which suggests something other than a hardware problem.

The HPMCs we've seen will point to seemingly random LBA numbers, all of which have been third-party cards with either dynamic or static driver modules.

This has been extremely baffling for quite a while since we were always focused on some sort of hardware issue, and we had a lot of trouble reproducing or characterizing it.
Ismail Azad
Esteemed Contributor

Re: How does the kernel detect a bus check HPMC?

Hi mvpel,

> Simplified answer!

The kernel has its own view of memory, called the virtual address space. Virtual addresses have to be mapped to physical addresses, and the TLB caches those mappings. The TLB has a finite number of entries, so a TLB miss can occur. When a device tries to access a physical address that is not actually there, it triggers a high priority machine check, otherwise called a machine check abort on other architectures.

Regards
Ismail Azad
Read, read and read... Then read again until you read "between the lines".....
cnb
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

Hi,

Yes, they can have software causes, as per the crashinfo statement:


"Note: This appears to be a BUS check hpmc. BUS Checks are often caused by
hardware problems, but there are many software causes as well.
To progress a BUS Check HPMC you will normally need to obtain the
hardware TOMBSTONE and analyse it."

Rgds,
mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

Ismail - so it's possible for software to manipulate the TLB directly, rather than dealing only in virtual addresses and leaving the translation to someone else?
Ismail Azad
Esteemed Contributor

Re: How does the kernel detect a bus check HPMC?

Hi,

Reading your first post again, I understand that you have been experiencing multiple HPMC panics, and you could well have similar kernel configurations on the various servers (if you've used Ignite). The root cause could be a parameter that controls PCI recovery. Please check the value of pci_eh_enable, as this MIGHT be the cause of your problem. Since you were talking about "kernel detection": pci_eh_enable is the parameter that can cause this disaster in most cases if configured wrongly.

Regards
Ismail Azad
Read, read and read... Then read again until you read "between the lines".....
mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

These systems are 11i v1, so it looks like that tunable is not implemented - it doesn't show up with kmtune or adb queries.
mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

CNB: Yeah, I saw that wording in the crashinfo output, but whenever we ran the ts99 through MCA it spit out some rope number with the replacement recommendations, and HP support stopped there.

But after the second hardware replacement, we got to wondering about software causes, so I wanted to get a better handle on how things work down at this level so we can better characterize the problem.
cnb
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

Note: The PCI EH functionality is not supported on HP-UX 11i v1 OS.

http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c02542174/c02542174.pdf

Rgds,

cnb
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

Yep.

I've seen *many* HPMC crashes resolved with O/S patches, Drivers and Firmware so don't overlook these when troubleshooting BUS_CHECK HPMC issues.


Rgds,
mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

Thanks for suggesting that! After a bit of rummaging, I found this in PHNE_27400:

----
( SR:8606287203 CR:JAGae51142 )
In some situations, the driver was posting an incorrect buffer address to the card causing a HPMC.

Resolution:
Driver has been modified to handle this case correctly.
----

So we'll start pulling on this thread and see where it leads us.

This description also helps me visualize how a driver could cause an HPMC, which I wasn't very clear on with all the emphasis on hardware.
Laurent Menase
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

you probably mean PHNE_28799

it is possible, but if it reoccurs you'll probably need to have the crash dump analyzed.
Dennis Handly
Acclaimed Contributor

Re: How does the kernel detect a bus check HPMC?

>so it's possible for software to manipulate the TLB directly

Only the kernel can do this with privileged instructions.
mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

We do have a crash dump, finally, and I've been doing a bit of digging. We'll see what turns up.
mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

Not much turned up, unfortunately. The crash dump suggested a hardware problem, as does the TS99 file, but on further investigation it turns out there's been a number of other instances of this type of crash on a variety of different systems, and it seems unlikely that all of them have a hardware problem.
Laurent Menase
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

So we are back to needing a true crash dump analysis.

Are the panic stacks all the same?
mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

We finally peeled enough of the onion to determine that the HPMC is caused by a reference that falls outside the range of the card's register space, but inside a page of shared memory that the driver maps to the card's registers.

Per the PCI 2.1 spec, a card is not required to assert the device-select signal (DEVSEL#) for PCI bus addresses outside its accepted range of registers, and the cards in question apparently don't.

We were able to reproduce the HPMC on Monday by poking the card's mapped IO memory in the wrong place.
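For anyone trying to picture the failure, here is a rough C model of the bus side of it. It assumes a card that decodes only `reg_bytes` of registers inside a larger mapped window; a read that no device claims ends in a master abort, which this sketch treats as the event that gets escalated to the HPMC. The struct, the function names and the sizes are all hypothetical, not taken from any real card or driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; names and sizes are illustrative only. */
typedef struct {
    uint32_t bar_base;    /* bus address where the card's window starts */
    uint32_t reg_bytes;   /* bytes of registers the card actually decodes */
} pci_card;

enum bus_result { BUS_OK, BUS_MASTER_ABORT };

/* The card claims a cycle (asserts DEVSEL#) only for addresses it decodes. */
static bool card_claims(const pci_card *c, uint32_t addr)
{
    return addr >= c->bar_base && addr - c->bar_base < c->reg_bytes;
}

/* One read cycle: an unclaimed address ends in a master abort. */
static enum bus_result bus_read(const pci_card *c, uint32_t addr)
{
    assert(c != NULL);
    return card_claims(c, addr) ? BUS_OK : BUS_MASTER_ABORT;
}

/* Defensive check a driver could make before touching a mapped offset. */
static bool offset_is_safe(const pci_card *c, uint32_t offset)
{
    return offset < c->reg_bytes;
}
```

With a card that decodes 256 bytes, a read at offset 0x10 is claimed while a read at offset 0x800 of the same mapped page is not, even though both look equally valid from the virtual memory side. A check like `offset_is_safe()` only helps code that goes through the driver; it does nothing for a stray pointer from user space that happens to land in the page.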
cnb
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

Excellent research and findings!

Looks like it's time to update drivers and patches :-)


Best Regards,

-cnb

Michael_Pelleti
Occasional Advisor

Re: How does the kernel detect a bus check HPMC?

We discovered, not too long after my last post, that the invalid memory access originated from the use of uninitialized data - the prev_r19 member of the previous frame descriptor structure - by the HP LIBCL library in U_get_previous_frame_x().

 

This function called U_get_shLib_text_addr() with the invalid address, which then executed a PROBE,R and an LDW instruction on the address. If the address found in the uninitialized prev_r19 location happened to fall in the unused range of a public-mapped PCI IO page, the PROBE,R would return true because the page is a public IO map, the PCI card would ignore the out-of-range LDW, and the HPMC would be thrown six PCI clock cycles later.

 

The libcl library stack-unwind component is used extensively in Ada and C++ programs, as part of the exception handling mechanism, so every time an exception was raised in the code the runtime library ran the risk of crashing the system.

 

If the previous-frame structure memory had been allocated with calloc() instead of malloc(), there would have been no problem, since the PROBE,R instruction would return false for a null pointer.

 

The fix, as I understand it, was to add one line of code when entering U_get_previous_frame():

previous_frame->prev_r19 = 0;
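
Here is a small C sketch of why that one line is enough, under some loud assumptions: `frame_desc` is an invented stand-in for the real libcl structure (only the `prev_r19` field name comes from the actual code), `probe_read_ok()` is a model of the PROBE,R outcome rather than the instruction itself, and the IO page address is arbitrary. Stale heap contents are simulated by writing a chosen value, since reading genuinely uninitialized memory is undefined behavior in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in; the real libcl structure is not shown here. */
typedef struct {
    uintptr_t prev_r19;   /* field named above; the rest is invented */
    uintptr_t prev_sp;
} frame_desc;

/* Assumed address range of a publicly mapped PCI IO page. */
#define IO_PAGE_BASE ((uintptr_t)0xF0000000u)
#define IO_PAGE_SIZE ((uintptr_t)4096u)

/* Model of PROBE,R: false for null, true anywhere in the public IO page. */
static bool probe_read_ok(uintptr_t addr)
{
    if (addr == 0)
        return false;
    return addr >= IO_PAGE_BASE && addr - IO_PAGE_BASE < IO_PAGE_SIZE;
}

/* Simulate malloc() handing back recycled memory holding stale bytes. */
static frame_desc *alloc_frame_stale(uintptr_t stale)
{
    frame_desc *f = malloc(sizeof *f);
    if (f != NULL) {
        f->prev_r19 = stale;
        f->prev_sp = stale;
    }
    return f;
}

/* The one-line fix: clear the field on entry. */
static void enter_get_previous_frame(frame_desc *previous_frame)
{
    assert(previous_frame != NULL);
    previous_frame->prev_r19 = 0;
}

/* Would the unwinder go on to issue the load from prev_r19? */
static bool would_load(const frame_desc *f)
{
    return probe_read_ok(f->prev_r19);
}
```

A frame whose stale `prev_r19` lands in the IO page passes the probe and would go on to the load; after `enter_get_previous_frame()`, or with a `calloc()`-allocated frame, the probe fails and the load never happens.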

 

I'm not sure whether it's been released as an official patch yet; they were telling us November when we last talked to them.

 

So, this is my third HP-UX 11.11 bug-squash in one year - PHKL_41910, PHKL_42072, and this one. Not bad, eh? :)

Dennis Handly
Acclaimed Contributor

Re: How does the kernel detect a bus check HPMC?

It looks like you have found a flaw in the kernel or your driver.

mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

The flaw was indeed in the LIBCL library. It turns out that the fix was not included in PHSS_42247, since it affected PA-RISC 1.1 Pascal-language stack-unwinding operations, and the problem is a very far-corner case. So I guess we'll be using the site-specific patch for the foreseeable future.

Dennis Handly
Acclaimed Contributor

Re: How does the kernel detect a bus check HPMC?

>The flaw was indeed in the LIBCL library

 

If a user mode program can crash the box, it is a problem in the kernel or driver.

If that is fixed, the only "flaw" is one of performance.