How does the kernel detect a bus check HPMC?

 
SOLVED
mvpel
Trusted Contributor

How does the kernel detect a bus check HPMC?

We've been seeing multiple HPMC panics lately, and each time the tombstone file points to a different, seemingly random IO chassis and slot, each of which holds a third-party adapter card with either a static or dynamic-module driver.

So this got me wondering how the kernel detects a bus check HPMC - what is the HPMC handler watching for?
25 REPLIES
cnb
Honored Contributor
Solution

Re: How does the kernel detect a bus check HPMC?

Not sure if this is what you're looking for, but it is detected through firmware:

http://www.dectrader.com/docs/set3/emr_na-c01037168-2.pdf

"A hardware crash event can be High Priority Machine Check (HPMC), Low Priority Machine Check (LPMC) or Transfer of Control (TOC). The machine checks are typically caused by hardware malfunctions or certain classes of bus errors. TOC on the other hand is usually initiated by the operator in response to system software being stuck in an error state.

When a hardware crash event occurs, the processor immediately branch to PDC entry point; PDCE_CHECK for HPMC and LPMC faults, and PDCE_TOC for TOC. *The implementation details of these PDC entry points are processor dependant.* Fundamentally they save the processor's state (general, control, space and interruption registers) into Processor Internal Memory (PIM). The processor then vectors back into the operating system entry points; HPMC_Vector or TOC_Vector. These entry points are defined in the IVA (Interruption Vector Table) and MEM_TOC in Page Zero respectively.
On entry into the kernel, a crash event entry is created. The operating system makes a pdc call (PDC_PIM) to read the processor's state information from PIM into a Restart Parameter Block (RPB). As such the RPB structure contains information pertinent to the understanding of the crash. For example, the Program Counter (PC) in the RPB would indicate what routine was executing at the time of HPMC/TOC event. Once the state has been saved, the operating system continues to dump physical memory to the dump device."
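
To make the sequence in that quote easier to follow, here is a toy Python model of it. This is only a sketch: the class, the function names and the register values are made up for illustration, and real PIM/RPB layouts are processor dependent and much larger.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    # Hypothetical register snapshot; real PA-RISC state is far larger.
    registers: dict
    pim: dict = field(default_factory=dict)  # Processor Internal Memory


def pdce_check(cpu):
    """Firmware entry point for HPMC/LPMC: save processor state into PIM."""
    cpu.pim = dict(cpu.registers)
    return "HPMC_Vector"  # firmware then vectors back into the OS


def pdc_pim(cpu):
    """Stand-in for the PDC_PIM call: return a copy of the saved state."""
    return dict(cpu.pim)


def os_hpmc_entry(cpu):
    """OS side: fill a Restart Parameter Block (RPB) from PIM."""
    rpb = pdc_pim(cpu)
    return rpb


cpu = Processor(registers={"pc": 0x2F4A10, "gr1": 7})
vector = pdce_check(cpu)
cpu.registers["pc"] = 0  # live registers move on; PIM keeps the snapshot
rpb = os_hpmc_entry(cpu)
print(vector, hex(rpb["pc"]))
```

The point of the model is that the PC found in the RPB is the one captured at the moment of the event, not whatever the processor ran afterwards.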

http://book.soundonair.ru/hall2/ch10lev1sec3.html

http://sequoia.ict.pwr.wroc.pl/~iro/RISC/sm/www.hp.com/acd-38.html

http://ftp.parisc-linux.org/docs/arch/pa11_acd.pdf

http://ftp.parisc-linux.org/docs/arch/parisc2.0.pdf


Rgds,

Matti_Kurkela
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

As indicated by the documents linked by cnb, the HPMC is a hardware-level event.

I'll try to reduce the jargon level of cnb's quote a bit:

The HPMC handler is not "watching for" anything: the actual hardware chips on the system board do the watching. They check the ECC bits on any data that is being read from RAM, and also watch for similar data transmission error detection signals on other system buses.
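
As a rough illustration of that kind of hardware check, here is a small Python sketch. It uses a single even-parity bit as a stand-in, which is an assumption made for brevity: real ECC uses several check bits and can also correct single-bit errors, and the names below are invented.

```python
class MachineCheck(Exception):
    """Raised by the toy 'hardware' when stored data fails its check."""


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def store(word: int):
    """Write a word together with its check bit."""
    return (word, parity_bit(word))


def load(cell):
    """Read a word back and verify it against the stored check bit."""
    word, stored = cell
    if parity_bit(word) != stored:
        raise MachineCheck("data error detected on read")
    return word


cell = store(0b10110010)
print(load(cell))                         # clean read passes the check

corrupted = (cell[0] ^ 0b100, cell[1])    # flip one bit "in memory"
try:
    load(corrupted)
except MachineCheck as err:
    print("machine check:", err)
```

Note that nothing in the software path asks for the check; it happens on every read, which is the sense in which the handler is not "watching for" anything.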

If an error is detected, the piece of hardware activates a signal that triggers a "Group 1 interrupt" on the CPU(s). On the PA-RISC architecture, this is the most serious interrupt signal that exists. On Itanium, the terminology may be different but the event is equally severe.

The interrupt signal makes the CPU immediately stop what it's doing and mark its place in the CPU's internal registers designed for that purpose, and then check for instructions in a memory address defined at CPU design time. (Think of it as like a pre-arranged location for storing a building's evacuation and disaster recovery plans, as might be required in large office buildings by the National Building Code.)

When the system was booted up, the firmware initialized that memory address to point to the HPMC handler routine. First, the firmware HPMC handler does whatever model-specific things are required to get the error information from the system board chips and store it in standard format in the location where the kernel expects to find it.

Then the firmware checks a table of jump vectors the kernel has prepared in advance: "In case HPMC, TOC or other major sh*t happens". In the case of HPMC, this tells the firmware to run the kernel's HPMC handler. The HPMC handler will then output a message to the system console and execute the system panic procedures.
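
The hand-off from firmware to kernel can be pictured with a short Python sketch. The table, the handler names and the shape of the error record are all hypothetical; the sketch only shows the idea of a dispatch table prepared in advance.

```python
events = []  # stands in for the console and the panic path


def kernel_hpmc_handler(error_info):
    events.append(f"console: HPMC, {error_info['type']}")
    events.append("panic")


def kernel_toc_handler(error_info):
    events.append("console: TOC")
    events.append("panic")


# Table of jump vectors the kernel registers ahead of time.
kernel_vectors = {"HPMC": kernel_hpmc_handler, "TOC": kernel_toc_handler}


def firmware_handler(event, board_status):
    """Collect model-specific error data, then jump to the kernel's handler."""
    # Store the error data in a standard format the kernel can read.
    error_info = {"type": board_status.get("check", "unknown")}
    kernel_vectors[event](error_info)
    return error_info


firmware_handler("HPMC", {"check": "bus check"})
print(events)
```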

MK
Laurent Menase
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

Also, even though an HPMC is a hardware-triggered panic, the root cause may be hardware or software:
1) it may be a timing problem with the interfaces,
2) it can also be a driver problem, where the driver accesses an address it should not and that access causes the HPMC.
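
Case 2 can be sketched in Python as follows. This is a deliberately simplified model, assuming a bus where an access to an address that no device claims is reported as an error; the address windows and the driver function are invented for the example.

```python
class BusCheck(Exception):
    """Raised by the toy bus when no device answers an access."""


# Hypothetical address windows claimed by adapter cards.
claimed = [(0x1000, 0x1FFF, "card A"), (0x4000, 0x4FFF, "card B")]


def bus_read(addr):
    """Return the owner of the address, or fail if nothing claims it."""
    for low, high, owner in claimed:
        if low <= addr <= high:
            return owner
    raise BusCheck(f"no device responded at {addr:#x}")


def buggy_driver_read(base, offset):
    """Driver bug: the offset is never checked against the card's window."""
    return bus_read(base + offset)


print(buggy_driver_read(0x1000, 0x4))        # lands inside card A's window
try:
    buggy_driver_read(0x1000, 0x2000)        # 0x3000 belongs to no device
except BusCheck as err:
    print("bus check:", err)
```

The hardware reports the failure, but the cause in this model is purely the driver's bad offset.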

mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

Laurent, thanks for that insight about a possible software cause for an HPMC. Would something along those lines wind up being identified as a "bus check" HPMC?

This possibility definitely lends weight to my suspicion that a driver problem is at the root of the issue we've been having, since we've actually been able to reproduce a "Bus Check" HPMC on a different system, which suggests something other than a hardware problem.

The HPMCs we've seen will point to seemingly random LBA numbers, all of which have been third-party cards with either dynamic or static driver modules.

This has been extremely baffling for quite a while since we were always focused on some sort of hardware issue, and we had a lot of trouble reproducing or characterizing it.
Ismail Azad
Esteemed Contributor

Re: How does the kernel detect a bus check HPMC?

Hi mvpel,

> Simplified answer!

The kernel has its own view of memory, called the virtual address space. Virtual addresses have to be mapped to physical addresses, and the TLB caches those translations. The TLB has a finite number of entries, so a TLB miss can occur. When a device tries to access a physical address that is not actually there, it triggers a high priority machine check, which is called a machine check abort on other architectures.
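
A toy Python model of address translation may help here. It assumes a small fixed-size TLB sitting in front of a page table, with 4 KB pages and least-recently-used eviction; all of those choices, and the class itself, are illustrative assumptions rather than how any particular processor does it.

```python
from collections import OrderedDict

PAGE = 4096  # assumed page size in bytes


class TLB:
    """Finite cache of virtual-page to physical-page translations."""

    def __init__(self, size, page_table):
        self.size = size
        self.page_table = page_table   # full mapping: vpn -> physical page
        self.entries = OrderedDict()   # cached subset, oldest first
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE)
        if vpn in self.entries:
            self.entries.move_to_end(vpn)          # mark as recently used
        else:
            self.misses += 1                       # TLB miss: walk the table
            self.entries[vpn] = self.page_table[vpn]  # KeyError if unmapped
            if len(self.entries) > self.size:
                self.entries.popitem(last=False)   # evict least recently used
        return self.entries[vpn] * PAGE + offset


tlb = TLB(size=2, page_table={0: 5, 1: 9, 2: 3})
for vaddr in (0x10, 0x1004, 0x20, 0x2000, 0x1004):
    print(hex(vaddr), "->", hex(tlb.translate(vaddr)), "misses:", tlb.misses)
```

A miss in this model is only a slower lookup, not an error. The error case is an address with no valid mapping or no hardware behind it.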

Regards
Ismail Azad
Read, read and read... Then read again until you read "between the lines".....
cnb
Honored Contributor

Re: How does the kernel detect a bus check HPMC?

Hi,

Yes they can be software as per the crashinfo statement:


"Note: This appears to be a BUS check hpmc. BUS Checks are often caused by
hardware problems, but there are many software causes as well.
To progress a BUS Check HPMC you will normally need to obtain the
hardware TOMBSTONE and analyse it."

Rgds,
mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

Ismail - so it's possible for software to manipulate the TLB directly, rather than dealing only in virtual addresses and leaving the translation to someone else?
Ismail Azad
Esteemed Contributor

Re: How does the kernel detect a bus check HPMC?

Hi,

Reading your first post again, I understand that you have been experiencing multiple HPMC panics, and you probably have similar kernel configurations on the various servers (if you've used Ignite). The root cause could be a parameter that controls PCI recovery. Please check the value of pci_eh_enable, as this MIGHT be the cause of your problem. Since you were talking about "kernel detection": pci_eh_enable is the parameter that can cause this kind of disaster in most cases if it is configured wrongly.

Regards
Ismail Azad
mvpel
Trusted Contributor

Re: How does the kernel detect a bus check HPMC?

These systems are 11i v1, so it looks like that tunable is not implemented - it doesn't show up with kmtune or adb queries.