
DL360 G10 - (CRITICAL) Uncorrectable Machine Check Exception Crash

 
TomJ802
Frequent Advisor


Yesterday my 'new' DL360 G10 crashed with the following error:

HPE iLO AlertMail-006: (CRITICAL) Uncorrectable Machine Check Exception (Processor 1, APIC ID 0x00000000, Bank 0x00000004, Status 0xBA000000'58000402, Address 0x00000000'00000000, Misc 0x00000000'00000000).
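For anyone who wants to work with the raw values in an alert like this, here is a minimal sketch that pulls the fields out of the message text. The function name `parse_mce_alert` and the regular expression are illustrative assumptions based only on the message format shown above, not an HPE-provided parser.

```python
import re

# Matches the parenthesised field list of the alert text shown above.
# The apostrophe in values such as 0xBA000000'58000402 is a digit separator.
MCE_PATTERN = re.compile(
    r"Processor (?P<processor>\d+), "
    r"APIC ID (?P<apic_id>0x[0-9A-Fa-f']+), "
    r"Bank (?P<bank>0x[0-9A-Fa-f']+), "
    r"Status (?P<status>0x[0-9A-Fa-f']+), "
    r"Address (?P<address>0x[0-9A-Fa-f']+), "
    r"Misc (?P<misc>0x[0-9A-Fa-f']+)"
)


def parse_mce_alert(text):
    """Return the alert fields as integers, or None if the text does not match."""
    match = MCE_PATTERN.search(text)
    if match is None:
        return None
    fields = {"processor": int(match.group("processor"))}
    for name in ("apic_id", "bank", "status", "address", "misc"):
        # Drop the apostrophe separator before converting from hex.
        fields[name] = int(match.group(name).replace("'", ""), 16)
    return fields


alert = (
    "(CRITICAL) Uncorrectable Machine Check Exception (Processor 1, "
    "APIC ID 0x00000000, Bank 0x00000004, Status 0xBA000000'58000402, "
    "Address 0x00000000'00000000, Misc 0x00000000'00000000)."
)
fields = parse_mce_alert(alert)
print(fields["bank"], hex(fields["status"]))
```

Keeping a copy of the parsed values outside the server also helps if the on-box log is lost.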

The system rebooted, failed, appeared to reset some BIOS values, rebooted again, and is now running.  But my confidence has obviously been shaken, and I am waiting for it to happen again.

I did a lot of Googling; the 'obvious' answer seems to be a bad CPU.  But the hardware tests all pass, so others claim it is a firmware issue.  And when was the last time you actually had a CPU fail?  That seems unlikely to me.

Coincidentally, I found this advisory - Advisory: (Revision) HPE NVMe Solid State Drives - SSDs NVMe Models with Firmware Version MPK77H5Q, MPK7725Q, or HPK5 May Cause UMCEs on AMD and Intel-Based Gen9, Gen10, Gen10 Plus or Gen11 Servers

Certain NVMe drive models with firmware version MPK77H5Q, MPK7725Q or HPK5 may cause an Uncorrectable Machine Check Exception (UMCE) to occur. An uncorrectable PCIe bus error may also be logged in the case of Intel-based servers. These errors will be logged to the HPE Integrated Management Log (IML) and will cause the operating system to crash. The error may also cause an unexpected server reboot event.

Critical,1240,172348,0x0005,CPU,0x0003,Hardware,06/12/2024 12:05:43,40118: Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000020, Bank 0x00000006, Status 0xBB800000'00000E0B, Address 0x00000000'00000000, Misc 0x00000000'AE100000). ACTION: Update the system firmware. If the issue persists, contact support.

I happen to have (4) 800GB SAS MO000800KXAVN drives in my system showing HPK5 firmware.  But I'll be darned if I can find this drive on the HPE website, let alone firmware for it.  I did apply SPP 2025/09 in early October, which appears to be the latest SPP available.  And I do not see any other firmware warnings in my HPE support account.
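One point worth separating out before chasing firmware: the advisory text quoted above covers certain NVMe drive models, while the drives here are SAS. A minimal sketch of that check follows. The function name is hypothetical, and the firmware set is copied from the advisory title. A `True` result only means the drive model should still be compared against the advisory's own model list, which is not reproduced in this thread.

```python
# Firmware versions named in the advisory title quoted above.
AFFECTED_FIRMWARE = {"MPK77H5Q", "MPK7725Q", "HPK5"}


def advisory_may_apply(interface, firmware):
    """Return True only for NVMe drives running a firmware named in the advisory.

    Assumption: a matching firmware string on a non-NVMe drive is treated as
    out of scope, because the advisory text refers to NVMe models only.
    """
    is_nvme = interface.strip().upper() == "NVME"
    is_listed = firmware.strip().upper() in AFFECTED_FIRMWARE
    return is_nvme and is_listed


# The drives described in this post: SAS interface, HPK5 firmware.
print(advisory_may_apply("SAS", "HPK5"))
```

The firmware string matches, but the interface does not, so on the face of it the advisory does not describe these drives.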

Any suggestions?  Or where to locate FW for my MO000800KXAVN drives?

THANKS

4 REPLIES
TomJ802
Frequent Advisor

Re: DL360 G10 - (CRITICAL) Uncorrectable Machine Check Exception Crash

Google reveals that I am not alone on this issue - Solved: DL380 Gen10 Uncorrectable PCI Express Error Detec... - Hewlett Packard Enterprise Community

While that thread is marked 'resolved', I'm not certain that is true - I am running more recent FW than the version that apparently 'resolved' it 6 years ago...

Does anyone know how to identify the components associated with Bank 0x04?

 

Azr_geek
Regular Advisor

Re: DL360 G10 - (CRITICAL) Uncorrectable Machine Check Exception Crash

Hello @TomJ802,

On HPE ProLiant servers like the DL360 Gen10, a message referring to “Bank 0x04” usually points to a specific internal error-logging bank inside the system’s CPU or memory controller, not to a physical slot that is labeled the same way.

In simple terms, it means the hardware recorded an error in a particular error bank, and you need the server’s management tools to translate that into the actual component. The best way to identify the real part behind Bank 0x04 is to check the Integrated Management Log (IML) in iLO, run an HPE Active Health System (AHS) report, or use HPE Support Tools like SSA or STP.

These tools map the bank number to a PCIe device, DIMM, or CPU lane. Without that mapping, Bank 0x04 alone isn’t enough to know the component. If the error keeps repeating, generating an AHS log and giving it to HPE support is usually the fastest way to pinpoint the exact card, slot, or controller causing the PCIe error.
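While the bank-to-component mapping needs the vendor tools, the Status value in the alert can be split into its generic flag bits by hand. The sketch below assumes the architectural Intel `IA32_MCi_STATUS` layout (the DL360 Gen10 is Intel-based). The function name is illustrative, and the meaning of the model-specific code and of the bank number depends on the processor model, so it is deliberately left undecoded.

```python
def decode_mci_status(status):
    """Split a 64-bit machine-check status word into its architectural fields."""

    def bit(position):
        return bool((status >> position) & 1)

    return {
        "valid": bit(63),            # VAL: the register holds a logged error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: the error was not corrected
        "enabled": bit(60),          # EN: reporting was enabled for this error
        "misc_valid": bit(59),       # MISCV: the Misc register is flagged valid
        "addr_valid": bit(58),       # ADDRV: the Address register is flagged valid
        "context_corrupt": bit(57),  # PCC: processor context may be corrupt
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


# Status value from the alert in the first post, apostrophe separator removed.
status = int("0xBA000000'58000402".replace("'", ""), 16)
for name, value in decode_mci_status(status).items():
    shown = hex(value) if isinstance(value, int) and not isinstance(value, bool) else value
    print(f"{name}: {shown}")
```

For this alert the address-valid bit is clear, which is consistent with the all-zero Address field, so the record itself carries no memory address to chase.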

Regards,
Azr_geek

TomJ802
Frequent Advisor

Re: DL360 G10 - (CRITICAL) Uncorrectable Machine Check Exception Crash

Thanks for the reply and the information.  While I have the iLO error email, I inadvertently deleted the IML entries. During the crash cycle it appears the BIOS reset itself, and the most recent entry was a tamper warning for the server lid.  The GUI had check boxes to the left of every entry, so I checked that one entry and clicked the delete icon; none of the check boxes for the error entries were checked.  Imagine my surprise (and frustration) when that action deleted the entire IML log.  That may be user error, but I will say the GUI does not make this behavior obvious.

Hopefully this crash does not happen again, though I have never encountered this with any of my older Lenovo servers and my confidence has been severely shaken.  We recently purchased this DL360 for a business-critical application, and now I am losing sleep and checking the server status first thing every morning.

Tom

 

Azr_geek
Regular Advisor

Re: DL360 G10 - (CRITICAL) Uncorrectable Machine Check Exception Crash

Losing the IML entries after a critical crash would unsettle anyone, especially when the server is running a business-critical workload. Unfortunately, the iLO IML delete function does wipe the entire log, even if only one box is selected. You didn’t do anything wrong; the interface really is not very clear about that, and many admins have been caught by the same behavior.

The AHS (Active Health System) keeps deeper, time-stamped hardware telemetry that survives BIOS resets and IML deletions. If this issue happens again, the AHS file will give HPE support the best chance of pinpointing the exact root cause.

A single crash after a firmware reset isn’t always a sign of ongoing failure. Sometimes a one-off PCIe machine check comes from a momentary glitch, power fluctuation, or firmware state issue, especially around updates or resets. If the system is now stable, there’s a good chance it was an isolated event.

Your hardware is still under support. HPE will normally treat repeat MCE/PCIe errors very seriously. If it happens again, they can analyze the AHS logs and often identify a specific card, riser, or system board. In some cases they’ll replace parts proactively.

You’re not wrong to be worried (anyone would be), but one crash doesn’t automatically mean you have an unreliable system. If something is truly faulty, the server will usually show repeatable symptoms, and HPE is very familiar with tracking these down.

If you want, I can also summarize what to collect or check now, so that if the issue returns you have everything ready for support.

Regards,
Azr_geek