
DL380G7 Uncorrectable Machine Check Exception

 
M. Meckel
Occasional Advisor


Hi there,

I deployed a new DL380 G7 with the following specs:

2x Xeon X5650 CPU (2.66 GHz), 6/6 cores; 12 threads
8x 8192 MB RAM 1333 MHz
1x Embedded P410i with 1GB FBWC

Firmware:

BIOS: 12/01/2010
iLo3: 1.16
P410i: 3.66

OS: Debian Squeeze
Kernel: 2.6.32-5-amd64

The BIOS power-saving setting was "OS Control mode", and on Debian the package cpufrequtils was installed (which sets the cpufreq governor to "ondemand" for all CPUs).

While running some tests the box suddenly crashed hard:

http://test.thermoman.de/images/hp/dl380g7.kernel.panic.png

Integrated Management Log says:

Class: System Error
Description: An Unrecoverable System Error (NMI) has occurred (System error code 0x00000000, 0x00000000)

Class: CPU
Description: Uncorrectable Machine Check Exception (Board 0, Processor 2, APIC ID 0x00000020, Bank 0x00000005, Status 0xBA000000'00400405, Address 0x00000000'00000000, Misc 0x00000000'00004100)

Class: CPU
Description: Uncorrectable Machine Check Exception (Board 0, Processor 2, APIC ID 0x00000021, Bank 0x00000005, Status 0xBA000000'00400405, Address 0x00000000'00000000, Misc 0x00000000'00004100)

See http://test.thermoman.de/images/hp/dl380g7.ilo.iml.png
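
For anyone who wants to compare entries like these across machines, here is a minimal Python sketch that pulls the fields out of an IML description line. It assumes the exact text layout shown above; `parse_iml_mce` and `MCE_RE` are names made up for this example, not part of any HP tool.

```python
import re

# Matches the field layout of the IML "Machine Check Exception" text quoted above.
# Status, Address and Misc are printed as two 32-bit halves joined by an apostrophe.
MCE_RE = re.compile(
    r"Processor (?P<processor>\d+), "
    r"APIC ID (?P<apic_id>0x[0-9A-Fa-f]+), "
    r"Bank (?P<bank>0x[0-9A-Fa-f]+), "
    r"Status (?P<status>0x[0-9A-Fa-f]+'[0-9A-Fa-f]+), "
    r"Address (?P<address>0x[0-9A-Fa-f]+'[0-9A-Fa-f]+), "
    r"Misc (?P<misc>0x[0-9A-Fa-f]+'[0-9A-Fa-f]+)"
)


def parse_iml_mce(line):
    """Return the MCE fields as integers, or None if the line does not match."""
    match = MCE_RE.search(line)
    if match is None:
        return None
    fields = {"processor": int(match.group("processor"))}
    for key in ("apic_id", "bank", "status", "address", "misc"):
        # Drop the apostrophe so each value parses as one hexadecimal number.
        fields[key] = int(match.group(key).replace("'", ""), 16)
    return fields


if __name__ == "__main__":
    entry = (
        "Uncorrectable Machine Check Exception (Board 0, Processor 2, "
        "APIC ID 0x00000020, Bank 0x00000005, Status 0xBA000000'00400405, "
        "Address 0x00000000'00000000, Misc 0x00000000'00004100)"
    )
    for name, value in parse_iml_mce(entry).items():
        print(name, hex(value))
```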

I googled this error and found some threads here on HP IT Resource Center about a bug with 2 NICs being enabled for PXE (not the case here) and others suggesting a problem with the system board or CPU.

Since I didn't find the status value above (Status 0xBA000000'00400405) anywhere on the web, I thought I'd post it here for other lost souls :)

Solution?

1. Upgraded the BIOS firmware to version 01/30/2011.
2. Ran memtest86+. Result: no errors found.
3. Disabled cpufrequtils on Debian so the CPUs don't get clocked down for power saving.
4. Running a stress test at the moment, no definite results yet.
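
To confirm step 3 from the running system, one option is to read the governor of each CPU from sysfs. This is only a sketch: it assumes the kernel exposes the usual cpufreq files under /sys/devices/system/cpu, and `read_governors` is a name invented for this example. If cpufreq is not active, the result is simply empty.

```python
import glob
import os


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (cpu0, cpu1, ...) to its current cpufreq governor."""
    governors = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        # The path ends in .../cpuN/cpufreq/scaling_governor, so cpuN is third from the end.
        cpu = path.split(os.sep)[-3]
        with open(path) as handle:
            governors[cpu] = handle.read().strip()
    return governors


if __name__ == "__main__":
    for cpu, governor in read_governors().items():
        print(cpu, governor)
```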

Can someone tell me what part is being referenced by the IML status codes above? Is it CPU #2 that is detected as being faulty?
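
As a starting point for reading the Status value, here is a small Python sketch that splits it into the architectural fields of the IA32_MCi_STATUS register as I understand them from the Intel SDM (flag bits 63 to 57, MCA error code in bits 15:0, model-specific code in bits 31:16). The helper name `decode_mci_status` is made up for this example, and the sketch does not interpret the model-specific code or the bank number.

```python
# Flag bit positions in IA32_MCi_STATUS (architectural layout per the Intel SDM).
FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # the MISC register holds valid data
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupted
}


def decode_mci_status(status):
    """Split a 64-bit machine check status value into flags and error codes."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    decoded["mca_error_code"] = status & 0xFFFF
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF
    return decoded


if __name__ == "__main__":
    # Status from the IML entries above, with the apostrophe removed.
    for name, value in decode_mci_status(0xBA00000000400405).items():
        print(name, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

For the value above this gives VAL, UC, EN, MISCV and PCC set, ADDRV clear (which fits the all-zero Address field), an MCA error code of 0x0405 and a model-specific code of 0x0040. As far as I read the SDM's simple error code table, codes of the form 0000 01xx xxxx xxxx are processor-internal errors rather than memory or bus errors, but that is my reading and not a statement from HP.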

Thanks in advance!

Greetings,
Marcel.
17 REPLIES 17
James Kennedy_4
Trusted Contributor

Re: DL380G7 Uncorrectable Machine Check Exception

We are having this same issue with one of our DL380 G7s. I read through the fixes on the other thread as well, but none of them resolved the problem.

All firmware and drivers are up to date.

It appears to be a hardware problem though, as not all of our DL380 G7s have this issue.

I'll let you know if I come upon a valid fix.
M. Meckel
Occasional Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

The system board is now being replaced after the machine hung again, even with the newest BIOS installed (01/30/2011).

I'll let you know if the swap fixes the problem.
M. Meckel
Occasional Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

The system board got replaced. I upgraded the BIOS firmware to the latest available version (01/30/2011) and ran my stress tests again.

Result after 24 hours: machine crashed again.

Integrated Management Log says:

Class: System Error
Description: An Unrecoverable System Error (NMI) has occurred (System error code 0x00000000, 0x00000000)

Kernel Panic output looks the same as the above linked image.

The BIOS power-saving setting was "OS Control mode", and this time the package cpufrequtils was NOT installed.

For the rest of the weekend I'll now try "HP Static High Performance Mode" (suggested in another thread as a workaround from HP).
Glen Coghlan
Occasional Visitor

Re: DL380G7 Uncorrectable Machine Check Exception

Having the same issue here in Australia with multiple DL380 G7s running Windows Server 2008 R2 SP1 with all the latest BIOS fixes. I have logged a support case with HP and will reply with the outcome.

James Kennedy_4
Trusted Contributor

Re: DL380G7 Uncorrectable Machine Check Exception

In the BIOS, change the Power Regulator mode to "Static High Performance". Seems to be a good fix so far.
M. Meckel
Occasional Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

Hi James,

yes, this seems to be a temporary fix for this issue.

In Server BIOS set:

- Advanced Power option -> change to HP Static High Performance Mode

- Minimum Processor Idle Power State -> No C-state
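
To check from Linux which idle states the kernel still sees after changing these settings, one could list the cpuidle entries in sysfs. This is a rough sketch that assumes the kernel exposes /sys/devices/system/cpu/cpuN/cpuidle/stateM with `name` and `usage` files; `read_idle_states` is a made-up helper name. If the directory does not exist, the function returns an empty list.

```python
import glob
import os


def read_idle_states(cpu="cpu0", sysfs_root="/sys/devices/system/cpu"):
    """Return (name, usage count) for each idle state the kernel lists for one CPU."""
    pattern = os.path.join(sysfs_root, cpu, "cpuidle", "state[0-9]*")
    # Sort by the numeric suffix so that state10 comes after state2.
    state_dirs = sorted(glob.glob(pattern), key=lambda p: int(os.path.basename(p)[5:]))
    states = []
    for state_dir in state_dirs:
        with open(os.path.join(state_dir, "name")) as handle:
            name = handle.read().strip()
        with open(os.path.join(state_dir, "usage")) as handle:
            usage = int(handle.read().strip())
        states.append((name, usage))
    return states


if __name__ == "__main__":
    for name, usage in read_idle_states():
        print(name, usage)
```

Running it before and after the BIOS change and comparing the lists (or whether the usage counters of the deeper states still increase) should show whether the setting took effect.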

I found this workaround here:

"Absolute nightmare of a DL380 G7"

http://h30499.www3.hp.com/t5/ProLiant-Servers-ML-DL-SL/Absolute-nightmare-of-a-DL380-G7/m-p/4709685#M106891


So far no more MCEs. I keep my fingers crossed.

BUT: the green IT and power saving that HP advertised its G7 line with is ridiculous if you have to disable power saving to get a stable machine.

Jase4772
Regular Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

I'm having the same issue with a DL380 G7, but I've only changed the C-state option to start with, as a friend of mine had an issue with an Intel CPU and this resolved it.

 

I'm hoping this is enough, as I don't really want to increase power usage: I noticed it jump from 95 watts to 125 with the other setting.

 

Thanks for the help.

Jase

Systems Engineer_1
Occasional Visitor

Re: DL380G7 Uncorrectable Machine Check Exception

Same issue...

 

10/27/2011 15:52

Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000001, Bank 0x00000005, Status 0xB2000000'00800400, Address 0x00000000'00000000, Misc 0x00000000'00000000)

 

The machine went down hard over the weekend. I narrowed it down to the system board yesterday and had HP come out and replace the motherboard today. Now I can't even get the machine to boot a SmartStart CD, let alone the OS; it just keeps power cycling when it comes time to load an OS.

 

I implemented high performance power and have also put the processors in no C-states mode.

 

Any other troubleshooting advice would be great. We have 9 other DL380 G7s and haven't had issues with them.

SFHR
Frequent Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

Hi,

I guess it's the processor or the processor's VRM causing the problem. Try replacing it with a known-good one. I would suggest following a step-by-step hardware troubleshooting flowchart.

Replace the VRM for Processor 1 and recheck whether it works.

Replace the processor with a known-good one and see.

Repeat this for both processors.

Hope this helps. Please keep us posted with the results.

 

Regards,
SF Hussain
