DL380G7 Uncorrectable Machine Check Exception

M. Meckel
Occasional Advisor

Hi there,

I deployed a new DL380 G7 with the following specs:

2x Xeon X5650 CPU (2.66 GHz), 6/6 cores; 12 threads
8x 8192 MB RAM 1333 MHz
1x Embedded P410i with 1GB FBWC

Firmware:

BIOS: 12/01/2010
iLo3: 1.16
P410i: 3.66

OS: Debian Squeeze
Kernel: 2.6.32-5-amd64

The BIOS power-saving setting was "OS Control mode", and on Debian the package cpufrequtils was installed (which sets the cpufreq governor to "ondemand" for all CPUs).
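For anyone who wants to confirm which governor is actually active, here is a minimal Python sketch. It assumes the kernel exposes the usual cpufreq sysfs files (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); the function name `read_governors` and its `sysfs_root` parameter are made up for this example and are not part of cpufrequtils.

```python
from collections import Counter
from pathlib import Path


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq.

    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a real system the default should be correct.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    governors = {}
    for gov_file in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


if __name__ == "__main__":
    found = read_governors()
    if not found:
        # Typical when the BIOS, not the OS, controls frequency scaling.
        print("no cpufreq interface found")
    for governor, count in Counter(found.values()).items():
        print(f"{governor}: {count} CPUs")
```

If no cpufreq directory shows up at all, that usually means frequency scaling is not under OS control, which is worth knowing before blaming the governor.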

While running some tests the box suddenly crashed hard:

http://test.thermoman.de/images/hp/dl380g7.kernel.panic.png

Integrated Management Log says:

Class: System Error
Description: An Unrecoverable System Error (NMI) has occurred (System error code 0x00000000, 0x00000000)

Class: CPU
Description: Uncorrectable Machine Check Exception (Board 0, Processor 2, APIC ID 0x00000020, Bank 0x00000005, Status 0xBA000000'00400405, Address 0x00000000'00000000, Misc 0x00000000'00004100)

Class: CPU
Description: Uncorrectable Machine Check Exception (Board 0, Processor 2, APIC ID 0x00000021, Bank 0x00000005, Status 0xBA000000'00400405, Address 0x00000000'00000000, Misc 0x00000000'00004100)

See http://test.thermoman.de/images/hp/dl380g7.ilo.iml.png

I googled this error and found some threads here on the HP IT Resource Center about a bug with 2 NICs being enabled for PXE (not the case here), and others suggesting a problem with the system board or CPU.

Since I didn't find the mentioned numbers (Status 0xBA000000'00400405) anywhere on the web, I thought I'd post them here for other lost souls :)

Solution?

1. upgraded BIOS Firmware to version 01/30/2011
2. memtest86+ - Result: no errors found
3. disabled cpufrequtils on Debian so CPUs don't get clocked down for power saving
4. running stress test at the moment, no definite results yet.

Can someone tell me what part is being referenced by the IML status codes above? Is it CPU #2 that is detected as being faulty?
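For anyone else trying to read these numbers: a minimal Python sketch that splits the Status value into its fields. It assumes the IML "Status" field is the raw IA32_MCi_STATUS register of the reporting bank, with the flag bits and simple error codes laid out as in the Machine-Check Architecture chapter of Intel's Software Developer's Manual. The helper names `parse_iml_hex` and `decode_status` are made up for this example.

```python
# Flag bits of IA32_MCi_STATUS (assumed layout, per Intel's SDM).
FLAGS = {
    63: "VAL (register contents valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def parse_iml_hex(text):
    """Turn the IML notation 0xBA000000'00400405 into an integer."""
    return int(text.replace("'", ""), 16)


def decode_status(status):
    """Split a 64-bit status value into flags and error-code fields."""
    mca_code = status & 0xFFFF              # bits 15:0, architectural code
    model_code = (status >> 16) & 0xFFFF    # bits 31:16, model-specific code
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    # Only the two "internal" simple error codes are classified here.
    if mca_code == 0x0400:
        kind = "internal timer error"
    elif (mca_code & 0xFC00) == 0x0400:
        kind = "internal unclassified error"
    else:
        kind = "not classified by this sketch"
    return {
        "flags": flags,
        "mca_error_code": mca_code,
        "model_specific_code": model_code,
        "kind": kind,
    }


if __name__ == "__main__":
    result = decode_status(parse_iml_hex("0xBA000000'00400405"))
    for flag in result["flags"]:
        print(flag)
    print(f"MCA error code: 0x{result['mca_error_code']:04X} ({result['kind']})")
    print(f"Model-specific code: 0x{result['model_specific_code']:04X}")
```

Under those assumptions, the value above decodes to an uncorrected error with VAL, UC, EN, MISCV and PCC set, ADDRV clear (which matches the all-zero Address field), and MCA error code 0x0405, which falls in the "internal unclassified" range. That says the processor reported an internal error in bank 5; it does not by itself prove which part is faulty.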

Thanks in advance!

Greetings,
Marcel.
17 REPLIES
James Kennedy_4
Trusted Contributor

Re: DL380G7 Uncorrectable Machine Check Exception

We are having this same issue with one of our DL380 G7s. I read through the fixes on the other thread as well, but none of them resolved the problem.

All firmware and drivers are up to date.

It appears to be a hardware problem though, as not all of our DL380 G7s have this issue.

I'll let you know if I come upon a valid fix.
M. Meckel
Occasional Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

The system board is now being replaced, after the machine hung again even with the newest BIOS installed (01/30/2011).

I'll let you know if the swap fixes the problem.
M. Meckel
Occasional Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

The system board got replaced. I upgraded the BIOS firmware to the latest available (01/30/2011) and ran my stress tests again.

Result after 24 hours: machine crashed again.

Integrated Management Log says:

Class: System Error
Description: An Unrecoverable System Error (NMI) has occurred (System error code 0x00000000, 0x00000000)

Kernel Panic output looks the same as the above linked image.

BIOS Setting for Power-Saving was set to "OS Control mode" and the package cpufrequtils this time was NOT installed.

For the rest of the weekend I'll try "HP Static High Performance Mode" (suggested in some thread as a workaround from HP).
Glen Coghlan
Occasional Visitor

Re: DL380G7 Uncorrectable Machine Check Exception

Having the same issues here in Australia with multiple DL380 G7s running Windows Server 2008 R2 SP1 with all the latest BIOS fixes. I have logged a support case with HP and will reply back with the outcome.

James Kennedy_4
Trusted Contributor

Re: DL380G7 Uncorrectable Machine Check Exception

In the BIOS, change the Power Regulator mode to "Static High Performance". Seems to be a good fix so far.
M. Meckel
Occasional Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

Hi James,

yes, this seems to be a temporary fix for this issue.

In Server BIOS set:

- Advance Power option -> change to = HP Static High Performance Mode.

- Minimum Processor Idle Power State -> No C-state
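On Linux you can check from the OS side which idle states the kernel sees after changing this setting. A minimal Python sketch, assuming the kernel exposes the standard cpuidle sysfs layout (`/sys/devices/system/cpu/cpuN/cpuidle/stateM/name`); the function name `read_idle_states` and its parameters are made up for this example.

```python
from pathlib import Path


def read_idle_states(cpu="cpu0", sysfs_root="/sys/devices/system/cpu"):
    """Return the idle-state names the kernel exposes for one CPU.

    The names are returned in state-index order (state0, state1, ...).
    An empty list means the kernel exposes no cpuidle states for that CPU.
    """
    base = Path(sysfs_root, cpu, "cpuidle")
    if not base.is_dir():
        return []
    state_dirs = sorted(
        base.glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),  # numeric, so state10 > state2
    )
    return [(d / "name").read_text().strip() for d in state_dirs]


if __name__ == "__main__":
    states = read_idle_states()
    print("idle states for cpu0:", ", ".join(states) if states else "none exposed")
```

What shows up here depends on the kernel version and idle driver, so treat it as a sanity check of the BIOS change rather than proof that deep C-states are off.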

I found this workaround here:

"Absolute nightmare of a DL380 G7"

http://h30499.www3.hp.com/t5/ProLiant-Servers-ML-DL-SL/Absolute-nightmare-of-a-DL380-G7/m-p/4709685#M106891


So far no more MCEs. I keep my fingers crossed.

BUT: the green IT and power saving that HP advertised its G7 line with is ridiculous if you have to disable power saving to get a stable machine.

Jase4772
Regular Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

I'm having the same issue myself with a DL380 G7, but I've only gone for the C-state option to start with, as a friend of mine had an issue with the Intel CPU and this resolved it.

I'm hoping it's this, as I don't really want to impact the power usage: I noticed it jump from 95 watts to 125 with the other setting.

 

Thanks for the help.

Jase

Systems Engineer_1
Occasional Visitor

Re: DL380G7 Uncorrectable Machine Check Exception

Same issue...

 

10/27/2011 15:52

Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000001, Bank 0x00000005, Status 0xB2000000'00800400, Address 0x00000000'00000000, Misc 0x00000000'00000000)

 

The machine went down hard over the weekend. I troubleshot it down to the system board yesterday and had HP come out and replace the motherboard today. Now I can't even get the machine to boot to a SmartStart CD, let alone the OS; it just keeps power cycling when it comes time to load an OS.

 

I implemented high performance power and have also put the processors in no C-states mode.

 

Any other troubleshooting advice would be great, we have 9 other DL380 G7's and haven't had issue with them.

SFHR
Frequent Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

Hi,

I guess it's the processor, or the VRM for the processor, causing the problem. Try replacing it with a known-good one. I would suggest following a step-by-step hardware troubleshooting flow chart.

Replace the VRM for Processor 1 and recheck whether it works.

Replace the processor with a good one and see.

Repeat this for both processors.

 

Hope this will help. Please keep posted with results.

 

Regards,
SF Hussain

Jase4772
Regular Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

The C-State issue was resolved in the last BIOS firmware release for G7's. If your error is a result of the same problem then you should update as soon as possible.

Doug Herlovitch_1
Occasional Visitor

Re: DL380G7 Uncorrectable Machine Check Exception

Appears that HP has now fixed this with the May 5, 2011 BIOS update

http://h20566.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?sp4ts.oid=4161635&spf_p.tpst=psiSwdMain&spf_p.prp_psiSwdMain=wsrp-navigationalState%3Dlang%253Den%257Ccc%253DUS%257CprodSeriesId%253D4091412%257CswItem%253DMTX-6f110902f85648beacca7ffb2f%257CprodNameId%253D4161635%257CswEnvOID%253D4024%257CswLang%253D8%257Cmode%253D4%257Cidx%253D3%257Caction%253DdriverDocument&javax.po...

 

Problems Fixed:

Resolved an issue that may result in any of the following conditions: operating system stops responding, unexpected system reset, Blue Screen when using a Microsoft Windows operating system, kernel panic when using a Linux operating system, or Purple Screen when using VMware ESX. A message may be displayed by the operating system or logged in the HP Integrated Management Log (IML) when this issue occurs indicating an "Uncorrectable Machine Check Exception." However, there are instances where the system resets before the operating system displays an error message and instances where the IML contains no log entry when this issue occurs. This issue does not occur if the Minimum Processor Idle State is configured for No C-states or C1E-state. The system is susceptible to this issue in the default Minimum Processor Idle State configuration.

Resolved an issue where PCI-Express Gen 3 option cards would run at PCI-Express Gen 1 speeds rather than the appropriate behavior of running at PCI-Express Gen 2 speeds. This server supports a maximum PCI-Express speed of Gen 2.

Resolved an issue in which uncorrectable memory errors (or other fatal system errors) will not be logged to the HP Integrated Management Log (IML) when using some revisions of VMware ESX Server. These errors will result in a fatal error (Purple Screen of Death - PSoD) under VMware ESX, but there will not be any indication of the error type (including no indication of an uncorrectable memory error or what DIMM has failed). A VMware ESX Server issue which can result in uncorrectable memory errors this is addressed in VMware ESX 4.1 U1 and VMware ESX 4.0 U3. This System ROM revision addresses the logging of errors to the IML.

 

Robert Hawle
Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

Running a DL380 G7 with Windows Server 2008 R2.

I have had this Uncorrectable Machine Check Exception in the IML and updated to BIOS 05/05/2011.

Now the system resets with the following IML messages logged:

 

Operating System failure (Windows bug check, STOP: 0x00000080 (0x00000000004F4454, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000))
Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 7, Function 0, Error status 0x00000000)

 

The minidump says: "This is typically due to a hardware malfunction. The hardware supplier should be called."

 

I installed the HP Service Pack from 27.3.2012; the server reboots twice a day.

Does anyone have the same issue? Any ideas?

Should I replace the motherboard?

 

thx

 

Dan Gough
Occasional Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

Hi All,

 

We are running Windows 2008 R2 SP1 Data Center Core Edition on our 4-node Hyper-V farm, built on HP ProLiant DL385 G7s (Performance edition with extra RAM and Fiber HBAs). We've been suffering terrible stability problems since upgrading our nodes and have had a Microsoft Premier call open. After setting the host OS up to support NMI crash dumps, we experienced a crash and tried "Generate NMI to System". This failed, and according to Microsoft that confirms the issue is with the hardware, since the Non-Maskable Interrupt is the highest level of interaction and should bypass any soft hang. The only difference I can see is that following the NMI dump (which apparently didn't work) we did get a bug stop in the IML post-reboot, which is more than we got before (so maybe it did work but just didn't force the blue screen and reboot).

 

This bug stop led me here. For the record, we run in Static High Performance since we don't want to risk any performance glitches on our hosts (we figure that since we've reduced our physical footprint through virtualisation, we don't need to justify additional power saving).

 

We are running all the very latest Windows patches (including KB2568088 for Bulldozer, needed to even get VMs to boot). I just checked the BIOS (2012.05.08 A18) and, despite Static High Performance, we have a default Minimum Processor Idle Power State of "Core C6 (CC6) State". I've now changed this to "No C-States" on the most recent node to crash.

 

Our biggest issue here is that the hangs are not following a pattern. We've had anything from 1 week to 1 month between them, and it has happened on 3 of the 4 nodes at different points, so I find a physical hardware issue unlikely.

 

We have a ticket open with HP and we're escalating so I will post if we get any updates, but I am interested to know if you guys stayed on No C-States to avoid the issue or if that earlier BIOS resolved the issue in your CPU's?

 

Also for the record we are running the 16 Core AMD's with Bulldozer (6282SE's).

 

Interesting times ahead gaining business confidence back!

Dan Gough
Occasional Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

It seems this might be a curve-ball. Having refreshed my memory on the events around our last crash: the bug stop was generated when we tested the NMI feature post-reboot. As stated, during the hang itself the NMI failed to do anything. I guess we will have to wait for our HP ticket to escalate!
cgchavero
Occasional Visitor

Re: DL380G7 Uncorrectable Machine Check Exception

This workaround works for me too, this is the scenario:

 

- HP Proliant DL 160 Gen 8 E5-2603

- HP 1 TB 6GB SAS 7.2K 3.5in SC MDL HDD

- HP Smart Array P420/1GB FBWC Controller

- Sangoma PCI Wildcard A102D (PCI Express 2.0)

- Elastix 2.4

- CentOS release 5.10

- Kernel 2.6.18-371.3.1.el5

 

Right now I can count 40 days without MCEs and... I keep my fingers crossed too.

 

I attach the screen shot of iLo

Server-Support
Super Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

@Glen Coghlan yes, same here. My HP BL465c G7 blade server, which had been running for more than 2 years, rebooted today during business hours.

Here are the IML log entries:

Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000010, Bank 0x00000004, Status 0xF2000000'00070F0F, Address 0x00000000'00000000, Misc 0x00000000'00000000)
Uncorrectable Chipset Error (Error status 1 0x0018C154, Error status 2 0x00244000)
Uncorrectable Chipset Error (Error status 1 0x0018C160, Error status 2 0x00002040)
Uncorrectable Chipset Error (Error status 1 0x0018C16C, Error status 2 0x20000080)
Uncorrectable Chipset Error (Error status 1 0x0018C170, Error status 2 0x040406FF)
Uncorrectable Chipset Error (Error status 1 0x0018C174, Error status 2 0x00000003)
Uncorrectable Chipset Error (Error status 1 0x0018C178, Error status 2 0x9452EA00)

My server ROM is A19 12/08/2012, but according to http://h20565.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c03250482 the system ROM dated 12.31.2011 already corrects this issue, and that one is older than mine?

teojaimes
Occasional Visitor

Re: DL380G7 Uncorrectable Machine Check Exception

@Server-Support, did your issue get resolved?  We have now experienced the reboot and "Uncorrectable Machine Check Exception" IML entry on two separate production servers, both DL385 G7.