Re: DL380G7 Uncorrectable Machine Check Exception

 
Jase4772
Regular Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

The C-state issue was resolved in the latest BIOS firmware release for G7s. If your error results from the same problem, you should update as soon as possible.

Doug Herlovitch_1
Occasional Visitor

Re: DL380G7 Uncorrectable Machine Check Exception

It appears that HP has now fixed this with the May 5, 2011 BIOS update:

https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c02914393

Problems Fixed:

Resolved an issue that may result in any of the following conditions: operating system stops responding, unexpected system reset, Blue Screen when using a Microsoft Windows operating system, kernel panic when using a Linux operating system, or Purple Screen when using VMware ESX. A message may be displayed by the operating system or logged in the HP Integrated Management Log (IML) when this issue occurs indicating an "Uncorrectable Machine Check Exception." However, there are instances where the system resets before the operating system displays an error message and instances where the IML contains no log entry when this issue occurs. This issue does not occur if the Minimum Processor Idle State is configured for No C-states or C1E-state. The system is susceptible to this issue in the default Minimum Processor Idle State configuration.

Resolved an issue where PCI-Express Gen 3 option cards would run at PCI-Express Gen 1 speeds rather than the appropriate behavior of running at PCI-Express Gen 2 speeds. This server supports a maximum PCI-Express speed of Gen 2.

Resolved an issue in which uncorrectable memory errors (or other fatal system errors) will not be logged to the HP Integrated Management Log (IML) when using some revisions of VMware ESX Server. These errors will result in a fatal error (Purple Screen of Death - PSoD) under VMware ESX, but there will not be any indication of the error type (including no indication of an uncorrectable memory error or which DIMM has failed). A VMware ESX Server issue which can result in uncorrectable memory errors is addressed in VMware ESX 4.1 U1 and VMware ESX 4.0 U3. This System ROM revision addresses the logging of errors to the IML.
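The release note above ties the issue to the Minimum Processor Idle State setting. On a Linux host, one way to see which idle states the OS actually exposes is to read the kernel's cpuidle sysfs tree. The following is a minimal sketch, assuming a kernel that provides `/sys/devices/system/cpu/cpuN/cpuidle` (older kernels may not); the function name `list_idle_states` is made up for illustration, and this only reports what the OS sees, it does not change the BIOS setting.

```python
from pathlib import Path


def list_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return a list of (state_dir, name, disabled) tuples for one CPU.

    'disabled' is True/False when the kernel exposes a 'disable' file for
    the state, and None when that file is absent.
    """
    base = Path(cpu_dir)
    if not base.is_dir():
        # No cpuidle directory: e.g. an older kernel or no cpuidle driver.
        return []

    states = []
    # Sort numerically so state10 comes after state9, not after state1.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        disabled = None
        if disable_file.exists():
            disabled = disable_file.read_text().strip() == "1"
        states.append((state.name, name, disabled))
    return states


if __name__ == "__main__":
    for entry in list_idle_states():
        print(entry)
```

If the list is empty or shows only shallow states after setting "No C-states" in the BIOS, that is at least consistent with the workaround being in effect, although the BIOS setup screen remains the authoritative place to check.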

[Note: broken link updated by Mod]

Robert Hawle
Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

Running a DL380 G7 with Windows Server 2008 R2.

I had this Uncorrectable Machine Check Exception in the IML and updated to the 05/05/2011 BIOS.

Now the system resets with the following IML messages logged:

 

Operating System failure (Windows bug check, STOP: 0x00000080 (0x00000000004F4454, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000))
Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 7, Function 0, Error status 0x00000000)
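For anyone collecting these entries from several servers, the bug check line can be split into its STOP code and four parameters. This is a small sketch that assumes the IML text has exactly the shape shown above; `parse_bugcheck` and the lookup table are illustrative names, not part of any HP or Microsoft tool. The only name in the table, NMI_HARDWARE_FAILURE for 0x80, is the documented Windows name for that bug check.

```python
import re

# Matches "STOP: 0x00000080 (p1, p2, p3, p4)" inside an IML entry.
BUGCHECK_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")

# Deliberately tiny table; extend it from Microsoft's bug check reference.
KNOWN_BUGCHECKS = {0x80: "NMI_HARDWARE_FAILURE"}


def parse_bugcheck(line):
    """Return (code, name, params) for an IML bug check line, or None."""
    match = BUGCHECK_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    name = KNOWN_BUGCHECKS.get(code, "unknown")
    return code, name, params


if __name__ == "__main__":
    entry = ("Operating System failure (Windows bug check, STOP: 0x00000080 "
             "(0x00000000004F4454, 0x0000000000000000, "
             "0x0000000000000000, 0x0000000000000000))")
    code, name, params = parse_bugcheck(entry)
    print(hex(code), name, [hex(p) for p in params])
```

A 0x80 bug check means Windows received a non-maskable interrupt signalling a hardware fault, which fits the PCI Express error logged alongside it.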

 

The minidump says: "This is typically due to a hardware malfunction. The hardware supplier should be called."

 

I installed the HP Service Pack from 27.3.2012; the server reboots twice a day.

Does anyone have the same issue? Any ideas?

Should I replace the motherboard?

 

thx

 

Dan Gough
Occasional Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

Hi All,

 

We are running Windows Server 2008 R2 SP1 Datacenter Core edition on our 4-node Hyper-V farm, using HP ProLiant DL385 G7s (Performance edition with extra RAM and Fibre Channel HBAs). We've been suffering terrible stability problems since upgrading our nodes and had a Microsoft Premier call open. After setting the host OS up to support NMI crash dumps, we experienced a crash and tried "Generate NMI to System". This failed, and according to Microsoft that confirms the issue is with the hardware, since the non-maskable interrupt is the highest level of interaction and should bypass any soft hang. The only difference I can see is that following the NMI dump (which apparently didn't work) we did get a bug stop in the IML post-reboot, which is more than we got before (so maybe it did work but just didn't force the blue screen and reboot).

 

This bug stop led me here. For the record, we run in Static High Performance mode, since we don't want to risk any performance glitches on our hosts (we figure that since we've reduced our physical footprint through virtualisation, we don't need to justify additional power saving).

 

We are running all the very latest Windows patches (including KB2568088 for Bulldozer, needed to even get VMs to boot). I just checked the BIOS (2012.05.08 A18) and, despite Static High Performance, we have a default Minimum Processor Idle Power State of "Core C6 (CC6) State". I've now changed this to "No C-States" on the most recent node to crash.

 

Our biggest issue is that the hangs don't follow a pattern: they have been anywhere from 1 week to 1 month apart, and they have happened on 3 of the 4 nodes at different points, so I find a physical hardware fault unlikely.

 

We have a ticket open with HP and we're escalating, so I will post if we get any updates. I am interested to know whether you stayed on No C-States to avoid the issue, or whether that earlier BIOS resolved the issue on your CPUs.

 

Also, for the record, we are running the 16-core AMD Bulldozer processors (6282SE).

 

Interesting times ahead gaining business confidence back!

Dan Gough
Occasional Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

It seems this might be a curve-ball. Having refreshed my memory on the events around our last crash, the bug stop was generated when we tested the NMI feature post-reboot. As stated, during the hang the NMI failed to do anything. I guess we will have to wait for our HP ticket to escalate!

cgchavero
New Member

Re: DL380G7 Uncorrectable Machine Check Exception

This workaround works for me too. This is the scenario:

 

- HP Proliant DL 160 Gen 8 E5-2603

- HP 1 TB 6GB SAS 7.2K 3.5in SC MDL HDD

- HP Smart Array P420/1GB FBWC Controller

- Sangoma PCI Wildcard A102D (PCI Express 2.0)

- Elastix 2.4

- CentOS release 5.10

- Kernel 2.6.18-371.3.1.el5

 

Right now I can count 40 days without MCEs, and I'm keeping my fingers crossed too.

 

I attach a screenshot from iLO.

Server-Support
Super Advisor

Re: DL380G7 Uncorrectable Machine Check Exception

@Glen Coghlan yes, same here. My HP BL465c G7 blade server, which had been running for more than 2 years, rebooted today during business hours.

Here are the IML log entries:

Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000010, Bank 0x00000004, Status 0xF2000000'00070F0F, Address 0x00000000'00000000, Misc 0x00000000'00000000)
Uncorrectable Chipset Error (Error status 1 0x0018C154, Error status 2 0x00244000)
Uncorrectable Chipset Error (Error status 1 0x0018C160, Error status 2 0x00002040)
Uncorrectable Chipset Error (Error status 1 0x0018C16C, Error status 2 0x20000080)
Uncorrectable Chipset Error (Error status 1 0x0018C170, Error status 2 0x040406FF)
Uncorrectable Chipset Error (Error status 1 0x0018C174, Error status 2 0x00000003)
Uncorrectable Chipset Error (Error status 1 0x0018C178, Error status 2 0x9452EA00)
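The Status value in the Machine Check Exception entry packs several flags into its top bits. Below is a rough sketch that pulls them out, assuming the standard x86 machine-check status register layout (VAL in bit 63 down to PCC in bit 57, error code in bits 15:0). The function name and the `upper_code` field are my own labels, and the meaning of bits 31:16 is vendor-specific, so it is only extracted here, not interpreted.

```python
# Flag positions in an x86 MCi_STATUS register (architectural bits only).
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # Misc register is valid
    "ADDRV": 58,  # Address register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(text):
    """Decode an IML-style status string such as 0xF2000000'00070F0F."""
    value = int(text.replace("'", ""), 16)  # IML splits the halves with '
    flags = {name: bool((value >> bit) & 1) for name, bit in FLAG_BITS.items()}
    return {
        "raw": value,
        "flags": flags,
        "error_code": value & 0xFFFF,          # bits 15:0
        "upper_code": (value >> 16) & 0xFFFF,  # bits 31:16, vendor-specific
    }


if __name__ == "__main__":
    result = decode_mci_status("0xF2000000'00070F0F")
    set_flags = [name for name, on in result["flags"].items() if on]
    print(set_flags, hex(result["error_code"]), hex(result["upper_code"]))
    # → ['VAL', 'OVER', 'UC', 'EN', 'PCC'] 0xf0f 0x7
```

For the entry above, MISCV and ADDRV come out clear, which matches the all-zero Address and Misc fields in the log, while UC and PCC being set is what makes the error uncorrectable and fatal.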

My server ROM is A19 12/08/2012, but according to http://h20565.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c03250482 the system ROM dated 12.31.2011 corrects this issue, and that one is older than mine?

Best regards,

teojaimes
New Member

Re: DL380G7 Uncorrectable Machine Check Exception

@Server-Support, did your issue get resolved? We have now experienced the reboot and "Uncorrectable Machine Check Exception" IML entry on two separate production servers, both DL385 G7s.