11-25-2019 01:04 AM - edited 11-25-2019 01:06 AM
DL380 Gen9 UMCE might be caused by firmware bug
Problem:
Under certain circumstances (discussed in detail later in this post), the server reboots automatically and iLO records several "Machine Check Exception" entries like:
"Uncorrectable Machine Check Exception (Board 0, Processor %, APIC ID 0x000000%%, Bank 0x000000%% Status 0x***00000'000C110A, Address 0x********'********, Misc 0x********'********)"
The % in the Processor field can be 1 or 2; the %% in the APIC ID field can be any number; the %% in the Bank field can be 03, 11, 12, or 13; each * stands for any digit or letter.
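To see which processor and bank show up most often, the iLO entries can be parsed and tallied. The following is a minimal Python sketch, assuming the log text has been exported with one entry per line in exactly the layout quoted above. The names `parse_mce` and `count_by_processor_and_bank` are made up for illustration, and reading the `0x` fields as hexadecimal (so Bank 0x00000011 becomes 17) is an assumption based on the prefix.

```python
import re
from collections import Counter

# Pattern follows the message layout quoted in the post. The comma before
# "Status" is optional because the quoted message has none at that position.
MCE_RE = re.compile(
    r"Uncorrectable Machine Check Exception \("
    r"Board (?P<board>\d+), "
    r"Processor (?P<processor>\d+), "
    r"APIC ID 0x(?P<apic_id>[0-9A-Fa-f]+), "
    r"Bank 0x(?P<bank>[0-9A-Fa-f]+),? "
    r"Status 0x(?P<status>[0-9A-Fa-f']+), "
    r"Address 0x(?P<address>[0-9A-Fa-f']+), "
    r"Misc 0x(?P<misc>[0-9A-Fa-f']+)\)"
)


def parse_mce(line):
    """Return the fields of one MCE entry as a dict, or None if no match."""
    m = MCE_RE.search(line)
    if m is None:
        return None
    entry = {"board": int(m["board"]), "processor": int(m["processor"])}
    # Assumption: the 0x-prefixed fields are hexadecimal; the apostrophe is
    # only a visual separator between the high and low 32 bits.
    for name in ("apic_id", "bank", "status", "address", "misc"):
        entry[name] = int(m[name].replace("'", ""), 16)
    return entry


def count_by_processor_and_bank(lines):
    """Count parsed entries per (processor, bank) pair; skip other lines."""
    counts = Counter()
    for line in lines:
        entry = parse_mce(line)
        if entry is not None:
            counts[(entry["processor"], entry["bank"])] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample: digits that are masked in the post are filled with
    # placeholder values (F or 0), not taken from a real log.
    sample = (
        "Uncorrectable Machine Check Exception (Board 0, Processor 2, "
        "APIC ID 0x00000020, Bank 0x00000011 Status 0xFFF00000'000C110A, "
        "Address 0x00000000'00000000, Misc 0x00000000'00000000)"
    )
    for key, n in count_by_processor_and_bank([sample]).items():
        print(key, n)
```

If the tally is dominated by one processor or one bank, that narrows down which unit is reporting the error; an even spread across both processors would fit the cross-test results better.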
System configuration:
CPU:
2*E5-2650v3
Mem:
4*16G RDIMM (2*16G for each CPU)
Other devices:
No PCIe devices, no internet connection, no input/output devices (the server is accessed through the iLO remote console)
OS:
CentOS 7
How to trigger or reproduce:
I use this server for data analysis, coding, and calculation. The bug was first caught while starting and then configuring an IDE. After a short test, I found that the stress test program mprime (from https://www.mersenne.org/download/#download; choose to stress test MEMORY or both CPU and MEMORY) reproduces it reliably.
Process of elimination:
A. Hardware (CPU/MB/Mem):
4 different CPUs, 8 different memory modules, 2 power supplies, and 2 different servers (to test the motherboard) were cross-tested to rule out a hardware defect.
1. The server works properly when only one CPU is installed (each test used 1 CPU and 4*16G memory; I ran many tests with different hardware combinations).
2. With 2 CPUs installed, the server works properly when ONLY the CPU is stress tested, but fails when memory, or both memory and CPU, are tested (mprime lets the user choose to test only CPU, only memory, or both).
B. Environmental factors:
1. Power: I used 3 different power cords and 3 power outlets to rule out connection problems. The voltage and GND connection were confirmed with a new multimeter.
2. Temperature and humidity also comply with HP's technical spec.
C. Software (OS/Firmware/stress test program):
1. OS:
The officially supported CentOS 7 was used for most of the testing, and Arch Linux for a test with the latest kernel.
2. Firmware:
Originally, the tests were run on a 2.5x ROM; the firmware was then updated to 2.74 using SPP 2019.09.
3. Stress test program:
I did not change the stress test program, because the bug was first caught during daily use of the server, not by the stress test. The program is only used to reproduce the bug and is evidently not its source.
4. Settings in RBSU:
I first searched this forum and changed the power policy as suggested in similar bug reports, but nothing helped. Then, after reading plenty of CPU and memory technical documents, I changed the snooping option to Early snooping (the default is Home snooping). It worked!!! Unfortunately, Cluster on Die mode also hits this problem, although the server seems to run for several minutes after the stress test starts in Cluster on Die mode (in Home snooping mode the server restarts immediately in most cases).
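When switching between these RBSU settings, it can help to confirm what NUMA layout the OS actually sees before starting the stress test. Below is a minimal Python sketch, assuming a Linux system that exposes NUMA nodes under /sys/devices/system/node; the function name `online_numa_nodes` is made up for illustration. As I understand it, Cluster on Die should show more nodes than Home or Early snooping on the same hardware, and Node Interleave should collapse everything into a single node, but those expectations are assumptions to verify, not something the post confirms.

```python
import os
import re


def online_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids found as nodeN entries under sysfs_root.

    Returns an empty list if the directory cannot be read, for example on a
    system without this sysfs path.
    """
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        return []
    node_ids = []
    for name in entries:
        # Only names of the exact form "node<digits>" are NUMA nodes; the
        # directory also holds files such as "online" or "possible".
        m = re.fullmatch(r"node(\d+)", name)
        if m:
            node_ids.append(int(m.group(1)))
    return sorted(node_ids)


if __name__ == "__main__":
    nodes = online_numa_nodes()
    print("NUMA nodes seen by the OS:", nodes)
```

Note that Home snooping and Early snooping are not expected to differ in node count, so this check only tells Cluster on Die and Node Interleave apart from the other modes.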
Discussion:
I suspect the problem is caused by the firmware for several reasons:
1. If this type of CPU had a hardware bug, someone would very likely have filed a bug report online already, but I found nothing on the Intel website or elsewhere.
2. The MCE code indicates nothing special according to the Intel reference.
3. Based on Intel's description of the Haswell architecture and how NUMA works, enabling "Node Interleave" in RBSU should mitigate the problem, but it does not.
4. It is very unlikely that all 4 different CPUs used in testing have the same hardware defect.
5. According to the Linux kernel reference, Linux is rarely involved in the CPU's caching policy, and no relevant bug appears to have been reported before.
6. That Early snooping mode works well suggests the motherboard and other devices (storage/input/output) are probably not related to this problem.
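For point 2, the Status value can be split into the fields the Intel reference describes. The following is a rough Python sketch based on my reading of the IA32_MCi_STATUS layout and the compound error code table in the Intel SDM; the function names are made up for illustration, and the bit positions and code tables should be checked against the current SDM before relying on them. Only the low 32 bits (0x000C110A) are known from the log entry above, so the flag bits are meaningful only when the full 64-bit value is supplied.

```python
# Flag bits of IA32_MCi_STATUS, as I read them from the Intel SDM.
FLAG_BITS = {
    "VAL": 63,    # register holds valid information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context may be corrupt
}

# Sub-field names for the "cache hierarchy" compound code 000F 0001 RRRR TTLL.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "I", 1: "D", 2: "G"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_status(status):
    """Split a 64-bit status value into flag bits and the two error codes."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["model_specific_code"] = (status >> 16) & 0xFFFF
    fields["mca_error_code"] = status & 0xFFFF
    return fields


def decode_cache_hierarchy(mca_error_code):
    """Decode a code of the form 000F 0001 RRRR TTLL, else return None."""
    # Bits 15:13 must be 0 and bits 11:8 must be 0001; bit 12 (F) is ignored.
    if (mca_error_code & 0xEF00) != 0x0100:
        return None
    return {
        "request": REQUEST.get((mca_error_code >> 4) & 0xF, "reserved"),
        "transaction": TRANSACTION.get((mca_error_code >> 2) & 0x3, "reserved"),
        "level": LEVEL[mca_error_code & 0x3],
    }


if __name__ == "__main__":
    low_half = 0x000C110A  # low 32 bits quoted in the log; upper digits are masked
    fields = decode_status(low_half)
    print(hex(fields["mca_error_code"]), hex(fields["model_specific_code"]))
    print(decode_cache_hierarchy(fields["mca_error_code"]))
```

Under this reading, the low 16 bits (0x110A) fall into the cache hierarchy class with generic request and transaction fields, which would fit the observation that the code says nothing specific about the cause.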
11-28-2019 06:32 PM
Re: DL380 Gen9 UMCE might be caused by firmware bug
Has anybody faced a similar problem?
Or could any HPE expert give some advice?
Please help me, thank you.