3 weeks ago - last edited 3 weeks ago
DL380 Gen9 UMCE might be caused by firmware bug
Under certain circumstances (details are discussed later in this article), the server reboots automatically and iLO records several "Uncorrectable Machine Check Exception" entries like:
"Uncorrectable Machine Check Exception (Board 0, Processor %, APIC ID 0x000000%%, Bank 0x000000%% Status 0x***00000'000C110A, Address 0x********'********, Misc 0x********'********)"
The % in the Processor clause can be 1 or 2; the %% in the APIC clause can be any number; the %% in the Bank clause can be 03/11/12/13; each * can be any hexadecimal digit.
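For reference, the non-elided low half of the Status value can be decoded against the architectural bit layout of the IA32_MCi_STATUS register described in the Intel SDM (Vol. 3, Machine-Check Architecture chapter). A minimal sketch (field names are mine; only the documented bit positions are assumed):

```python
# Decode the architectural fields of an IA32_MCi_STATUS value
# (bit layout per Intel SDM Vol. 3, "Machine-Check Architecture").
def decode_mci_status(status: int) -> dict:
    return {
        "mca_error_code": status & 0xFFFF,          # bits 15:0
        "model_specific": (status >> 16) & 0xFFFF,  # bits 31:16
        "pcc":   bool(status >> 57 & 1),  # processor context corrupt
        "addrv": bool(status >> 58 & 1),  # MCi_ADDR register valid
        "miscv": bool(status >> 59 & 1),  # MCi_MISC register valid
        "en":    bool(status >> 60 & 1),  # error reporting enabled
        "uc":    bool(status >> 61 & 1),  # uncorrected error
        "over":  bool(status >> 62 & 1),  # error overflow
        "val":   bool(status >> 63 & 1),  # register contents valid
    }

# Only the low 32 bits of the logged value are known (the high half is
# redacted as "***00000"), so just decode the error-code fields:
fields = decode_mci_status(0x000C110A)
print(hex(fields["mca_error_code"]))  # 0x110a
print(hex(fields["model_specific"]))  # 0xc
```

The high bits (UC, PCC, etc.) cannot be interpreted here because iLO's log redacts them.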
Hardware configuration:
4*16G RDIMM (2*16G per CPU)
No PCIe devices, no network connection, no input/output devices (the server is accessed via the iLO remote console)
How to trigger or reproduce:
I use this server for data analysis, coding and calculation. The bug was initially caught while starting and then configuring an IDE. After a short test, I found that the stress test program mprime (from https://www.mersenne.org/download/#download, choosing the stress test for memory or for both CPU and memory) could reproduce it reliably.
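For completeness, the kind of load the memory test generates can be approximated with a much cruder sketch of my own: allocate a large buffer, write a pattern, and verify it in a loop. The buffer size and round count below are arbitrary placeholders, and this is no substitute for mprime's FFT-based torture test:

```python
# Crude memory stress/verify loop, loosely in the spirit of a memory
# torture test: fill a buffer with a pattern, read it back, compare.
# Size and rounds are placeholders; mprime itself is far more thorough.
def stress_memory(size_mb: int = 64, rounds: int = 3) -> int:
    buf = bytearray(size_mb * 1024 * 1024)
    errors = 0
    for r in range(rounds):
        pattern = (0xA5 + r) & 0xFF
        for i in range(0, len(buf), 4096):   # touch one byte per page
            buf[i] = pattern
        for i in range(0, len(buf), 4096):
            if buf[i] != pattern:            # mismatch => memory problem
                errors += 1
    return errors

print(stress_memory(size_mb=8, rounds=2))  # 0 on healthy hardware
```

On this server the crash happens before any mismatch is reported, since an uncorrectable MCE reboots the machine outright.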
Process of elimination:
4 different CPUs, 8 different memory modules, 2 power supplies and 2 different servers (for motherboard testing) were used in cross tests to rule out hardware defects.
A. Hardware:
1. The server works properly if only one CPU is installed (1 CPU and 4*16G of memory were installed per test; many tests were run with different hardware combinations).
2. The server works properly if ONLY the CPU is stress tested with 2 CPUs installed, but fails when testing memory or both memory and CPU (mprime lets the user test only the CPU, only memory, or both).
B. Environmental factors:
1. Power: I used 3 different power cords and 3 different power outlets to rule out connection problems. The voltage and GND connections were verified with a new multimeter.
2. Temperature and humidity also comply with HP's technical specifications.
C. Software (OS/Firmware/stress test program):
1. OS: The officially supported CentOS 7 was used for most of the testing, and Arch Linux for tests with the latest kernel.
2. Firmware: The tests were originally based on a 2.5x ROM; the firmware was later updated to 2.74 using SPP 2019.09.
3. Stress test program:
I did not change the stress test program because the bug was initially caught during daily use of the server, not by the stress test process. The program is only used to reproduce the bug and is obviously not its source.
4. Settings in RBSU:
I first did some googling in this forum and changed the power policy following suggestions for similar bugs, but nothing helped. Then, after researching plenty of CPU and memory technical documents, I changed the snooping option to Early Snoop (the default is Home Snoop). It worked!!! Unfortunately, Cluster on Die mode also hits this problem, although under Cluster on Die the server keeps working for several minutes after the stress test starts (in Home Snoop mode the server restarts almost immediately in most cases).
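To compare how the failure pattern changes between snoop modes, the iLO IML entries can be tallied per processor and MCA bank with a small parser. This is my own sketch; the regex follows the message format quoted at the top of this post, and the field names are mine:

```python
import re
from collections import Counter

# Parse iLO "Uncorrectable Machine Check Exception" IML entries and
# count events per (processor, bank). The regex follows the message
# format quoted earlier in this post.
MCE_RE = re.compile(
    r"Uncorrectable Machine Check Exception \(Board (\d+), "
    r"Processor (\d+), APIC ID 0x([0-9A-Fa-f]+), Bank 0x([0-9A-Fa-f]+)"
)

def tally_banks(log_lines):
    counts = Counter()
    for line in log_lines:
        m = MCE_RE.search(line)
        if m:
            proc, bank = int(m.group(2)), int(m.group(4), 16)
            counts[(proc, bank)] += 1
    return counts

sample = [
    'Uncorrectable Machine Check Exception (Board 0, Processor 1, '
    'APIC ID 0x00000010, Bank 0x00000003 Status ...)',
    'Uncorrectable Machine Check Exception (Board 0, Processor 2, '
    'APIC ID 0x00000030, Bank 0x00000011 Status ...)',
]
print(tally_banks(sample))  # Counter({(1, 3): 1, (2, 17): 1})
```

Tallying the banks per snoop mode makes it easy to see whether the same banks (03/11/12/13) keep appearing or the distribution shifts.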
I suspect the problem is caused by the firmware for several reasons:
1. If this type of CPU had a hardware bug, it is highly probable that someone would have filed a bug report before, but I found nothing on Intel's website or anywhere else.
2. The MCE code indicates nothing special according to Intel's reference.
3. Based on Intel's introduction of the Haswell architecture and NUMA's working mechanism, enabling "Node Interleaving" in RBSU should mitigate the problem, but it did not help.
4. It is very implausible that all 4 CPUs used for testing share the same hardware defect.
5. Based on the Linux kernel reference, the Linux system can rarely interfere with the CPU's caching policy, and no relevant bug seems to have been reported before.
6. The fact that Early Snoop mode works well suggests that the motherboard and other devices (storage/input/output) are probably not related to this problem.