
ProLiant DL585 G7 DRAM ECC error

 
JuanCH
Frequent Visitor


Hello all,

We have 4 ProLiant DL585 G7 servers, all out of warranty:

- 2 servers are HP ProLiant DL585 G7, configured for 12 cores
- 2 servers are HP ProLiant DL585 G7, configured for 16 cores

All of them have been running continuously since installation, but recently a hardware error has appeared on both of the 12-core machines.

on one:
[6896093.455573] [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
[6896093.468312] EDAC MC1: 1 CE on mc#1csrow#0channel#0 (csrow:0 channel:0 page:0xc9139a offset:0xb30 grain:0 syndrome:0xb903)
[6896093.468317] [Hardware Error]: Error Status: Corrected error, no action required.

on the other:
[6653460.796494] [Hardware Error]: MC4 Error (node 7): DRAM ECC error detected on the NB.
[6653460.809233] EDAC MC7: 1 CE on mc#7csrow#0channel#0 (csrow:0 channel:0 page:0x5531e0b offset:0xfb0 grain:0 syndrome:0x100)
[6653460.809245] [Hardware Error]: Error Status: Corrected error, no action required.
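
For anyone who wants to count these events per memory controller, here is a minimal Python sketch that tallies the "CE" (corrected error) lines from a saved kernel log. It assumes the log lines look exactly like the ones quoted above; the function name `tally_corrected_errors` is made up for illustration and is not part of any EDAC tool.

```python
import re
from collections import Counter

# Matches EDAC corrected-error lines in the format quoted in this post.
# \s+ between "csrow:N" and "channel:N" also tolerates a line wrap there.
CE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE on \S+ "
    r"\(csrow:(?P<csrow>\d+)\s+channel:(?P<channel>\d+)"
)


def tally_corrected_errors(log_text):
    """Sum corrected-error counts per (mc, csrow, channel) found in log_text."""
    totals = Counter()
    for m in CE_RE.finditer(log_text):
        key = (int(m["mc"]), int(m["csrow"]), int(m["channel"]))
        totals[key] += int(m["count"])
    return totals
```

Calling it on the text of a saved dmesg or syslog dump returns a counter keyed by (mc, csrow, channel), which shows at a glance whether the errors all come from the same slot or are spread around.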

The error is being corrected automatically, but it's quite annoying.

Both machines have the same amount of memory (384 GB, 24 modules of 16 GB).
The error appears only at peak performance, that is, under heavy load.
It is independent of the OS: I recently changed from Scientific Linux 7 to CentOS 7.

Any idea why the error is appearing and how to fix it?

Thanks in advance!

2 REPLIES
parnassus
Honored Contributor

Re: ProLiant DL585 G7 DRAM ECC error

IMHO this blog post seems very pertinent (the reported errors were taken from the SL/CentOS Linux dmesg/syslog, weren't they?).


I'm not an HPE Employee
JuanCH
Frequent Visitor

Re: ProLiant DL585 G7 DRAM ECC error

Great, thanks! So I can see that there are modules giving errors. On one of the machines:

mc1: csrow0: mc#1csrow#0channel#0: 83 Corrected Errors

And on the other:

mc7: csrow0: mc#7csrow#0channel#0: 869 Corrected Errors

This is already a big step forward. But how can I physically locate them? Is there a standard naming scheme, or should I swap modules until EDAC no longer reports errors? Thanks in advance for your help!