ProLiant Servers (ML,DL,SL)

ECC Correctable, Sensor Number 4

mitchhed
Occasional Visitor

We've experienced a crash on a DL160 G5 that seems to be caused by a bad memory module.

Through IPMI we're receiving (correctable) memory error events like this:

SEL Record ID : 0050
Record Type : 02
Timestamp : 04/03/2010 09:22:18
Generator ID : 0002
EvM Revision : 04
Sensor Type : Memory
Sensor Number : 04
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 000cff
Description : Correctable ECC

About 4 to 5 of these events are generated each day.

It's unclear which DIMM this sensor relates to. Sensor number 4 doesn't appear in the sensor listing, where "Memory ECC" is listed as sensor number 2.

It would seem that sensor number 4 refers to a subsensor of the memory sensors, but how can we find out which DIMM slot corresponds to sensor number 4?
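
For anyone reading the Event Data field in a record like the one above, here is a minimal Python sketch of how the three bytes split up under the generic IPMI layout for sensor-specific discrete events. It assumes the "Event Data" value is the three event data bytes in order; the function and table names are made up for illustration, and the offset table lists only the first few Memory sensor offsets.

```python
def decode_event_data(hex_string):
    """Split a 3-byte SEL event data field into its IPMI-defined parts."""
    data = bytes.fromhex(hex_string)
    if len(data) != 3:
        raise ValueError("expected exactly 3 event data bytes")
    ed1, ed2, ed3 = data

    # Bits [7:6] of byte 1 say what byte 2 holds; bits [5:4] say what byte 3 holds.
    byte2_kinds = {0: "unspecified", 1: "previous state/severity",
                   2: "OEM code", 3: "sensor-specific extension"}
    byte3_kinds = {0: "unspecified", 1: "reserved",
                   2: "OEM code", 3: "sensor-specific extension"}

    return {
        "offset": ed1 & 0x0F,                      # sensor-specific event offset
        "byte2_kind": byte2_kinds[(ed1 >> 6) & 0x03],
        "byte3_kind": byte3_kinds[(ed1 >> 4) & 0x03],
        "byte2": ed2,
        "byte3": ed3,
    }


# Partial table of Memory sensor type (0Ch) offsets, for illustration only.
MEMORY_OFFSETS = {0x00: "Correctable ECC", 0x01: "Uncorrectable ECC", 0x02: "Parity"}

decoded = decode_event_data("000cff")
print(MEMORY_OFFSETS.get(decoded["offset"], "unknown offset"), decoded)
```

For 000cff the first byte is 00, so the offset matches "Correctable ECC" but both remaining bytes are flagged as unspecified. Under that reading the record itself does not promise a DIMM identifier, and any meaning of 0c or ff would be vendor-specific.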



3 REPLIES
Sheanshar
Advisor

Re: ECC Correctable, Sensor Number 4

What's the server firmware version?

Check this out:


HP ProLiant DL160 G5 Server Series - "Lower Critical-going low Assertion" and "Lower Critical-going low Deassertion" Events Appears Within the IPMI Log and Shuts Down a Number of Servers
Issue

A population of HP ProLiant DL160 G5 servers may shut down seemingly randomly. Within the IPMI log, variants of the following message may appear:


Generic 02/01/2009 17:51:40 FAN4 ROTOR1 Lower Critical-going low Assertion
Generic 02/01/2009 17:51:41 FAN4 ROTOR1 Lower Critical-going low Deassertion

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=3663526&prodTypeId=12169&objectID=c01884069

The fan in question may vary.
Solution

The issue is resolved in a Single Point release server firmware 2009.4.13, and later regular releases.
mitchhed
Occasional Visitor

Re: ECC Correctable, Sensor Number 4

The server firmware/BIOS is:

version: O12 (04/09/2008)

We'll upgrade this to the latest version.

If the server is said to "shut down" due to this, does that mean it shuts down somewhat gracefully or simply crashes? I ask because when this issue occurs for us, the ProLiant hangs completely, with no video output and no response to the keyboard. The only way to restart the server (besides pulling the power plug) is to hold the power button for at least 5 seconds.

Still, with 4 to 5 of these ECC messages being generated per day, the firmware/BIOS upgrade should quickly indicate, or at least suggest, whether the firmware is related to this issue.
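
To compare the event rate before and after the upgrade, something like the following Python sketch could count correctable ECC events per day from a saved SEL dump. It assumes the dump uses the same "Key : Value" layout as the record posted earlier, that records are separated by blank lines, and that timestamps are MM/DD/YYYY; the function names are hypothetical, and the second record in the sample is invented purely to exercise the code.

```python
from collections import Counter
from datetime import datetime


def parse_sel_records(text):
    """Parse blank-line-separated 'Key : Value' blocks into dictionaries."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so the time of day stays in the value.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def correctable_ecc_per_day(records):
    """Count Memory / Correctable ECC records per calendar day."""
    counts = Counter()
    for rec in records:
        if rec.get("Sensor Type") == "Memory" and rec.get("Description") == "Correctable ECC":
            # Assumption: MM/DD/YYYY HH:MM:SS, as the posted record appears to use.
            stamp = datetime.strptime(rec["Timestamp"], "%m/%d/%Y %H:%M:%S")
            counts[stamp.date()] += 1
    return counts


# First record is taken from the post; the second is a made-up example.
sample = """SEL Record ID : 0050
Timestamp : 04/03/2010 09:22:18
Sensor Type : Memory
Sensor Number : 04
Description : Correctable ECC

SEL Record ID : 0051
Timestamp : 04/03/2010 15:40:02
Sensor Type : Memory
Sensor Number : 04
Description : Correctable ECC
"""

per_day = correctable_ecc_per_day(parse_sel_records(sample))
for day, count in sorted(per_day.items()):
    print(day, count)
```

A daily count that drops to zero after the upgrade date would support the firmware explanation, though it would not by itself rule out a marginal DIMM.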

Thanks
mitchhed
Occasional Visitor

Re: ECC Correctable, Sensor Number 4

The BIOS and firmware have been upgraded to BIOS version: O12 (07/27/2009) and firmware version 3.11.

Since this upgrade there have been no more memory or other errors reported in the system event log.

I hope this was the cause of the crash. Thanks for the advice.