ProLiant Servers (ML,DL,SL)

Re: ML370G2, Online Spare Memory and memory errors

 
Joe Hegyi
Advisor

ML370G2, Online Spare Memory and memory errors

I have a couple of ML370 G2 servers with Online Spare memory. At one time or another, both servers have had bad DIMMs that caused the server to reboot (Uncorrectable memory error, socket x). My question is: when this happens, why doesn't the server fail over to the online spare bank? It seems pretty silly that I bought the extra memory for the spare bank but still have these problems.

Thanks,
Joe
5 REPLIES
Jeff Wilson_3
Occasional Advisor

Re: ML370G2, Online Spare Memory and memory errors

We also have a number of ML370 G2s with the same problem.

Speak to your HP rep about a fix.

When a memory fault is detected, a dialog box is supposed to appear over the O/S and tell you that online spare memory is now in use and that the faulty memory should be replaced. This does not happen.

I have a small test application that causes this fault to occur (usually within 1 hour of testing).

The problem does not appear to be DIMM related.
Mark_498
New Member

Re: ML370G2, Online Spare Memory and memory errors

You can download the latest firmware (it is supposed to fix the problem):

Systems ROMPaq Firmware Upgrade Diskette for ProLiant ML370 G2 (P25) Servers
version 4.07 P25-05/01/2004 (17 May 04)

found at:

http://h18023.www1.hp.com/support/files/server/us/download/20882.html

I have used this on a number of servers and so far the problem has not recurred (although I have not done extensive testing).
Kevin_276
New Member

Re: ML370G2, Online Spare Memory and memory errors

There was an issue with Online Spare memory support on the ML370 G2 and DL380 G2 in which an uncorrectable error could occur simply because Online Spare Mode was enabled, not because of a problem with the DIMM itself. This issue has been addressed with the P25 (05/01/2004) ROM mentioned by Mark. It is recommended that you upgrade to this BIOS version if you are using Online Spare Mode.

However, in the general case, Online Spare Mode reduces the risk of an uncorrectable memory error. If an uncorrectable memory error actually occurs, the server cannot fail over to the online spare bank. Once the error occurs, the data is corrupted, and it is not possible to correct it. With Online Spare Mode, it is impossible to re-create the data once the uncorrectable error has occurred. The system is still protected against single-bit errors and certain classes of multi-bit errors (these are correctable errors), but not against uncorrectable errors. For protection against uncorrectable errors, HP offers Memory Mirroring and RAID Mode on the latest generation of many 500-series and 700-series platforms. These systems do provide protection against uncorrectable errors in that if the error occurs, the system switches to using good memory (for Mirroring) or re-constructing good data (for RAID). In both cases, the system continues to operate normally when an uncorrectable error occurs.

What Online Spare Memory does is switch away from the failing memory to the spare bank if the system detects that a DIMM is receiving correctable errors at a high rate. While the system will operate normally when a DIMM is receiving a high rate of correctable errors, such a DIMM is at a much higher risk of receiving an uncorrectable error, which would result in a system crash. The goal of Online Spare Memory is to deactivate memory that is receiving correctable errors at a high rate and replace it with properly functioning DIMMs in the Online Spare Bank. In this way, Online Spare Mode reduces the chances of the system suffering an uncorrectable error, because it deactivates memory at a high risk of receiving one. If a DIMM were to fail such that it suddenly went completely bad and received an uncorrectable error without ever exceeding HP's defined correctable error threshold rate, the system would still go down. However, if the DIMM were to degrade to a state where it was receiving a high rate of correctable errors, it would be deactivated before it could degrade further and cause the system to receive an uncorrectable error.
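
For anyone who finds it easier to follow as logic, here is a minimal Python sketch of the behaviour described above. It is only an illustration, not HP's firmware logic: the class name, the state strings, and the threshold value are all made up for the example (HP's actual correctable error threshold is not stated in this thread), and it counts errors instead of measuring a rate to keep things short.

```python
from dataclasses import dataclass

# Hypothetical value for illustration only; the real threshold is defined by HP.
CORRECTABLE_THRESHOLD = 10


@dataclass
class OnlineSpareSim:
    """Toy model of a server running in Online Spare Mode."""
    threshold: int = CORRECTABLE_THRESHOLD
    spare_available: bool = True
    correctable_count: int = 0
    state: str = "running"

    def report(self, error: str) -> str:
        """Feed one memory error into the model and return the new state."""
        if self.state == "crashed":
            return self.state
        if error == "uncorrectable":
            # The data is already corrupted, so the spare bank cannot help.
            self.state = "crashed"
        elif error == "correctable":
            self.correctable_count += 1
            if self.correctable_count >= self.threshold and self.spare_available:
                # Too many correctable errors: retire the bank and use the spare.
                self.spare_available = False
                self.correctable_count = 0
                self.state = "running_on_spare"
        else:
            raise ValueError(f"unknown error type: {error!r}")
        return self.state
```

A DIMM that degrades gradually trips the threshold and gets swapped out while the server keeps running. A DIMM that goes bad all at once produces an uncorrectable error before the threshold is ever reached, and the model crashes with the spare bank still unused, which matches what Joe saw.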

While not the same level of protection as Memory Mirroring or RAID, Online Spare Mode does reduce the probability of a system receiving an Uncorrectable Memory error and thus suffering unscheduled downtime.
Walk On!
Jeff Wilson_3
Occasional Advisor

Re: ML370G2, Online Spare Memory and memory errors

In the short term, if you don't feel comfortable loading a new BIOS or don't have the time, disabling the online spare memory will prevent the problem from recurring.

None of our servers have exhibited the fault with a conventional memory setup.



Joe Hegyi
Advisor

Re: ML370G2, Online Spare Memory and memory errors

Thanks to everyone for your help. I've disabled the online spare memory, and so far so good.