Memory Errors on a D350

Pedro Cirne
Esteemed Contributor

Is there a way to identify which DIMM is reporting errors on a D350? STM reports errors on EXT0 0a/0b, but is it 0a or 0b?

Thanks in advance,

Log.  Phys                Size                                    Interleave
Bank  Bank  Board         (MB)  Disabled?  SPA Range (Hex)        Group
----  ----  ------------  ----  ---------  ---------------------  -----
9     0     EXT0 0a/0b    128   enabled    0x10000000 - 1fffffff  1
10    1     EXT0 0a/0b    128   enabled    0x10000000 - 1fffffff  1
1     0     EXT0 1a/1b    64    enabled    0x00000000 - 0fffffff  0
6     1     EXT0 1a/1b    64    enabled    0x00000000 - 0fffffff  0
3     0     EXT0 2a/2b    64    enabled    0x00000000 - 0fffffff  0
4     1     EXT0 2a/2b    64    enabled    0x00000000 - 0fffffff  0

Memory Error Log Summary

Board         Error Address       Error Type  Page       Count
------------  ------------------  ----------  ---------  -----
EXT0 2a       0x000000000d936e30  Single-Bit  0x000d936  2
EXT0 0a/0b    0x00000000134dc000  Multi-Bit   0x00134dc  0
EXT0 0a/0b    0x0000000014809000  Multi-Bit   0x0014809  0
EXT0 0a/0b    0x0000000014038000  Multi-Bit   0x0014038  0
EXT0 0a/0b    0x00000000134d0000  Multi-Bit   0x00134d0  0
EXT0 0a/0b    0x00000000149be000  Multi-Bit   0x00149be  0
EXT0 0a/0b    0x000000001e464000  Multi-Bit   0x001e464  0
EXT0 0a/0b    0x00000000134f0000  Multi-Bit   0x00134f0  0
EXT0 0a/0b    0x0000000014af0000  Multi-Bit   0x0014af0  0
EXT0 0a/0b    0x0000000013d6f000  Multi-Bit   0x0013d6f  0
EXT0 0a/0b    0x0000000012cc0000  Multi-Bit   0x0012cc0  0
EXT0 0a/0b    0x0000000012b26000  Multi-Bit   0x0012b26  0
EXT0 0a/0b    0x0000000015bf9000  Multi-Bit   0x0015bf9  0
EXT0 0a/0b    0x00000000144cd000  Multi-Bit   0x00144cd  0
EXT0 0a/0b    0x0000000015756000  Multi-Bit   0x0015756  0
EXT0 0a/0b    0x000000001037c000  Multi-Bit   0x001037c  0
EXT0 0a/0b    0x0000000014b0c000  Multi-Bit   0x0014b0c  0
EXT0 0a/0b    0x0000000014a42000  Multi-Bit   0x0014a42  0
EXT0 0a/0b    0x0000000018b6f000  Multi-Bit   0x0018b6f  0
EXT0 0a/0b    0x0000000013ed4000  Multi-Bit   0x0013ed4  0
EXT0 0a/0b    0x000000001506e000  Multi-Bit   0x001506e  0

Kent Ostby
Honored Contributor

I believe that HP tends to replace memory SIMMs in pairs.

I'm not sure it would work, but you could always replace one of the SIMMs and see if things clear up.

Best regards,

Kent M. Ostby

Shaikh Imran
Honored Contributor

It is advisable to replace both memory modules in this case. If you prefer, though, you can replace just one of them with a module from a working slot, tagging the modules so you can keep track of them, and then isolate exactly which one is giving the error.

Bill Hassell
Honored Contributor

Because of the design of the memory controller, memory interleaving prevents identifying the exact SIMM. That's why SIMMs must be installed in pairs and must be the same size for each pair. As Kent mentioned, replace one and see if the errors disappear. Be sure to clear the logs so old errors won't be seen.
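The bank table in the first post shows the same thing from the data side: both banks on EXT0 0a/0b cover the same SPA range, so an error address narrows the fault to the pair but no further. A minimal Python sketch of that lookup follows. It only restates the table from the STM output and does not model how the D350 memory controller actually interleaves; the names `BANKS` and `banks_for_address` are made up for illustration, and the upper bounds printed as `1fffffff` and `0fffffff` are assumed to mean `0x1fffffff` and `0x0fffffff`.

```python
# Bank table copied from the STM output in the first post:
# (logical bank, physical bank, board, SPA start, SPA end, group)
BANKS = [
    (9, 0, "EXT0 0a/0b", 0x10000000, 0x1FFFFFFF, 1),
    (10, 1, "EXT0 0a/0b", 0x10000000, 0x1FFFFFFF, 1),
    (1, 0, "EXT0 1a/1b", 0x00000000, 0x0FFFFFFF, 0),
    (6, 1, "EXT0 1a/1b", 0x00000000, 0x0FFFFFFF, 0),
    (3, 0, "EXT0 2a/2b", 0x00000000, 0x0FFFFFFF, 0),
    (4, 1, "EXT0 2a/2b", 0x00000000, 0x0FFFFFFF, 0),
]


def banks_for_address(addr):
    """Return the sorted list of boards whose SPA range contains addr."""
    return sorted({board for _, _, board, lo, hi, _ in BANKS if lo <= addr <= hi})


# A multi-bit error address from the log: only the 0a/0b pair matches.
print(banks_for_address(0x134DC000))
# The single-bit error address: the range is shared by two pairs.
print(banks_for_address(0x0D936E30))
```

In neither case does the address range by itself name a single module, which is consistent with STM reporting the multi-bit errors against the pair.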

Bill Hassell, sysadmin

Tonya Underwood
Regular Advisor

If these had been Single Bit errors, the exact memory module would have been apparent. If you look at the first error, you see that it is single-bit and identifies EXT0 DIMM 2A.

Many believe that replacing memory in pairs is necessary. However, it is not, unless you have multi-bit errors, in which case you cannot tell which DIMM is the offending one. Of course, the newer Matterhorns, etc. require that you replace the entire rank (x4).

In this case I would replace both DIMMs 0A and 0B on Extender 0.
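To see at a glance which boards account for which kind of error, the pasted log summary can be tallied. This is a minimal Python sketch, assuming six whitespace-separated fields per line as they appear in this thread (real STM output may be laid out differently). `LOG` holds only the first three rows as a sample, and `tally` is a hypothetical helper name.

```python
from collections import Counter

# Sample rows in the shape of the "Memory Error Log Summary" in this thread.
LOG = """\
EXT0 2a 0x000000000d936e30 Single-Bit 0x000d936 2
EXT0 0a/0b 0x00000000134dc000 Multi-Bit 0x00134dc 0
EXT0 0a/0b 0x0000000014809000 Multi-Bit 0x0014809 0
"""


def tally(log_text):
    """Count log entries per (board, error type)."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip headers, separators, and blank lines
        board = " ".join(fields[:2])  # e.g. "EXT0 0a/0b" or "EXT0 2a"
        error_type = fields[3]        # "Single-Bit" or "Multi-Bit"
        counts[(board, error_type)] += 1
    return counts


for (board, error_type), n in tally(LOG).items():
    print(board, error_type, n)
```

Run against the full log from the first post, the single-bit entry is the only one that names an individual module (2a); every multi-bit entry is reported against the 0a/0b pair.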

Tonya Underwood

Dave Unverhau_1
Honored Contributor

The procedure we have used (when the opportunity presented itself and we didn't have two replacement SIMMs to use) was to swap one of the SIMMs of the offending pair with one of another pair (same size and type, of course) and keep an eye on the logs. If the errors move to a different bank, the one you moved is the bad one. If the errors stay on the same bank, it's the other one (or the carrier itself -- happens sometimes).

Occasionally, the errors vanish, in which case we might be able to attribute the errors to dirty contacts.
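The three outcomes of the swap test can be written down as a small decision function. This is only a sketch of the reasoning above, not a diagnostic tool: `diagnose_swap` and its arguments are hypothetical names, and it assumes you have cleared the logs before the swap and then noted which pairs show new errors.

```python
def diagnose_swap(original_pair, new_pair, pairs_with_errors):
    """Interpret the logs after moving one SIMM from original_pair to new_pair.

    pairs_with_errors: the pairs that logged new errors after the swap.
    """
    errors = set(pairs_with_errors)
    on_original = original_pair in errors
    on_new = new_pair in errors

    if not on_original and not on_new:
        return "errors vanished: possibly dirty contacts"
    if on_new and not on_original:
        return "errors moved: the SIMM you moved is suspect"
    if on_original and not on_new:
        return "errors stayed: the other SIMM (or the carrier) is suspect"
    return "inconclusive: errors on both pairs"


# Example: one SIMM moved from 0a/0b to 1a/1b, errors now show up on 1a/1b.
print(diagnose_swap("EXT0 0a/0b", "EXT0 1a/1b", ["EXT0 1a/1b"]))
```

The last branch covers a case the procedure does not mention (errors on both pairs afterwards), where the swap by itself does not settle the question.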

Good Luck!

Best Regards,

Pedro Cirne
Esteemed Contributor

