bl860c memory issue

Brent Henderson
Occasional Visitor

bl860c memory issue

I have a bl860c blade with memory problems, and I need help working out which DIMMs are faulty. If I clear the logs and power on, the following log entries look relevant:

SFW 0 2 0x44800B9000E01ED0 FFFFFFFF005BFF74 MEM_SELFTEST_MBE_IN_RANK
19 Oct 2009 16:37:47
SFW *5 0xC14ADC95DB021EF0 FF3F4070000F0300 SYSTEM_FIRMWARE_ERROR
19 Oct 2009 16:37:47
SFW 0 *5 0xA08000D800E01F00 0000000000000000 MEM_PDT_TABLE_FULL
19 Oct 2009 16:37:47
SFW 0 0 0x000000EF00E00000 0000000000000000 MEM_TEST_READ
SFW 0 2 0x44800B9000E01F20 FFFFFFFF005AFF74 MEM_SELFTEST_MBE_IN_RANK
19 Oct 2009 16:37:48
SFW 0 2 0x44800B9000E01F40 FFFFFFFF005BFF74 MEM_SELFTEST_MBE_IN_RANK
19 Oct 2009 16:37:48
SFW *5 0xC14ADC95DC021F60 FF3F4070000F0300 SYSTEM_FIRMWARE_ERROR
19 Oct 2009 16:37:48
SFW 0 *5 0xA08000D800E01F70 0000000000000000 MEM_PDT_TABLE_FULL
19 Oct 2009 16:37:48
SFW 0 *3 0x608000B900E01F90 0000000000000000 MEM_ENOUGH_MEM_FAILED
19 Oct 2009 16:37:48

There are lots more messages, but these seemed the most important. Sometimes the system will make it to the EFI shell, but most of the time it won't. When it does make it, it won't last for long once I try running things. :)

What is the best approach to determining which DIMMs are bad so I can have them replaced? I have extra DIMMs, but I'll need to send them to the technician who can touch the box, as it is 400 miles away from me at the moment. :) I do have MP/iLO access though, so if I can disable DIMMs selectively, that would be a good option.
4 REPLIES
Torsten.
Acclaimed Contributor

Re: bl860c memory issue

MEM_PDT_TABLE_FULL

PDT is the page deallocation table, a list of "bad" blocks in memory. The message means the table is full: there are too many bad blocks to record.


What do you get from the EFI command

Shell> info mem

Example:

Shell> info mem

MEMORY INFORMATION

Extender 0:
     ---- DIMM A -----  ---- DIMM B -----  ---- DIMM C -----  ---- DIMM D -----
     DIMM   Current     DIMM   Current     DIMM   Current     DIMM   Current
---  ------ ----------  ------ ----------  ------ ----------  ------ ----------
  0  2048MB Active      2048MB Active      2048MB Active      2048MB Active
  1  1024MB Active      1024MB Active      1024MB Active      1024MB Active
  2  1024MB Active      1024MB Active      1024MB Active      1024MB Active



Are any DIMMs shown as not Active?

What OS is installed? Are you able to boot?

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

Torsten.
Acclaimed Contributor

Re: bl860c memory issue

The example above was from a different system model; this one is from a similar system:

MEMORY INFORMATION

     ---- DIMM A -----  ---- DIMM B -----
     DIMM   Current     DIMM   Current
---  ------ ----------  ------ ----------
  0  4096MB Active      4096MB Active
  1  4096MB Active      4096MB Active
  2  ----               ----
  3  ----               ----
  4  ----               ----
  5  ----               ----
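
To answer the "any not Active?" question without reading the table by eye, captured `info mem` output can be scanned with a short script. This is only a sketch based on the two sample layouts in this thread: the function name is made up, it assumes each populated slot prints a size followed by a one-word status, that empty slots print a single `----`, and that a slot is named by row number plus column letter (as in "DIMM 5A"). The status "Failed" in the usage example is hypothetical; the thread does not show what a bad DIMM's status actually looks like.

```python
import re

def find_inactive_dimms(info_mem_text):
    """Return (slot, size, status) for each populated DIMM whose status is not 'Active'."""
    columns = []   # column letters taken from the "---- DIMM A -----" header line
    problems = []
    for line in info_mem_text.splitlines():
        # The header line names the columns; "DIMM Current" lines do not match.
        names = re.findall(r"DIMM ([A-Z])\b", line)
        if names and "----" in line:
            columns = names
            continue
        tokens = line.split()
        # Data rows start with a numeric row index; skip everything else.
        if not tokens or not tokens[0].isdigit() or not columns:
            continue
        row, rest = tokens[0], tokens[1:]
        i = 0
        for letter in columns:
            if i >= len(rest):
                break
            if rest[i] == "----":
                # Assumption: an empty slot prints one placeholder token.
                i += 1
                continue
            # Assumption: a populated slot prints "<size> <status>".
            size = rest[i]
            status = rest[i + 1] if i + 1 < len(rest) else "?"
            if status != "Active":
                problems.append((row + letter, size, status))
            i += 2
    return problems

# Usage with a made-up capture ("Failed" is a hypothetical status string):
sample = """---- DIMM A ----- ---- DIMM B -----
DIMM Current DIMM Current
--- ------ ---------- ------ ----------
0 4096MB Active 4096MB Active
1 4096MB Active 4096MB Failed
2 ---- ----
"""
print(find_inactive_dimms(sample))
```

Because the parser splits on whitespace, it does not depend on the exact column alignment of the pasted output.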


Hope this helps!
Regards
Torsten.

Brent Henderson
Occasional Visitor

Re: bl860c memory issue

OK, apparently there are two ways to find out which DIMMs are bad. The first was the console: during one (and only one!) of my boot attempts, the bad DIMMs were clearly called out there:

2 0 0x000B90 0xFFFFFFFF005AFF74 uncorrectable error in DIMM 5A during selftest
2 0 0x000B90 0xFFFFFFFF005BFF74 uncorrectable error in DIMM 5B during selftest

This led me to the second approach, which is probably better. From the MP, select 'SL' to view the logs. From there, select 'E' for system events. Now select 'T' to show the events in text mode. Finally, select 'L' to show the last message first and just keep hitting return. You should then see decoded versions of the log messages I showed originally:

Log Entry 522: 19 Oct 2009 21:19:54
Alert Level 3: Warning
Keyword: SEL_ALMOST_FULL
System event log almost full
Logged by: Baseboard Management Controller;
Sensor: Event Logging Disabled
Data2: PRV State: 0x1F OEM Code2: 0x64
0x204ADCD7FA023FF0 641F647000100300


Log Entry 521:
Alert Level 2: Informational
Keyword: MEM_SELFTEST_MBE_IN_RANK
uncorrectable ECC error in DIMM during selftest
Logged by: System Firmware 0
Data: Location - Memory (SIMM or DIMM): DIMM Slot 0x5A, Extender 0
0x44800B9000E03FE0 FFFFFFFF005AFF74
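
The decoded entry above suggests the slot can also be read straight from the raw log lines: the data word FFFFFFFF005AFF74 maps to DIMM Slot 0x5A, and FFFFFFFF005BFF74 to 5B. Here is a rough sketch that pulls the slots out of raw entries in the format from the first post. Note the byte position (third byte from the right) is inferred only from these two examples, not from any HP documentation, and the function name is made up.

```python
import re

# Matches raw entries such as:
#   SFW 0 2 0x44800B9000E01ED0 FFFFFFFF005BFF74 MEM_SELFTEST_MBE_IN_RANK
ENTRY = re.compile(r"0x[0-9A-Fa-f]{16}\s+([0-9A-Fa-f]{16})\s+(\w+)")

def failing_dimm_slots(log_text, keyword="MEM_SELFTEST_MBE_IN_RANK"):
    """Return the sorted DIMM slot codes named by selftest MBE events."""
    slots = set()
    for data_hex, name in ENTRY.findall(log_text):
        if name != keyword:
            continue
        data = int(data_hex, 16)
        # Assumption inferred from this thread: bits 16..23 hold the slot code.
        slots.add(f"{(data >> 16) & 0xFF:02X}")
    return sorted(slots)

raw_log = """SFW 0 2 0x44800B9000E01ED0 FFFFFFFF005BFF74 MEM_SELFTEST_MBE_IN_RANK
SFW *5 0xC14ADC95DB021EF0 FF3F4070000F0300 SYSTEM_FIRMWARE_ERROR
SFW 0 2 0x44800B9000E01F20 FFFFFFFF005AFF74 MEM_SELFTEST_MBE_IN_RANK
"""
print(failing_dimm_slots(raw_log))
```

Events with other keywords are ignored, since the thread gives no decoding for their data words.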
Brent Henderson
Occasional Visitor

Re: bl860c memory issue

I believe the most expedient way to find the bad DIMMs is covered in my previous post. I have had success this morning getting the system to the EFI shell by powering off the system, clearing the system logs, and then powering back on. Not sure if this is specific to my situation or not, but it might be something to try if someone else has a misbehaving system that won't even get to the EFI prompt for more debugging. :)

1. Log in to the MP and type 'cm' for the command menu.
2. Then type 'pc' for the power control menu.
3. From there, type 'off' to turn the system off.
4. Then type 'ma' to go back to the main menu.
5. Now type 'sl' to enter the show logs menu.
6. Select 'c' to clear all logs and confirm that you want to do it.
7. Now back to the main menu with a 'Q' to quit (^b will also work).
8. Then back into the power control menu with a 'pc'.
9. Finally turn the system back on with an 'on' and confirm it.
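
For anyone who wants to hand the steps above to a remote technician or feed them to a console-scripting tool, the sequence can be written down as data. This is just a sketch: the names `MP_STEPS` and `render_checklist` are made up, and the confirmation prompts are only flagged, because the exact prompt text and expected answer are not shown in this thread.

```python
# (command typed at the MP, purpose, whether the MP asks for confirmation)
MP_STEPS = [
    ("cm",  "enter the command menu",          False),
    ("pc",  "enter the power control menu",    False),
    ("off", "turn the system off",             False),
    ("ma",  "return to the main menu",         False),
    ("sl",  "enter the show logs menu",        False),
    ("c",   "clear all logs",                  True),
    ("Q",   "quit back to the main menu",      False),  # ^b also works per step 7
    ("pc",  "re-enter the power control menu", False),
    ("on",  "turn the system back on",         True),
]

def render_checklist(steps):
    """Format the steps as numbered lines, marking those that need confirmation."""
    lines = []
    for number, (command, purpose, needs_confirm) in enumerate(steps, start=1):
        suffix = " (then confirm)" if needs_confirm else ""
        lines.append(f"{number}. {command}: {purpose}{suffix}")
    return lines

print("\n".join(render_checklist(MP_STEPS)))
```

Keeping the confirmation flag separate from the command means a script would have to stop and handle those prompts explicitly rather than guess at the answer.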