
memory dimm problem rx7640

 
likid0
Honored Contributor

Hi,

We have an rx7640 server that freezes very frequently. It leaves no crash dump; even if we do a TOC from the MP, it doesn't leave MCA files in the /var/tombstones directory.

We had some DIMM errors in STM, so we replaced the DIMMs reporting errors: 1A, 1B, 3A, 3B.

After a PDT clear, we did a reset and booted the server. The problem is that the errors are still in STM:



===========================================================================

Memory Error Log Summary

DIMM Location           Error Address     Error Type  Page           Count
----------------------  ----------------  ----------  -------------  -----
Cab 0 Cell 1 DIMM 1A    0x372b1c500       Single-Bit  0x372b1c       1
Cab 0 Cell 1 DIMM 1A    0x13f1a4100       Single-Bit  0x13f1a4       1
Cab 0 Cell 1 DIMM 1A    0x11e2a7500       Single-Bit  0x11e2a7       1
Cab 0 Cell 1 ECHELON 3  0x107b46000       Multi-Bit   0x107b46       N/A
Cab 0 Cell 1 DIMM 1A    0x40ff1a2100      Single-Bit  0x40ff1a2      1
Cab 0 Cell 1 DIMM 1A    0x40fdea6500      Single-Bit  0x40fdea6      1
Cab 0 Cell 1 DIMM 1A    0x40df1ae100      Single-Bit  0x40df1ae      1
Cab 0 Cell 1 DIMM 1A    0x40df1a4500      Single-Bit  0x40df1a4      1
Cab 0 Cell 1 DIMM 1A    0x40d231ad00      Single-Bit  0x40d231a      1
Cab 0 Cell 1 DIMM 1A    0x3dea0100        Single-Bit  0x3dea0        1
Cab 0 Cell 1 DIMM 1A    0x3231c100        Single-Bit  0x3231c        1

System start: Wed Jul 14 01:15:13 2010.
Last error detected: Wed Jul 14 01:15:13 2010.
Logging interval: 900 seconds.
11 address(es) with errors logged in memory error log.

The Logtool Utility provides full details about the memory error log.

Page Deallocation Table (PDT)

DIMM Location           Error Address     Error Type  Page
----------------------  ----------------  ----------  -------------
Cab 0 Cell 1 ECHELON 3  0x107b46000       Multi-Bit   0x107b46

PDT Entries Used: 1
PDT Entries Free: 199
PDT Total Size: 200
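
For anyone triaging a similar log, here is a minimal Python sketch that tallies the summary rows per location and error type. The sample rows are copied from the output above; the regular expression, the function names, and the 12-bit page shift (4 KiB pages) are my own assumptions, inferred from the fact that every Page value shown equals the Error Address with its low three hex digits dropped. It is not an STM or Logtool interface.

```python
import re
from collections import Counter

# A few rows copied from the STM memory error summary (whitespace-separated).
SUMMARY = """\
Cab 0 Cell 1 DIMM 1A 0x372b1c500 Single-Bit 0x372b1c 1
Cab 0 Cell 1 ECHELON 3 0x107b46000 Multi-Bit 0x107b46 N/A
Cab 0 Cell 1 DIMM 1A 0x3dea0100 Single-Bit 0x3dea0 1
"""

# Hypothetical pattern for one summary row; the layout is inferred from the post.
ROW = re.compile(
    r"^(?P<loc>Cab \d+ Cell \d+ (?:DIMM|ECHELON) \w+)\s+"
    r"(?P<addr>0x[0-9a-fA-F]+)\s+(?P<type>\S+)\s+"
    r"(?P<page>0x[0-9a-fA-F]+)\s+(?P<count>\S+)$"
)

def parse_summary(text):
    """Return one dict per row that matches the assumed layout."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append(m.groupdict())
    return rows

def tally(rows):
    """Count logged addresses per (location, error type)."""
    return Counter((r["loc"], r["type"]) for r in rows)

def page_matches(row, page_shift=12):
    """Check that Page == Error Address >> page_shift (assumed 4 KiB pages)."""
    return int(row["addr"], 16) >> page_shift == int(row["page"], 16)

rows = parse_summary(SUMMARY)
for (loc, kind), n in sorted(tally(rows).items()):
    print(f"{loc:24} {kind:10} {n}")
print("page column consistent:", all(page_matches(r) for r in rows))
```

Run against the full summary, this makes it easy to see that all single-bit entries point at DIMM 1A while the only multi-bit entry (and the only PDT entry) is on echelon 3.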


I have several questions.

Do you have to run any more commands to get rid of the old memory errors in STM?

What does "Cab 0 Cell 1 ECHELON 3" mean? DIMMs 3A and 3B?

If I changed the DIMMs and still get the same errors on the same DIMMs, could it be the cell board?





In the FPL I have:

65126 SFW 0,1,0 0 03000f1d10e00000 0000000000000003 SAL_INFO_REC_CLEAR
65125 SFW 0,1,0 0 03000f1c10e00000 0000013000000003 SAL_INFO_REC
65124 SFW 0,1,0 0 03000f1c10e00000 0000013000000003 SAL_INFO_REC
65123 SFW 0,1,0 0 03000f1d10e00000 0000000000000003 SAL_INFO_REC_CLEAR
65122 SFW 0,1,0 0 03000f1c10e00000 0000013000000003 SAL_INFO_REC
65121 SFW 0,1,0 0 03000f1c10e00000 0000013000000003 SAL_INFO_REC
65120 SFW 0,1,0 0 03000f1d10e00000 0000000000000003 SAL_INFO_REC_CLEAR
65119 SFW 0,1,0 0 03000f1c10e00000 0000013000000003 SAL_INFO_REC
65118 SFW 0,1,0 0 03000f1c10e00000 0000013000000003 SAL_INFO_REC
65117 SFW 0,1,0 0 03000f1d10e00000 0000000000000003 SAL_INFO_REC_CLEAR
65116 SFW 0,1,0 0 03000f1c10e00000 0000013000000003 SAL_INFO_REC
65115 SFW 0,1,0 0 03000f1c10e00000 0000013000000003 SAL_INFO_REC
65114 SFW 0,1,4 0 03000f1d14e00000 0000000000000000 SAL_INFO_REC_CLEAR
65113 SFW 0,1,4 1 2e000a6314e00000 0000000000000000 NO_UNCONSUMED_LOGS_FOUND
65112 SFW 0,1,6 0 03000f1d16e00000 0000000000000000 SAL_INFO_REC_CLEAR
65111 SFW 0,1,6 1 2e000a6316e00000 0000000000000000 NO_UNCONSUMED_LOGS_FOUND
65110 SFW 0,1,2 0 03000f1d12e00000 0000000000000000 SAL_INFO_REC_CLEAR
65109 SFW 0,1,2 1 2e000a6312e00000 0000000000000000 NO_UNCONSUMED_LOGS_FOUND
65108 SFW 0,1,0 0 0300122810e00000 0000000000000000 SAL_SET_VECTORS_CS1
65107 SFW 0,1,0 0 0300122710e00000 0000000000000002 SAL_SET_VECTORS_TYPE
65106 214c40e4da020000 ff0f066f001f0300 IPMI Type-02 Event
65106 07/16/2010 23:01:46
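
A rough Python sketch for summarizing FPL entries like the ones above, in case it helps spot which location keeps logging SAL_INFO_REC events. The field names (log id, reporting entity, location, alert level, encoded field, data field, keyword) are my assumption about the column order in this listing, and `parse_fpl` and `count_by_location` are hypothetical helpers, not part of any MP or HP-UX tool.

```python
from collections import Counter

# A few rows copied from the FPL listing in the post.
FPL = """\
65126 SFW 0,1,0 0 03000f1d10e00000 0000000000000003 SAL_INFO_REC_CLEAR
65125 SFW 0,1,0 0 03000f1c10e00000 0000013000000003 SAL_INFO_REC
65114 SFW 0,1,4 0 03000f1d14e00000 0000000000000000 SAL_INFO_REC_CLEAR
65113 SFW 0,1,4 1 2e000a6314e00000 0000000000000000 NO_UNCONSUMED_LOGS_FOUND
"""

def parse_fpl(text):
    """Return (log_id, location, alert_level, keyword) for each SFW row.

    Rows that do not have exactly seven whitespace-separated fields
    (timestamps, IPMI events) are skipped.
    """
    events = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 7 and parts[1] == "SFW":
            log_id, _, location, alert, _, _, keyword = parts
            events.append((int(log_id), location, int(alert), keyword))
    return events

def count_by_location(events):
    """Count events per (location, keyword) pair."""
    return Counter((loc, kw) for _, loc, _, kw in events)

for (loc, kw), n in sorted(count_by_location(parse_fpl(FPL)).items()):
    print(f"{loc:8} {kw:28} {n}")
```

In the listing above, the repeated SAL_INFO_REC entries are all at location 0,1,0, while the other locations only report NO_UNCONSUMED_LOGS_FOUND followed by a clear.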




Thanks
Windows?, no thanks
2 REPLIES
SoorajCleris
Honored Contributor

Re: memory dimm problem rx7640

Hi

==> What does "Cab 0 Cell 1 ECHELON 3" mean? DIMMs 3A and 3B?

Yes.
"Echelon" is the newer memory term that replaces "rank".
An echelon is the smallest allocatable unit of memory.

==> If I changed the DIMMs and still get the same errors on the same DIMMs, could it be the cell board?

Do you have a chance to swap in memory modules from another cell? If you still see the issue after that, you may need to replace the cell board.

Regards,
Sooraj
"UNIX is basically a simple operating system, but you have to be a genius to understand the simplicity" - Dennis Ritchie
likid0
Honored Contributor

Re: memory dimm problem rx7640

So when you have an error on echelon 3, like this multi-bit error:

DIMM Location           Error Address     Error Type  Page
----------------------  ----------------  ----------  -------------
Cab 0 Cell 1 ECHELON 3  0x107b46000       Multi-Bit   0x107b46

should you replace the DIMMs in slots 3A and 3B?

We finally replaced the cell and 2 CPUs, because after replacing the memory the machine still kept freezing.
Windows?, no thanks