
rx2620 Memory problem

 
Rikki hinn Ogurlegi
Frequent Advisor


Hello all,

 

I have an rx2620 that has been running OpenVMS for the last 7 years. This morning it died and will not boot again.

Nobody touched the machine. The MP log suggests the machine does not see any memory at all.

 

Info about the machine:

 

[redb2] MP:CM> sysrev


SYSREV

Current firmware revisions

MP FW     : E.03.30
BMC FW    : 04.01
System FW : 04.10


 

The System Event log has:

 

363   SFW  0   0  0x148002C500E02180 0000000000000000 BOOT_REBOOT
                                                      30 Aug 2013 09:19:04
364   SFW  0  *3  0x64800FA000E021A0 FFFFFFFF003BFF74 MEM_CHIPSPARE_DEALLOC_RANK
                                                      30 Aug 2013 09:19:11
365   SFW  0  *3  0x64800FA000E021C0 FFFFFFFF001BFF74 MEM_CHIPSPARE_DEALLOC_RANK
                                                      30 Aug 2013 09:19:11
366   SFW     *5  0xC15220638F0221E0 FF3F4070000F0300 Type-02 0f7000 1011712
                                                      30 Aug 2013 09:19:11
367   SFW  0  *7  0xE08000D100E021F0 0000000000000000 MEM_NO_MEM_FOUND
                                                      30 Aug 2013 09:19:11
368   SFW     *5  0xC15220638F022210 FF3F4070000F0300 Type-02 0f7000 1011712
                                                      30 Aug 2013 09:19:11
369   SFW  0  *7  0xF480003700E02220 000000000000000F BOOT_HALT_CELL
                                                      30 Aug 2013 09:19:11
370   BMC      2  0x20522063EF022240 0180A37000120300 Type-02 127003 1208323
                                                      30 Aug 2013 09:20:47
371   BMC      2  0x2000000001022250 0150A17000120300 Type-02 127001 1208321
                                                                  00:00:01
372   BMC      2  0x2000000007022260 FFFF006F01050300 Type-02 056f00 356096
                                                                  00:00:07

373   BMC     *3  0x2000000007022270 FFFF010302050300 Type-02 050301 328449
                                                                  00:00:07
374   BMC      2  0x2000000007022280 FFFF018302050300 Type-02 058301 361217
                                                                  00:00:07
375   MP   0   2  0x5E800A7A00E02290 0000000000000000 MP_SELFTEST_RESULT
                                                                  00:00:11
376   BMC      2  0x20522066390222B0 FFFF0103FDC00300 Type-02 c00301 12583681
                                                      30 Aug 2013 09:30:33
377   BMC      2  0x205220663B0222C0 FFFF006F04140300 Type-02 146f00 1339136
                                                      30 Aug 2013 09:30:35
378   BMC      2  0x205220663B0222D0 0401A37004120300 Type-02 127003 1208323
                                                      30 Aug 2013 09:30:35
379   BMC      2  0x205220663E0222E0 FFFF027000120300 Type-02 127002 1208322
                                                      30 Aug 2013 09:30:38
380   BMC      2  0x205220663F0222F0 FFFF0108F10D0300 Type-02 0d0801 854017
                                                      30 Aug 2013 09:30:39
381   BMC      2  0x205220663F022300 FFFF0108F20D0300 Type-02 0d0801 854017
                                                      30 Aug 2013 09:30:39
382   BMC      2  0x205220663F022310 FFFF0108F30D0300 Type-02 0d0801 854017
                                                      30 Aug 2013 09:30:39
383   BMC      2  0x205220663F022320 FFFF006FFA220300 Type-02 226f00 2256640
                                                      30 Aug 2013 09:30:39
384   SFW      2  0xC152206648022330 FFFF000A001D0300 Type-02 1d0a00 1903104
                                                      30 Aug 2013 09:30:48
385   SFW  0   1  0x5480006300E02340 0000000000000000 BOOT_START
                                                      30 Aug 2013 09:30:48
386   BMC      2  0x2052206659022360 FFFF027000120300 Type-02 127002 1208322
                                                      30 Aug 2013 09:31:05
387   SFW  0   0  0x148002C500E02370 0000000000000000 BOOT_REBOOT
                                                      30 Aug 2013 09:31:13
388   SFW  0  *3  0x64800FA000E02390 FFFFFFFF003BFF74 MEM_CHIPSPARE_DEALLOC_RANK
                                                      30 Aug 2013 09:31:23
389   SFW  0  *3  0x64800FA000E023B0 FFFFFFFF001BFF74 MEM_CHIPSPARE_DEALLOC_RANK
                                                      30 Aug 2013 09:31:23
390   SFW     *5  0xC15220666B0223D0 FF3F4070000F0300 Type-02 0f7000 1011712
                                                      30 Aug 2013 09:31:23
391   SFW  0  *7  0xE08000D100E023E0 0000000000000000 MEM_NO_MEM_FOUND
                                                      30 Aug 2013 09:31:23
392   SFW     *5  0xC15220666B022400 FF3F4070000F0300 Type-02 0f7000 1011712
                                                      30 Aug 2013 09:31:23
393   SFW  0  *7  0xF480003700E02410 000000000000000F BOOT_HALT_CELL
                                                      30 Aug 2013 09:31:23
394   BMC      2  0x20522067D2022430 FFFF006F04140300 Type-02 146f00 1339136
                                                      30 Aug 2013 09:37:22
395   BMC      2  0x20522067D7022440 040EA37004120300 Type-02 127003 1208323
                                                      30 Aug 2013 09:37:27

 

According to some googling, this seems to be caused by either memory that is installed incorrectly or a bad DIMM.

Since this machine has not been opened in 7 years or so, I hope it is the latter. But I guess anything could be wrong.

 

Can anybody here see what the issue is?

 

Thanks in advance,

Richard.

Rikki hinn Ogurlegi
Frequent Advisor

Re: rx2620 Memory problem

Going further back in the log reveals lots and lots of:

 

321   SFW      2  0xC1520B6D7C021DD0 108F6070830C0300 Type-02 0c7000 815104
                                                      14 Aug 2013 11:43:56
322   SFW  0   2  0x448000A700E01DE0 FFFFFFFF002BFF74 MEM_CORR_ERR
                                                      14 Aug 2013 11:43:56
323   SFW      2  0xC1520BA23B021E00 108F6070830C0300 Type-02 0c7000 815104
                                                      14 Aug 2013 15:28:59
324   SFW  0   2  0x448000A700E01E10 FFFFFFFF002BFF74 MEM_CORR_ERR
                                                      14 Aug 2013 15:28:59

and:

 

325   SFW  0   2  0x408002B600E01E30 0000000000000000 MEM_PDT_SBE_PROMOTE
                                                      14 Aug 2013 15:28:59

But I can't tell which DIMM has the issue from this.

 

Robert_Jewell
Honored Contributor

Re: rx2620 Memory problem

But I can't tell which DIMM has the issue from this.

 

The DIMM location in these logs is 2B:

 

FFFFFFFF002BFF74

 

00 = cell or board

2B = DIMM slot

 

Also note that the rx2620 uses DIMMs in quads, so if you remove 2B you will need to remove 2A, 3A and 3B as well.
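
The decoding above can be sketched in a few lines of Python. This is only an illustration, not an HP tool: the function names `decode_location` and `quad_members` are made up here, the byte positions are taken from the example data field in this post, and the pairing of rows 0/1, 2/3 (and presumably 4/5) into quads is an assumption based on the slots named in this thread.

```python
def decode_location(data_field: str) -> tuple:
    """Split a 16-hex-digit SEL data field into (cell_or_board, dimm_slot)."""
    # Assumed layout, read off the example FFFFFFFF002BFF74:
    # hex digits 8-9 = cell or board, hex digits 10-11 = DIMM slot.
    if len(data_field) != 16:
        raise ValueError("expected a 16-hex-digit data field")
    cell = int(data_field[8:10], 16)
    slot = data_field[10:12].upper()
    return cell, slot


def quad_members(slot: str) -> list:
    """Return all four slots of the quad that contains `slot`."""
    # Assumption: rows are paired as (0,1), (2,3), (4,5), each with sides A and B.
    row = int(slot[0])
    first_row = row - (row % 2)
    return [f"{r}{side}" for r in (first_row, first_row + 1) for side in "AB"]


print(decode_location("FFFFFFFF002BFF74"))  # (0, '2B')
print(quad_members("2B"))                   # ['2A', '2B', '3A', '3B']
```

With the MEM_CORR_ERR data field from the log this gives slot 2B, and the quad that has to be pulled together with it.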

 

-Bob

Matti_Kurkela
Honored Contributor

Re: rx2620 Memory problem

You're viewing the System Event Log in Keyword Mode, which is the default mode. Switch the log viewer to Text Mode, and you may get the error messages in a more verbose form.

 

In an rx2620, memory must be installed in quads, i.e. sets of 4 DIMMs. There are 12 DIMM slots, i.e. 3 quads in total. (However, if 4 GB DIMMs are used, only 2 quads can be populated.)

 

MEM_CORR_ERR is a memory error that is correctable by the ECC subsystem. An uncorrectable memory error would cause the system to immediately crash. The system maintains a persistent Page Deallocation Table (PDT) that will be used to lock out the memory areas that are producing a lot of errors.

 

MEM_CHIPSPARE_DEALLOC_RANK indicates the system is rejecting a DIMM (or a quad), either because of an outright failure or because it has too high a frequency of correctable memory errors. Your system logs two rejection messages per boot attempt: if those two rejected DIMMs are in different quads, and there are only 2 quads of DIMMs installed, it would mean that there are no more "good" quads available in the system. That might explain why the system cannot boot.
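
The reasoning about "no good quads left" can be written down as a small Python sketch. Everything here is an assumption for illustration: the helper name `usable_quads` is made up, the quad layout (rows 0/1 and 2/3) is inferred from the slot names used in this thread, the rejected slots 3B and 1B come from reading the two data fields (...003BFF74 and ...001BFF74) the same way as the 2B example, and the rule that one rejected DIMM takes its whole quad out of service is the hypothesis above, not documented firmware behaviour.

```python
def usable_quads(populated_quads, rejected_slots):
    """Return the populated quads that contain no rejected slot."""
    rejected = set(rejected_slots)
    # A quad is kept only if none of its four slots has been rejected.
    return [quad for quad in populated_quads if not rejected & set(quad)]


# Assumed configuration: 8 DIMMs installed as two quads.
populated = [
    ("0A", "0B", "1A", "1B"),
    ("2A", "2B", "3A", "3B"),
]

print(usable_quads(populated, ["3B", "1B"]))  # []
print(usable_quads(populated, ["3B"]))        # [('0A', '0B', '1A', '1B')]
```

An empty result corresponds to the MEM_NO_MEM_FOUND situation: every populated quad contains at least one rejected DIMM.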

MK
Robert_Jewell
Honored Contributor

Re: rx2620 Memory problem

That is exactly what looks to be going on, Matti.

 

364   SFW  0  *3  0x64800FA000E021A0 FFFFFFFF003BFF74 MEM_CHIPSPARE_DEALLOC_RANK

365   SFW  0  *3  0x64800FA000E021C0 FFFFFFFF001BFF74 MEM_CHIPSPARE_DEALLOC_RANK

 

DIMMs 3B and 1B are being deallocated, and each of them resides in a separate quad. Those are likely the only two quads installed.

 

What I would suggest, Richard, is to make up one quad of good DIMMs in slots 0A/B and 1A/B to get the system to boot. Leave out the DIMMs that are currently in 1B and 3B, and also the one in 2B (the one logging the correctable errors). Once you get replacement DIMMs you can install them as needed.

 

-Bob

Rikki hinn Ogurlegi
Frequent Advisor

Re: rx2620 Memory problem

Ah, thank you all so much. It's good to finally understand those messages. I have over a year's worth of the event log in a text file on my laptop, and now it is plain as day what is going on:

 

[ra@hamburger ~]$ grep MEM_CHIPSPARE_DEALLOC_RANK event.log
338   SFW  0  *3  0x64800FA000E01F40 FFFFFFFF003BFF74 MEM_CHIPSPARE_DEALLOC_RANK
339   SFW  0  *3  0x64800FA000E01F60 FFFFFFFF001BFF74 MEM_CHIPSPARE_DEALLOC_RANK
351   SFW  0  *3  0x64800FA000E02070 FFFFFFFF003BFF74 MEM_CHIPSPARE_DEALLOC_RANK
352   SFW  0  *3  0x64800FA000E02090 FFFFFFFF001BFF74 MEM_CHIPSPARE_DEALLOC_RANK
364   SFW  0  *3  0x64800FA000E021A0 FFFFFFFF003BFF74 MEM_CHIPSPARE_DEALLOC_RANK
365   SFW  0  *3  0x64800FA000E021C0 FFFFFFFF001BFF74 MEM_CHIPSPARE_DEALLOC_RANK
388   SFW  0  *3  0x64800FA000E02390 FFFFFFFF003BFF74 MEM_CHIPSPARE_DEALLOC_RANK
389   SFW  0  *3  0x64800FA000E023B0 FFFFFFFF001BFF74 MEM_CHIPSPARE_DEALLOC_RANK

 

1B and 3B are deallocated, and yes, since the system has only 8 DIMMs (two quads), all memory is now gone.

 

And then there is the third DIMM that is giving headaches:

 

[ra@hamburger ~]$ grep -c MEM_CORR_ERR event.log
145

[ra@hamburger ~]$ grep  MEM_CORR_ERR event.log | grep -v FFFFFFFF002BFF74 | wc -l
0

At least all the other DIMMs seem to be fine.
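
For anyone who wants the same tally per slot in one pass instead of several greps, here is a rough Python sketch. The function name `count_by_slot` is made up, and it assumes the keyword-mode line format shown in this thread (keyword as the last field, 16-digit data field right before it, slot in hex digits 10-11 of the data field).

```python
from collections import Counter


def count_by_slot(lines, keyword):
    """Count events with the given keyword, grouped by DIMM slot."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Event lines end with the keyword; timestamp lines and
        # other events are skipped.
        if len(fields) < 2 or fields[-1] != keyword:
            continue
        data = fields[-2]
        if len(data) != 16:
            continue
        counts[data[10:12]] += 1  # assumed slot position in the data field
    return counts


sample = [
    "322   SFW  0   2  0x448000A700E01DE0 FFFFFFFF002BFF74 MEM_CORR_ERR",
    "                                    14 Aug 2013 11:43:56",
    "324   SFW  0   2  0x448000A700E01E10 FFFFFFFF002BFF74 MEM_CORR_ERR",
    "364   SFW  0  *3  0x64800FA000E021A0 FFFFFFFF003BFF74 MEM_CHIPSPARE_DEALLOC_RANK",
    "365   SFW  0  *3  0x64800FA000E021C0 FFFFFFFF001BFF74 MEM_CHIPSPARE_DEALLOC_RANK",
]

print(count_by_slot(sample, "MEM_CORR_ERR"))
print(count_by_slot(sample, "MEM_CHIPSPARE_DEALLOC_RANK"))
```

To run it against a saved log, something like `count_by_slot(open("event.log"), "MEM_CORR_ERR")` should work, since a file object yields lines when iterated.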

 

Again, thanks for your help :)