HPE 9000 and HPE e3000 Servers


 
Lubo
Advisor

K380 HPUX 11.11

Oct 23, 2003 08:26:52 GMT

--------------------------------------------------------------------------------
Hello,
the problem is this EMS notification:
/system/events/memory/49 is >= 3.
Its current value is SERIOUS(4).

Event data from monitor:

Event Time..........: Thu Oct 23 09:11:34 2003
Severity............: SERIOUS
Monitor.............: dm_memory
Event #.............: 3200


Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.
Description of Error:

The memory component:

Cab/Cell or Node: 0
MC/EXT: 0
DIMM: 3a

Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1000
Received within...: 2 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800
EMS Version.....................: A.03.20
STM Version.....................: A.35.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_memory.htm#3200

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v



Component Data:
Physical Device Path....: 49
Tag 2...................: 20


I had this problem on DIMM 3a two months ago and installed a new DIMM. A month later I had a problem on DIMM 1b, and I swapped it with 1a. Seven weeks after that I had a problem with DIMM 0b and replaced the memory carrier.
Now, as you can see, I again have a problem with DIMM 3a.
What should I do next?
Thanks for your help.
Stefan Stechemesser
Honored Contributor

Re: K380 HPUX 11.11

Hi,

Normally there is no need to replace DIMMs because of single bit errors (SBE). EMS makes sure that an entry is made in the Page Deallocation Table (PDT), and after the next reboot no more errors will occur at this address. A DIMM only has to be replaced if massive numbers of SBEs occur at several addresses on one DIMM (and keep occurring after a reboot). SBEs are corrected by ECC without any impact on performance.
Lubo
Advisor

Re: K380 HPUX 11.11

But it happens 2 times per day. Is that not a reason to replace it or to do something?
Thanks
Stefan Stechemesser
Honored Contributor

Re: K380 HPUX 11.11

Single bit errors (SBE) are corrected AFTER the data has been read from the memory DIMM. This way, both bit flips in the DIMM and transfer errors (e.g. a bad IC connection) are corrected. It also means that if a bit is flipped (and there are several reasons why this can happen, not necessarily a bad DIMM), the bad byte stays in memory until it is overwritten with new data. If it happens to be a byte in a static part of memory, e.g. kernel memory or code, you get an SBE on every access to the bad byte until the next reboot.
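
To illustrate the point, here is a small Python sketch of "correct on read, no write-back". It uses a textbook Hamming(7,4) code purely as a stand-in; the real ECC scheme of the K-class memory is not described in this thread, and the names `encode`, `read` and `syndrome` are made up for the example.

```python
# Illustrative only: Hamming(7,4) stands in for the real (unspecified) ECC.
DATA_POSITIONS = (3, 5, 6, 7)    # 1-based bit positions holding data
PARITY_POSITIONS = (1, 2, 4)     # 1-based bit positions holding parity


def syndrome(word):
    """XOR of the positions of all set bits; 0 means a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if word[pos]:
            s ^= pos
    return s


def encode(nibble):
    """Store a 4-bit value as a 7-bit codeword (index 0 is unused)."""
    word = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> i) & 1
    s = syndrome(word)
    for p in PARITY_POSITIONS:
        word[p] = 1 if s & p else 0
    return word


def read(word):
    """Return (value, was_corrected). The stored word is NOT modified."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        fixed[s] ^= 1            # a non-zero syndrome is the bad bit position
    value = sum(fixed[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return value, s != 0


if __name__ == "__main__":
    cell = encode(0b1011)
    cell[6] ^= 1                 # a single bit flips in "static" memory
    events = sum(read(cell)[1] for _ in range(1000))
    print("reads returning correct data but logging an SBE:", events)
    cell = encode(0b1011)        # overwriting the location clears the flip
    print("SBE after overwrite:", read(cell)[1])
```

Every read returns the right value, yet every read also counts as an SBE until the location is rewritten, which is the behaviour described above for static kernel memory.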

How can you determine if it is such an uncritical error or a hardware problem ?

If you have a hardware problem, you should see errors at many different addresses (use the information tool on memory from STM or, much better, logtool => memory log).

If you see thousands of errors on the same address, it may be that the errors do not happen again after a reboot, because it was simply a bit flip.
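
As a rough sketch of that rule of thumb, the following Python groups SBE records by DIMM and counts distinct addresses. The input format (a list of `(dimm, address)` pairs), the function name `summarize_sbe` and the threshold of 4 distinct addresses are all assumptions for illustration; they are not the output format of STM or logtool, and you would have to extract the pairs from the memory log yourself.

```python
from collections import defaultdict


def summarize_sbe(events, distinct_threshold=4):
    """Summarize SBE events per DIMM.

    events: iterable of (dimm, address) pairs (hypothetical format).
    Returns {dimm: (total_events, distinct_addresses, verdict)}.
    """
    per_dimm = defaultdict(lambda: defaultdict(int))
    for dimm, address in events:
        per_dimm[dimm][address] += 1

    summary = {}
    for dimm, counts in per_dimm.items():
        total = sum(counts.values())
        distinct = len(counts)
        if distinct >= distinct_threshold:
            # Errors spread over many addresses point at the hardware.
            verdict = "suspect hardware"
        else:
            # Many hits on one address may be a single flipped bit.
            verdict = "recheck after reboot"
        summary[dimm] = (total, distinct, verdict)
    return summary


if __name__ == "__main__":
    # Made-up data: 1000 hits on one address vs. 8 hits on 8 addresses.
    sample = [("3a", 0x1F400)] * 1000
    sample += [("1b", 0x2000 + 0x40 * i) for i in range(8)]
    for dimm, info in sorted(summarize_sbe(sample).items()):
        print(dimm, info)
```

The event count alone says little: 1000 events on one address is less worrying than a handful of events spread over many addresses, provided the errors stop after a reboot.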

I hope this clarifies why EMS may inform you several times about SBE errors until you reboot the system.
Jeff Schussele
Honored Contributor

Re: K380 HPUX 11.11

Hi Lubo,

There's a known issue with Ks, Ts & Vs & the version of Support Tools Manager.
11i patches have been issued:

PHSS_29343 for the Mar 2003 STM
and
PHSS_29344 for the June 2003 STM

You'll know you have this problem if EMS keeps reporting that these memory addresses are in locked memory, when in fact they are not. This is due to a miscalculated memory address.

HTH,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Lubo
Advisor

Re: K380 HPUX 11.11

So if it is a bug in STM, is it OK that it is in the PDT in PDC?
Jeff Schussele
Honored Contributor

Re: K380 HPUX 11.11

Well, I wouldn't say it's "OK".
What's happening - if you have the bug - is that normally the page would be deallocated, but since Diags is incorrectly decoding the address to a locked area, it cannot deallocate the page until next reboot. So you keep getting & getting & getting the message when it should have been deallocated on the fly long ago.
I've seen this more on the V-class, but the problem can exist on Ks & Ts as well.

To be safe, I'd place a call to the RC & log a HW call on this. They'll be able to determine whether you have the bug or not. Remember that you could also have a memory location that is indeed in locked memory, & such pages *cannot* be deallocated on the fly - only at boot time.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!