
Shivkumar
Super Advisor

EMS; single bit errors

Dear Sirs,

We are using the EMS hardware memory monitors on HP-UX.

We are getting the alert shown below:


>------------ Event Monitoring Service Event Notification ------------<


/system/events/memory/192 is >= 1.
Its current value is CRITICAL(5).




Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cab/Cell or Node: 0
MC/EXT: 0
DIMM: 0b

is experiencing an excessive number of single bit errors.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it is strongly advisable
to closely monitor the situation. This condition indicates a potential
problem. Contact your HP support representative to check the memory boards.


Additional System Data:
System Model Number.............: 9000/800/L3000-8x
EMS Version.....................: A.04.00
STM Version.....................: A.38.00



Someone suggested that rebooting the system might make this error go away.

Can anyone suggest another option?

Thanks,
Shiv
Raj D.
Honored Contributor
Solution

Re: EMS; single bit errors

Hi Shivkumar ,

Are you getting this "Single bit error (SBE) event" frequently in syslog.log? The error relates to physical memory.

If it has happened only once, you can ignore it and keep monitoring.

Cheers,
Raj.
" If u think u can , If u think u cannot , - You are always Right . "
Shivkumar
Super Advisor

Re: EMS; single bit errors

Raj;

Yes, I am getting this error frequently. Some folks are suggesting rebooting the server; I am considering that as a last option.

Thanks,
Shiv
morganelan
Trusted Contributor

Re: EMS; single bit errors

Raj D.
Honored Contributor

Re: EMS; single bit errors

Hi Shiv ,

It seems the firmware or a patch needs to be updated, along with STM.

Check out the link above for details.

hth
Raj.
" If u think u can , If u think u cannot , - You are always Right . "
Ranjith_5
Honored Contributor

Re: EMS; single bit errors

Shiva,

If you are continuously getting the SBE, then try replacing the DIMM. Don't take any chances if this is your production machine.

Regards,
Syam
Ranjith_5
Honored Contributor

Re: EMS; single bit errors

Hi Shiva,

The link given by morganelan explains a known issue with V-Class servers and STM. Since yours is an L3000, I think that isn't the issue here and there is a real problem with a memory module in your machine. Anyway, there is nothing to lose in observing the problem after replacing the module.

Regards,
Syam
Devender Khatana
Honored Contributor

Re: EMS; single bit errors

Hi,

The error is related to memory, and rebooting will not change the behaviour. The frequency of the error is the key to deciding whether to do something or not. If the error occurs more than a few times a day and is always on one particular DIMM, then replace it.

Closely monitor all EMS alerts to see on which DIMMs the errors are reported.
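
To put a number on the frequency, one option is to count how many SBE notifications name each DIMM. This is only a sketch: it assumes the notifications have been saved as text in the format quoted at the top of this thread (on many systems EMS appends them to /var/opt/resmon/log/event.log, but check your own monitor configuration), and the function name count_sbe_by_dimm is made up for illustration.

```shell
# Count SBE notifications per DIMM slot.
# Reads saved EMS notification text on stdin and prints "slot count" lines.
count_sbe_by_dimm() {
    awk '
        /Single bit error/ { sbe = 1 }        # we are inside an SBE notification
        sbe && $1 == "DIMM:" {                # the "DIMM: 0b" line of that notification
            count[$2]++
            sbe = 0                           # wait for the next SBE summary
        }
        END { for (slot in count) print slot, count[slot] }
    ' | sort
}
```

For example, `count_sbe_by_dimm < /var/opt/resmon/log/event.log` prints one line per slot. If one slot accounts for nearly all of the events, that is the module to watch.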

HTH,
Devender
Impossible itself mentions "I m possible"
Shivkumar
Super Advisor

Re: EMS; single bit errors

Our server model is "9000/800/L3000-8x". Not sure which class of server this is.
Raj D.
Honored Contributor

Re: EMS; single bit errors

It's an RP series server.
You can log a support call to find out whether a DIMM really needs to be replaced for any associated error.

You can run this command to find out the DIMM layout details:

# echo "selclass qualifier memory;info;wait;infolog" |/usr/sbin/cstm



Cheers,
Raj.
" If u think u can , If u think u cannot , - You are always Right . "
Ranjith_5
Honored Contributor

Re: EMS; single bit errors

Hi Shiva,

Your server is an L3000; your model command output already says so. By the way, the module reporting the SBEs is in slot 0b according to EMS. You may need to replace it to confirm the problem. You can get more information about your DIMMs with the following cstm command, which gives the memory board inventory along with the memory error log summary, if there is one.

# echo "selclass qualifier memory;info;wait;infolog" | cstm


A typical output of this command is as follows.

Basic Memory Description

Module Type: MEMORY
Total Configured Memory : 4096 MB
Page Size: 4096 Bytes

Memory interleaving is supported on this machine and is ON.

Memory Board Inventory

DIMM Slot   Size (MB)
---------   ---------
0a          512
2a          512
1a          512
3a          512
0b          512
2b          512
1b          512
3b          512
---------   ---------
System Total (MB): 4096

Memory Error Log Summary

The memory error log is empty.
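
If you save that output to a file, a few small helpers can check it for you. This is a rough sketch that assumes the layout shown above (a slot name such as 0b followed by a size in MB, and the literal sentence "The memory error log is empty."); the function names are hypothetical, not part of STM.

```shell
# Print "slot size_mb" for each row of the Memory Board Inventory table.
# Reads saved cstm output on stdin.
list_dimms() {
    awk 'NF == 2 && $1 ~ /^[0-9]+[a-z]$/ && $2 ~ /^[0-9]+$/ { print $1, $2 }'
}

# Sum the DIMM sizes so the result can be compared with "System Total (MB)".
total_dimm_mb() {
    list_dimms | awk '{ sum += $2 } END { print sum + 0 }'
}

# Exit status 0 when the log summary reports no logged memory errors.
memory_log_is_empty() {
    grep "The memory error log is empty" >/dev/null
}
```

For example, after redirecting the cstm command's output to /tmp/mem.out, `total_dimm_mb < /tmp/mem.out` should match the System Total line, and `memory_log_is_empty < /tmp/mem.out || echo "errors logged"` flags a non-empty log.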



Regards,
Syam

Devender Khatana
Honored Contributor

Re: EMS; single bit errors

Hi,

Replacing the module in slot 0b will resolve the issue if the error always repeats for this module only. Still, note how often it occurs; if it is infrequent, the part does not need to be replaced.

Also, if the part needs to be replaced and a spare is not available, the system can be run with less memory by removing a pair of modules.

HTH,
Devender
Impossible itself mentions "I m possible"
morganelan
Trusted Contributor

Re: EMS; single bit errors

Single-bit memory errors are handled exclusively by memlogd. This allows the system to remove lockable pages that experience repeated single-bit memory errors. At boot time, the system uses the Page Deallocation Table to remove these pages dynamically from the kernel's list of free pages.

Kamal Mirdad
Shivkumar
Super Advisor

Re: EMS; single bit errors

In the memory report I saw the lines below:

Preparing the Information Tool Log for MEMORY on path 192 File ...

.... htx655.cce.hp.com : 16.110.74.16 ....

-- Information Tool Log for MEMORY on path 192 --

Log creation time: Sat Sep 3 23:38:38 2005

Hardware path: 192


What does hardware path 192 stand for?

Thanks,
Shiv
Raj D.
Honored Contributor

Re: EMS; single bit errors

Hi Shiv ,

Hardware path 192 is the physical path of the memory location where the DIMM modules are connected.

If you run # ioscan -fn, you can see the same location where the memory modules are connected.

You will see a line like this:

memory 0 192 memory CLAIMED MEMORY Memory

cstm also shows the same.
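
If you only want the memory path out of a long ioscan listing, the output can be filtered. This is a small sketch that assumes the column order of the line quoted above (class, instance, H/W path, driver, S/W state, ...); memory_hw_paths is a hypothetical helper name.

```shell
# Print the hardware path of every claimed device in class "memory".
# Reads "ioscan -f" style output on stdin.
memory_hw_paths() {
    awk '$1 == "memory" && $5 == "CLAIMED" { print $3 }'
}
```

Usage would be along the lines of `ioscan -fn | memory_hw_paths`, which on this system should print 192.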

Cheers ,
Raj.
" If u think u can , If u think u cannot , - You are always Right . "
Devender Khatana
Honored Contributor

Re: EMS; single bit errors

Hi,

It corresponds to the hardware path of the physical location of the memory. It is mapped to a device number in cstm through the "map all" command. The complete output would look similar to this.

host1:/>/home/hp>>cstm
Running Command File (/usr/sbin/stm/ui/config/.stmrc).

-- Information --
Support Tools Manager


Version A.45.00

Product Number B4708AA

(C) Copyright Hewlett Packard Co. 1995-2002
All Rights Reserved

Use of this program is subject to the licensing restrictions described
in "Help-->On Version". HP shall not be liable for any damages resulting
from misuse or unauthorized use of this program.

cstm>map all


Dev                                                 Last        Last Op
Num  Path                 Product                   Active Tool Status
==== ==================== ========================= =========== =============
*  1 system               system ()                 Information Successful
*  2 0                    Bus Adapter (803)         Information Successful
*  3 0/0                  PCI Bus Adapter (782)     Information Successful
*  4 0/0/0/0              Core PCI 100BT Interface  Information Successful
*  5 0/0/1/0              PCI SCSI Interface (10000 Information Successful
*  6 0/0/1/0.3.0          SCSI Tape (QUANTUMDLT8000 Information Incomplete
*  7 0/0/1/1              PCI SCSI Interface (10000 Information Successful
*  8 0/0/1/1.2.0          SCSI Disk (HP36.4GST33675 Information Successful
*  9 0/0/2/0              PCI SCSI Interface (10000 Information Successful
* 10 0/0/2/0.2.0          SCSI Disk (HP36.4GST33675 Information Successful
* 11 0/0/2/1              PCI SCSI Interface (10000 Information Successful
* 12 0/0/2/1.2.0          Optical Storage Device (H Information Warning
* 13 0/0/4/1              RS-232 Interface (103c104 Information Successful
* 14 0/1                  PCI Bus Adapter (782)     Information Successful
* 15 0/2                  PCI Bus Adapter (782)     Information Successful
* 16 0/3                  PCI Bus Adapter (782)     Information Successful
* 17 0/4                  PCI Bus Adapter (782)     Information Successful
* 18 0/4/0/0              Fibre Channel Interface ( Information Successful
* 19 0/4/0/0.8            Fibre Channel Driver (Mas
* 20 0/4/0/0.8.0.2.0.0    XP Array (HPOPEN-E)
* 21 0/4/0/0.8.0.2.0.1    XP Array (HPOPEN-E)
* 22 0/4/0/0.8.0.2.0.2    XP Array (HPOPEN-E)
* 23 0/4/0/0.8.0.2.0.3    XP Array (HPDISK-SUBSYSTE
* 24 0/4/0/0.8.0.2.0.4    XP Array (HPDISK-SUBSYSTE
* 25 0/4/0/0.8.0.2.0.5    XP Array (HPDISK-SUBSYSTE
* 26 0/4/0/0.8.0.2.0.6    XP Array (HPDISK-SUBSYSTE
* 27 0/4/0/0.8.0.2.0.7    XP Array (HPDISK-SUBSYSTE
* 28 0/4/0/0.8.0.2.0.8    XP Array (HPDISK-SUBSYSTE
* 29 0/4/0/0.8.0.2.0.9    XP Array (HPDISK-SUBSYSTE
* 30 0/4/0/0.8.0.2.0.10   XP Array (HPDISK-SUBSYSTE
* 31 0/4/0/0.8.0.2.0.11   XP Array (HPDISK-SUBSYSTE
* 32 0/4/0/0.8.0.2.0.12   XP Array (HPDISK-SUBSYSTE
* 33 0/4/0/0.8.0.2.0.13   XP Array (HPDISK-SUBSYSTE
* 34 0/4/0/0.8.0.2.0.14   XP Array (HPDISK-SUBSYSTE
* 35 0/4/0/0.8.0.2.0.15   XP Array (HPDISK-SUBSYSTE
* 36 0/4/0/0.8.0.2.1.0    XP Array (HPDISK-SUBSYSTE
* 37 0/4/0/0.8.0.2.1.1    XP Array (HPDISK-SUBSYSTE
* 38 0/4/0/0.8.0.2.1.2    XP Array (HPDISK-SUBSYSTE
* 39 0/4/0/0.8.0.2.1.3    XP Array (HPDISK-SUBSYSTE
* 40 0/4/0/0.8.0.2.1.4    XP Array (HPDISK-SUBSYSTE
* 41 0/4/0/0.8.0.2.1.5    XP Array (HPDISK-SUBSYSTE
* 42 0/4/0/0.8.0.2.1.6    XP Array (HPDISK-SUBSYSTE
* 43 0/4/0/0.8.0.2.1.7    XP Array (HPDISK-SUBSYSTE
* 44 0/4/0/0.8.0.2.1.8    XP Array (HPDISK-SUBSYSTE
* 45 0/4/0/0.8.0.2.1.9    XP Array (HPDISK-SUBSYSTE
* 46 0/4/0/0.8.0.2.1.10   XP Array (HPDISK-SUBSYSTE
* 47 0/4/0/0.8.0.2.1.11   XP Array (HPDISK-SUBSYSTE
* 48 0/4/0/0.8.0.2.1.12   XP Array (HPDISK-SUBSYSTE
* 49 0/4/0/0.8.0.2.1.13   XP Array (HPDISK-SUBSYSTE
* 50 0/4/0/0.8.0.2.1.14   XP Array (HPDISK-SUBSYSTE
* 51 0/4/0/0.8.0.2.1.15   XP Array (HPDISK-SUBSYSTE
* 52 0/4/0/0.8.0.255.0.2  XP Array (HPOPEN-XP512)
* 53 0/5                  PCI Bus Adapter (782)     Information Successful
* 54 0/8                  PCI Bus Adapter (782)     Information Successful
* 55 0/8/0/0              PCI 100 BaseT LAN Interfa Information Successful
* 56 0/9                  PCI Bus Adapter (782)     Information Successful
* 57 0/9/0/0              PCI 100 BaseT LAN Interfa Information Successful
* 58 0/10                 PCI Bus Adapter (782)     Information Successful
* 59 0/10/0/0             PCI Gigabit Ethernet Link Information Successful
* 60 0/12                 PCI Bus Adapter (782)     Information Successful
* 61 0/12/0/0             PCI Gigabit Ethernet Link Information Successful
* 62 33                   CPU (5df)                 Information Successful
* 63 97                   CPU (5df)                 Information Successful
* 64 192                  MEMORY (90)               Information Successful
cstm>
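
To look up the device number for a given hardware path without reading the whole map, the saved "map all" output can be filtered. This is a minimal sketch, assuming each row has the form shown above (an optional "*" selection mark, then device number, then path); dev_for_path is a hypothetical helper, not a cstm command.

```shell
# Print the cstm device number that maps to the given hardware path.
# Usage: dev_for_path PATH < saved_map_output
dev_for_path() {
    awk -v path="$1" '
        { sub(/^[ \t]*\*?[ \t]*/, "") }             # drop the optional "*" mark
        $1 ~ /^[0-9]+$/ && $2 == path { print $1 }  # first field is Dev Num, second is Path
    '
}
```

With the map above saved to a file, `dev_for_path 192 < map.out` would print 64, the MEMORY entry.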

HTH,
Devender
Impossible itself mentions "I m possible"
Mahesh Kumar Malik
Honored Contributor

Re: EMS; single bit errors

Hi Shiv

1. You have to keep monitoring this error. Although a single-bit error is correctable, it is never a desired state. You should plan replacement of DIMM 0b ASAP.

2. Rebooting the server will not help

Regards
Mahesh
Andrew Merritt_2
Honored Contributor

Re: EMS; single bit errors

Hi Shiv,
As the severity level of the event is Critical, that implies that the number of errors is very high. What's the actual event number that is being reported: 4200 or 4500?
Were the previous events of the same event number, or of lesser severity? (4200 indicates 120 occurrences in 24 hours, 4500 means 200 in 7 days.)

What does the STM Memory Info tool show? What entries does it show in the PDT?

I'd also suggest upgrading the OnlineDiags to the latest version, as A.38.00 is fairly old. The current version for HP-UX 11.11 is A.47.00, plus PHSS_32924. There have been some changes and fixes regarding the reporting of SBEs by dm_memory since A.38.00. One change is that if it's exactly the same address that's at fault each time, rather than different addresses on the same DIMM, the error is normally ignored, since it's probably a 'stuck at' error. For reasons I don't fully understand, this doesn't warrant replacing. The detection of whether the SBEs are at the same or different addresses on a single DIMM has been improved since A.38.00.

If it is a 4200 or 4500 event, the number of SBEs is a cause for concern, and the indicated DIMM should probably be replaced.
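
As a rough illustration of those two thresholds, the check can be written down like this. It is only a sketch: treating the limits as "at least 120 in 24 hours" and "at least 200 in 7 days" is my reading of the numbers above, not the monitor's actual logic, and sbe_events_for is a made-up name.

```shell
# Print the event numbers whose thresholds are reached.
# $1 = SBE count over the last 24 hours, $2 = SBE count over the last 7 days.
sbe_events_for() {
    per_24h=$1
    per_7d=$2
    [ "$per_24h" -ge 120 ] && echo 4200   # assumed: 120 or more in 24 hours
    [ "$per_7d" -ge 200 ] && echo 4500    # assumed: 200 or more in 7 days
    return 0
}
```

For instance, `sbe_events_for 130 150` prints 4200, while `sbe_events_for 5 30` prints nothing.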

Andrew

Ranjith_5
Honored Contributor

Re: EMS; single bit errors

Shiva,

Sorry for my late reply. I was away for the last 2 days and couldn't log on to ITRC. There is no need to get confused in your case. The error is confirmed to be an SBE. You can go ahead and try replacing the memory module.


Regards,
Syam