Problem with the RAM

 
Andrikopoulos
Occasional Contributor

Problem with the RAM

Hi all,

I have an L1000 server running HP-UX 11.11, and I faced really big problems with it in the past. I came to the conclusion that there was a memory problem, and at the beginning of May I replaced the memory. There are fewer problems now, but I got the log output below. I think the memory in slot 2a/b should be replaced. Could you please give me any further suggestions? Some of the log file output follows.
Thank you.

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Tue May 30 08:25:13 2006

sathes4 sent Event Monitor notification information:

/system/events/memory/8 is >= 1.
Its current value is MAJORWARNING(3).



Event data from monitor:

Event Time..........: Tue May 30 08:25:13 2006
Severity............: MAJORWARNING
Monitor.............: dm_memory
Event #.............: 4000


Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cab/Cell or Node: 0
MC/EXT: 0
DIMM: 2b
Serial Number: N/A
Part Number: N/A

is experiencing correctable single bit errors (SBE) on a single
component.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it may be advisable to
monitor the situation. If an excessive rate of single bit errors occur, an
event with higher severity will be generated.

Additional Event Data:
System IP Address...: 172.30.104.27
Event Id............: 0x447bd73900000000
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_memory.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 20
Received within...: 1 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/L1000-36
EMS Version.....................: A.04.00
STM Version.....................: A.45.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_memory.htm#4000

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v



Component Data:
Physical Device Path....: 8
Tag 2...................: 20


>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Tue May 30 09:22:05 2006

sathes4 sent Event Monitor notification information:

/system/events/memory/8 is >= 1.
Its current value is SERIOUS(4).



Event data from monitor:

Event Time..........: Tue May 30 09:22:05 2006
Severity............: SERIOUS
Monitor.............: dm_memory
Event #.............: 4100


Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cab/Cell or Node: 0
MC/EXT: 0
DIMM: 2b
Serial Number: N/A
Part Number: N/A

is experiencing a high rate of correctable single bit errors on a
single component.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it is advisable to
closely monitor the situation. If an excessive rate of single bit errors
occur, an event with higher severity will be generated.

Additional Event Data:
System IP Address...: 172.30.104.27
Event Id............: 0x447be48d00000000
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_memory.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 50
Received within...: 1 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/L1000-36
EMS Version.....................: A.04.00
STM Version.....................: A.45.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_memory.htm#4100

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v




>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Tue May 30 09:39:07 2006

sathes4 sent Event Monitor notification information:

/system/events/memory/8 is >= 1.
Its current value is MAJORWARNING(3).



Event data from monitor:

Event Time..........: Tue May 30 09:39:07 2006
Severity............: MAJORWARNING
Monitor.............: dm_memory
Event #.............: 4000


Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cab/Cell or Node: 0
MC/EXT: 0
DIMM: 2a/b
Serial Number: N/A
Part Number: N/A

is experiencing correctable single bit errors (SBE) on a single
component.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it may be advisable to
monitor the situation. If an excessive rate of single bit errors occur, an
event with higher severity will be generated.

Additional Event Data:
System IP Address...: 172.30.104.27
Event Id............: 0x447be88b00000000
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_memory.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 20
Received within...: 1 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/L1000-36
EMS Version.....................: A.04.00
STM Version.....................: A.45.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_memory.htm#4000

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v




>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Tue May 30 09:46:07 2006

sathes4 sent Event Monitor notification information:

/system/events/memory/8 is >= 1.
Its current value is MAJORWARNING(3).



Event data from monitor:

Event Time..........: Tue May 30 09:46:07 2006
Severity............: MAJORWARNING
Monitor.............: dm_memory
Event #.............: 4300


Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cab/Cell or Node: 0
MC/EXT: 0
DIMM: 2b
Serial Number: N/A
Part Number: N/A

is experiencing correctable single bit errors (SBE) on a single
component.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it may be advisable to
monitor the situation. If an excessive rate of single bit errors occur, an
event with higher severity will be generated.

Additional Event Data:
System IP Address...: 172.30.104.27
Event Id............: 0x447bea2f00000000
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_memory.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 70
Received within...: 7 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/L1000-36
EMS Version.....................: A.04.00
STM Version.....................: A.45.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_memory.htm#4300

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v




>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Tue May 30 10:23:10 2006

sathes4 sent Event Monitor notification information:

/system/events/memory/8 is >= 1.
Its current value is SERIOUS(4).



Event data from monitor:

Event Time..........: Tue May 30 10:23:10 2006
Severity............: SERIOUS
Monitor.............: dm_memory
Event #.............: 4400


Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cab/Cell or Node: 0
MC/EXT: 0
DIMM: 2b
Serial Number: N/A
Part Number: N/A

is experiencing a high rate of correctable single bit errors on a
single component.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it is advisable to
closely monitor the situation. If an excessive rate of single bit errors
occur, an event with higher severity will be generated.

Additional Event Data:
System IP Address...: 172.30.104.27
Event Id............: 0x447bf2de00000000
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_memory.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 100
Received within...: 7 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/L1000-36
EMS Version.....................: A.04.00
STM Version.....................: A.45.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_memory.htm#4400

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v




>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Tue May 30 10:54:13 2006

sathes4 sent Event Monitor notification information:

/system/events/memory/8 is >= 1.
Its current value is CRITICAL(5).



Event data from monitor:

Event Time..........: Tue May 30 10:54:13 2006
Severity............: CRITICAL
Monitor.............: dm_memory
Event #.............: 4200


Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cab/Cell or Node: 0
MC/EXT: 0
DIMM: 2b
Serial Number: N/A
Part Number: N/A

is experiencing an excessive rate of single bit errors on a single
component.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it is strongly advisable
to closely monitor the situation. This condition indicates a potential
problem. Contact your HP support representative to check the memory boards.

Additional Event Data:
System IP Address...: 172.30.104.27
Event Id............: 0x447bfa2500000000
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_memory.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 120
Received within...: 1 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/L1000-36
EMS Version.....................: A.04.00
STM Version.....................: A.45.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_memory.htm#4200
6 REPLIES
Chauhan Amit
Respected Contributor

Re: Problem with the RAM

Hello,

I would suggest replacing DIMMs 2a/2b immediately.
Check the PDT (Page Deallocation Table) entries as well.

-Amit
If you are not a part of solution , then you are a part of problem
Albert_31
Trusted Contributor

Re: Problem with the RAM

Hi,

Single-bit errors don't necessarily mean a faulty DIMM, so it is best to collect the cstm info output and post it for us to see:

# echo "map;selall;info;wait;infolog" | cstm > /filename

Andrew Merritt_2
Honored Contributor
Solution

Re: Problem with the RAM

Hi,
Yes, those EMS events are telling you that you have a real problem with those DIMMs.

Albert is right that a small number of SBEs is acceptable, which is exactly why the OnlineDiags are present; they monitor the hardware and send appropriate EMS events when certain thresholds are reached. In this case, the 4200 event was generated when over 120 SBEs on the DIMM were recorded in 24 hours.

The documentation page shows the thresholds for the different events: http://docs.hp.com/en/diag/ems/dm_memory.htm
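To see at a glance which event number fired at which count, the notifications in the original post can be condensed with a short script. This is only a minimal Python sketch, assuming the notification text has been saved to a file and read into a string; the function name `summarize_ems` and the choice of fields are illustrative and are not part of EMS or STM.

```python
import re

# Labels match the dm_memory notifications quoted in this thread.
# The dictionary keys and the function name are hypothetical.
FIELDS = {
    "event": re.compile(r"^[ \t]*Event #\.*:\s*(\d+)", re.M),
    "severity": re.compile(r"^[ \t]*Severity\.*:\s*(\w+)", re.M),
    "dimm": re.compile(r"^[ \t]*DIMM:\s*(\S+)", re.M),
    "count": re.compile(r"^[ \t]*Number of events\.*:\s*(\d+)", re.M),
    "days": re.compile(r"^[ \t]*Received within\.*:\s*(\d+) day", re.M),
}

END_MARKER = "End Event Monitoring Service Event Notification"


def summarize_ems(text):
    """Return one dict per notification that carries an event number."""
    rows = []
    for block in text.split(END_MARKER):
        row = {}
        for name, pattern in FIELDS.items():
            match = pattern.search(block)
            if match:
                row[name] = match.group(1)
        if "event" in row:
            rows.append(row)
    return rows
```

Applied to the six notifications posted above, this would list events 4000 (20 events in 1 day), 4100 (50 in 1 day), 4300 (70 in 7 days), 4400 (100 in 7 days) and 4200 (120 in 1 day). Those counts are simply what the notifications report; the documentation page remains the reference for the actual thresholds.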

I would also recommend that you upgrade to a current, supported version of the OnlineDiags. You have the A.45.00 version, which was the June 2004 release. See http://www.docs.hp.com/en/diag/stm/stm_upd.htm#table for the supported versions.

The link to download the latest OnlineDiags is:

http://www.software.hp.com/portal/swdepot/displayProductInfo.do?productNumber=B6191AAE

You can also go to http://www.software.hp.com and then type "B6191AAE" in the search box.

http://docs.hp.com/en/diag/stm/stm_ptch.htm shows the latest patches. A.49 (HWE0509) is the latest version of OnlineDiags for 11.11, and the latest patch for that (to be applied after you upgrade to A.49) is PHSS_34288.

Andrew
Josiah Henline
Valued Contributor

Re: Problem with the RAM

That is not enough information to determine whether the memory needs to be replaced.

Run the following and paste the output:
cstm
sel pa 8
info
wai
il
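
The same sequence can be prepared as a script and fed to cstm on standard input, which makes it easier to capture the output for posting. The following is a rough Python sketch under several assumptions: that cstm accepts one command per line on stdin, that `cstm` is on the PATH, and that a Python 3 interpreter is available on the host. The helper names `build_cstm_script` and `run_cstm` are hypothetical; path 8 is the "Physical Device Path" reported in the first notification.

```python
import subprocess


def build_cstm_script(path="8"):
    """Return the cstm commands from the post, spelled out in full."""
    commands = [
        f"sel path {path}",  # "sel pa 8" in the post
        "info",              # run the information tool on the selection
        "wait",              # "wai": wait for the tool to finish
        "infolog",           # "il": display the information log
    ]
    return "\n".join(commands) + "\n"


def run_cstm(path="8"):
    """Pipe the script into cstm and return its output (needs HP-UX with STM)."""
    result = subprocess.run(
        ["cstm"],
        input=build_cstm_script(path),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the script here; call run_cstm() on the server itself.
    print(build_cstm_script())
```

On a system where memory sits at a different hardware path, pass that path instead of 8.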

If at first you don't succeed, read the man page.
rmueller58
Valued Contributor

Re: Problem with the RAM

When I had single-bit errors, I went into CSTM and ran

MAP

then sel dev # (for the corresponding memory device):


cstm>map
esuunix1

Dev                                                Last        Last Op
Num Path                 Product                   Active Tool Status
=== ==================== ========================= =========== =============
  1 system               system ()                 Information Successful
  2 0                    Bus Adapter (803)
  3 0/0                  PCI Bus Adapter (782)
  4 0/0/0/0              Core PCI 100BT Interface
  5 0/0/1/0              PCI SCSI Interface (10000 Information Successful
  6 0/0/1/0.0.0          SCSI Tape (HPC1537A)
  7 0/0/1/0.2.0          SCSI Disk (HPDVD-ROM)
  8 0/0/2/0              PCI SCSI Interface (10000
  9 0/0/2/0.6.0          SCSI Disk (SEAGATEST39204
 10 0/0/2/1              PCI SCSI Interface (10000
 11 0/0/2/1.6.0          SCSI Disk (SEAGATEST39204
 12 0/0/4/0              RS-232 Interface (103c104
 13 0/0/5/0              RS-232 Interface (103c104
 14 0/1                  PCI Bus Adapter (782)
 15 0/2                  PCI Bus Adapter (782)
 16 0/4                  PCI Bus Adapter (782)
 17 0/4/0/0              PCI Terminal Multiplexor
 18 0/5                  PCI Bus Adapter (782)
 19 0/5/0/0              PCI SCSI Interface (10000
 20 0/5/0/0.3.0          SCSI Tape (QUANTUMDLT8000 Information Successful
 21 0/8                  PCI Bus Adapter (782)
 22 0/8/0/0              PCI Gigabit Ethernet Link
 23 0/10                 PCI Bus Adapter (782)
 24 0/12                 PCI Bus Adapter (782)
 25 0/12/0/0             PCI 100 BaseT LAN Interfa
 26 1                    Bus Adapter (803)
 27 1/0                  PCI Bus Adapter (782)
 28 1/2                  PCI Bus Adapter (782)
 29 1/4                  PCI Bus Adapter (782)
 30 1/4/0/0              PCI Bus Adapter (80860964
 31 1/4/0/1              I2O Interface Adapter (RA
 32 1/4/0/1.0.0.0        SCSI Disk (I2ORAID1)
 33 1/4/0/1.0.0.1        SCSI Disk (I2ORAID1)
 34 1/4/0/1.0.0.2        SCSI Disk (I2ORAID1)
 35 1/4/0/1.0.0.3        SCSI Disk (I2ORAID1)
 36 1/8                  PCI Bus Adapter (782)
 37 1/10                 PCI Bus Adapter (782)
 38 1/10/0/0             PCI 100 BaseT LAN Interfa
 39 1/12                 PCI Bus Adapter (782)
 40 1/12/0/0             PCI SCSI Interface (10000
 41 37                   CPU (5d3)                 Information Killed
 42 45                   CPU (5d3)
 43 101                  CPU (5d3)
 44 109                  CPU (5d3)
 45 192                  MEMORY (90)               Information Successful

then
sel dev # (memory entry)

info

then il

This will show you which set is bad or in error.

We had to replace a couple of DIMMs. A single-bit error on its own means very little; CSTM will show if one of the slots is dead.
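
Picking the right device number out of a long map listing is easy to get wrong by eye, so here is a small Python sketch that finds the MEMORY row in saved map output. It assumes the listing has the column layout shown above (device number, path, product) and was captured to a string; `find_memory_devices` is a hypothetical helper, not a cstm feature.

```python
def find_memory_devices(map_output):
    """Return (device number, path) pairs for rows whose product is MEMORY."""
    found = []
    for line in map_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric device number followed by the path;
        # header and separator lines fail the isdigit() check.
        if len(fields) >= 3 and fields[0].isdigit() and fields[2] == "MEMORY":
            found.append((int(fields[0]), fields[1]))
    return found


# Two rows copied from the map listing in this post.
sample = """\
 44 109                  CPU (5d3)
 45 192                  MEMORY (90)               Information Successful
"""
print(find_memory_devices(sample))  # prints [(45, '192')]
```

For the listing above this gives device 45 at path 192, so the command would be sel dev 45. On the original poster's L1000 the numbers will differ.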
rmueller58
Valued Contributor

Re: Problem with the RAM

oops

do a MAP before INFO