
Re: EMS Event Notification from /system/events/memory/63 ... CRITICAL

 
GRP_2
Occasional Advisor

EMS Event Notification from /system/events/memory/63 ... CRITICAL

Hello,

On an HP 9000/813 running HP-UX 10.20, we got an error message from EMS in the syslog.log file.

After executing the command "/opt/resmon/bin/resdata -R 50200578 -r /system/events/memory/63 -n 50200733 -a" (found in the syslog.log file), we got this:

>>>>> BEGINNING
CURRENT MONITOR DATA:

Event Time..........: Wed Mar 6 10:06:50 2002
Severity............: SERIOUS
Monitor.............: dm_memory
Event #.............: 4400
System..............: nodename
Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cell/Node: 0
MC/EXT: 0
DIMM: 0A

is experiencing a high rate of correctable single bit errors on a
single component.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it is advisable to
evaluate whether any memory replacement is warranted at this time. If an
excessive rate of single bit errors occur, an event with higher severity
will be generated.

Additional Event Data:
System IP Address...:
Event Id............: 0x3c85dc2a00000000
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_memory.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 100
Received within...: 7 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/813
EMS Version.....................: A.03.20
STM Version.....................: A.24.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_memory.htm#4400

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v



Component Data:
Message in ll_msg (set: 50 msg: 10) did not exist in catalog.
Catalog type is MONITOR_INFO
Catalog version is A.01.00
Module name is dm_memory
Message set number is 50
Message number is 10
Message size is 72
Message parameter 1:
Message parameter size = 12
Message parameter is a literal
Literal text is 63
Message in ll_msg (set: 50 msg: 11) did not exist in catalog.
Catalog type is MONITOR_INFO
Catalog version is A.01.00
Module name is dm_memory
Message set number is 50
Message number is 11
Message size is 72
Message parameter 1:
Message parameter size = 12
Message parameter is a literal
Literal text is 20
<<<<< END

Can you help us? What does this mean, and how can we resolve the problem? We suspect a hardware problem.

Thanks in advance.
Sanjay_6
Honored Contributor
Solution

Re: EMS Event Notification from /system/events/memory/63 ... CRITICAL

Hi,

This could mean a problem with one of the memory modules on your system. Try this:

echo 'selclass qualifier memory;info;wait;infolog' |cstm >/tmp/meminfo.txt

more /tmp/meminfo.txt

The output shows the details of any errors and the module that is producing them. If the errors occur regularly, it would be good to get the module(s) replaced.
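To judge whether the errors recur, one option is to keep dated copies of the report and compare them over time. The following is only a minimal sketch: the `CSTM` and `OUTDIR` variables and the `snapshot_meminfo` function name are illustrative assumptions, not part of STM, and the cstm command string is the one given above.

```shell
#!/bin/sh
# Sketch: save a dated copy of the cstm memory report.
# CSTM and OUTDIR are hypothetical variables introduced for this example.
CSTM=${CSTM:-cstm}
OUTDIR=${OUTDIR:-/tmp}

snapshot_meminfo() {
    # Build a timestamped file name, e.g. /tmp/meminfo.20020306100650.txt
    out="$OUTDIR/meminfo.$(date +%Y%m%d%H%M%S).txt"
    # Same command string as in the post above
    echo 'selclass qualifier memory;info;wait;infolog' | "$CSTM" > "$out"
    # Print the file name so the caller can keep track of it
    echo "$out"
}
```

For example, run `old=$(snapshot_meminfo)` today and `new=$(snapshot_meminfo)` a few days later, then `diff "$old" "$new"` to see whether new error entries have been logged.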

Hope this helps.

Regds
Sandip Ghosh
Honored Contributor

Re: EMS Event Notification from /system/events/memory/63 ... CRITICAL

According to HP, we should not worry about single bit memory errors; they say all HP RAM is single-bit-correctable memory. Anyway, you can go to stm --> select memory --> go to tools --> go to information --> run.
You can send this output to HP for their views.

Sandip
Good Luck!!!
Martin Burnett_2
Trusted Contributor

Re: EMS Event Notification from /system/events/memory/63 ... CRITICAL

Hello,

Event 4400 for memory indicates 120 occurrences of single bit errors (SBE) in 7 days. As stated, it is recommended that you monitor the situation closely. These error messages are sometimes accompanied by event 4200, 4300 or 4500; each one has a different threshold for the number of errors over a given time. If this is a production system or one that requires high availability, you should probably go ahead and open a hardware service call.

As stated earlier in the thread, you can use the cstm, xstm or mstm interfaces to gain further insight into what's going on. If you need help with the diagnostic tools, go to http://docs.hp.com/hpux/diag/index.html

You can accomplish logging a hardware call on ITRC via:

Maintenance & Support -> Collaborate -> hardware calls
Paula J Frazer-Campbell
Honored Contributor

Re: EMS Event Notification from /system/events/memory/63 ... CRITICAL

Hi
From your error log the error is not always on the same memory address.

Log a hardware call with a view to getting this memory swapped out.


HTH

Paula
If you can spell SysAdmin then you is one - anon
Martin Burnett_2
Trusted Contributor

Re: EMS Event Notification from /system/events/memory/63 ... CRITICAL

Hello,

According to your report output:

1. All of the memory errors are occurring on Phys Bank 0 (note that its address range is 0x00..., while Phys Bank 1 has an address range of 0x10...). As has been stated before, you probably need to have a HW call opened and the memory replaced.

2. As for determining whether this is causing your application to fail, you can use the ipcs command to see what shared memory addresses are being used by the application. However, these addresses may move around depending on what memory is available when the application is started and when it gets a slice of memory to use. You would therefore need to know the memory in use at the time of failure to make any kind of judgement.

3. In the future I would highly recommend that you remove any node specific information (IP addresses, hostname etc.) that might be a security risk to your systems before posting information on public web sites. This information is normally not needed by people trying to help you resolve the issue and while I would hope that this information could not be used destructively...better to be safe than sorry.

4. To that end I am removing your attachment and reposting it minus the node information.

5. Please assign points to the hard working forum members that have given time and effort to provide suggestions to help you resolve your issue.

Martin
Martin Burnett_2
Trusted Contributor

Re: EMS Event Notification from /system/events/memory/63 ... CRITICAL

This is a repost of the original response by GRP edited to remove non-essential information:

Hi,

First of all, thanks a lot ...

The result of the "echo 'selclass ...; info; wait; infolog' | cstm" is in the attached file.

- Can you help us understand this log? I think there is a hardware problem on a memory bank, but which one? Is that right?

- On this server, we have an application which often crashes. Is it possible to link these crashes to our memory problem, and if so, how?

Thanks in advance.
Sanjay_6
Honored Contributor

Re: EMS Event Notification from /system/events/memory/63 ... CRITICAL

Hi,

It looks to me as if the problem is with the 2nd module:

8 1 0A/0B 256 enabled 0x10000000 - 1fffffff N/A

Error address: 0x00000000142667d8
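
As a quick sanity check on the figures above, the error address can be compared against the module's address range. This is only a sketch: `in_range` is a hypothetical helper, and it assumes a shell whose arithmetic accepts 0x-prefixed constants (bash or dash, for example). Older Korn-style shells may need 16#... notation instead.

```shell
#!/bin/sh
# Hypothetical helper: report whether ADDRESS lies within START..END.
# Usage: in_range ADDRESS START END   (all 0x-prefixed hex values)
in_range() {
    # Convert the hex arguments to decimal via shell arithmetic
    addr=$(( $1 ))
    start=$(( $2 ))
    end=$(( $3 ))
    if [ "$addr" -ge "$start" ] && [ "$addr" -le "$end" ]; then
        echo yes
    else
        echo no
    fi
}

# Values quoted in this post: error address and the module's range
in_range 0x00000000142667d8 0x10000000 0x1fffffff
```

With the values from this post the check prints "yes", which is consistent with the error address falling in the 0x10000000 - 1fffffff range.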

Hope this helps.

Regds