
Re: EMS Event Notification (memory)

 
Nabil Boussetta
Frequent Advisor

EMS Event Notification (memory)

I have an rp8400 running HP-UX 11i v1.
I see the following notifications in the syslog file:

Aug 18 01:00:12 CENTRAL1 EMS [2520]: ------ EMS Event Notification ------ Value: "MAJORWARNING (3)" for Resource: "/system/events/memory/0_5" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 165150722 -r /system/events/memory/0_5 -n 165150721 -a

Aug 21 03:41:44 CENTRAL1 EMS [2520]: ------ EMS Event Notification ------ Value: "SERIOUS (4)" for Resource: "/system/events/memory/0_5" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 165150722 -r /system/events/memory/0_5 -n 165150723 -a

Aug 22 08:26:17 CENTRAL1 EMS [2520]: ------ EMS Event Notification ------ Value: "MAJORWARNING (3)" for Resource: "/system/events/memory/0_5" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 165150722 -r /system/events/memory/0_5 -n 165150724 -a



When I run the command /opt/resmon/bin/resdata -R 1651 ... it gives:

ARCHIVED MONITOR DATA:

Event Time..........: Sun Aug 21 03:41:44 2005
Severity............: SERIOUS
Monitor.............: dm_memory
Event #.............: 4400
System..............: CENTRAL1

Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cab/Cell or Node: 0/1
MC/EXT: 0
DIMM: 0C

is experiencing a high rate of correctable single bit errors on a
single component.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it is advisable to
closely monitor the situation. If an excessive rate of single bit errors
occur, an event with higher severity will be generated.

Additional Event Data:
System IP Address...: 172.20.0.101
Event Id............: 0x4307e9e800000000
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_memory.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 100
Received within...: 7 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/S16K-A
EMS Version.....................: A.04.00
STM Version.....................: A.41.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_memory.htm#4400

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v



Component Data:
Physical Device Path....: 0/5
Tag 2...................: 20



My questions are:
- Should I replace the memory module?
- Why is the EMS notification sometimes "MAJORWARNING (3)" and sometimes "SERIOUS (4)"?
Magnus Linner_2
Occasional Advisor

Re: EMS Event Notification (memory)

At this time you don't have to replace the memory. The memory you have can correct single-bit errors. Keep an eye on it, though, in case the situation gets worse.

I'm not sure why you get different warning levels.
Devender Khatana
Honored Contributor

Re: EMS Event Notification (memory)

Hi,

You need to replace at least the module listed here, because the notification is repeating very frequently. The notification is reported as SERIOUS or MAJORWARNING depending on the number of errors logged within a given period. The event whose details are shown above qualified after 100 occurrences within 7 days; a notification of a different severity would correspond to a different error count.

I advise replacing the part listed here.

HTH,
Devender
DCE
Honored Contributor

Re: EMS Event Notification (memory)

I had the same warning earlier this year. When I talked to an HP engineer, he stated that unless the error is occurring multiple times, it is not necessary to change the memory module. If, and when, the error starts recurring, then the module should be replaced.

Dave
Pedro Cirne
Esteemed Contributor

Re: EMS Event Notification (memory)

Hi,

You're getting errors almost every day. I think you should plan a memory replacement, or you risk an unplanned server crash.

Enjoy :)

Pedro
Mel Burslan
Honored Contributor

Re: EMS Event Notification (memory)

As previous posts indicated, Single Bit memory errors are not fatal to the system most of the time, but as a good sysadmin, it is your responsibility to keep an eye on the situation. This command will generate a memory report on your system:

echo 'selclass qualifier memory;info;wait;infolog' | cstm >/tmp/meminfo.`date +%m%d%H%M`

Run this from cron hourly or at whatever interval you choose. In this report you will see a section like the following (yours may vary depending on the number of memory modules and their sizes, but the layout will be the same):

Memory Error Log Summary

                                                         Error
Ext/DIMM      Error Address       Error Type  Page       Count
------------  ------------------  ----------  ---------  -----
EXT0/2b       0x00000000632a9681  Single-Bit  0x00632a9      1
EXT0/0a       0x000000007088c5c1  Single-Bit  0x007088c     12
EXT0/0a       0x00000000708cc5c1  Single-Bit  0x00708cc      2
EXT0/0b       0x000000002ccfd801  Single-Bit  0x002ccfd      1
EXT0/3a       0x0000000004bcfd01  Single-Bit  0x0004bcf      1
EXT0/2b       0x000000006767e281  Single-Bit  0x006767e      2
EXT0/0b       0x00000000510bc001  Single-Bit  0x00510bc      1
EXT0/1b       0x000000003758db81  Single-Bit  0x003758d      1
EXT0/1b       0x000000004198d101  Single-Bit  0x004198d      5
EXT0/1b       0x00000000419c1101  Single-Bit  0x00419c1      1
EXT0/3b       0x00000001853b9041  Single-Bit  0x01853b9      1
EXT0/3b       0x0000000009b9a5c1  Single-Bit  0x0009b9a      7
EXT0/2b       0x000000004367e1c1  Single-Bit  0x004367e      1

System start: Fri Nov 14 16:38:45 2003.
Last error check: Mon Aug 22 07:31:37 2005.


The important information for you is in the last column of each line, which is the error count for that module and address. If the count for one or more modules keeps increasing, I would place a hardware service call to get the module replaced. Otherwise, one or two new errors every few days is nothing to lose sleep over.
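
To make the "is the count increasing?" check less manual, here is a minimal sketch that compares two saved reports and lists the modules whose total error count grew. It assumes the error rows look like the sample above (module name first, a 0x address second, the count last); the function name dimm_increases and the report file names in the usage lines are made up for illustration, and the layout on your system may differ.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved cstm memory reports and print
# every module whose summed error count is higher in the newer report.
# Assumes error rows shaped like the sample in this thread, e.g.
#   EXT0/0a 0x000000007088c5c1 Single-Bit 0x007088c 12
dimm_increases() {
    # $1 = older report, $2 = newer report
    awk '
        # Only rows with a hex address in field 2 and a numeric last field.
        $2 ~ /^0x/ && $NF ~ /^[0-9]+$/ {
            if (FILENAME == ARGV[1]) old[$1] += $NF
            else                     cur[$1] += $NF
        }
        END {
            for (d in cur)
                if (cur[d] > old[d] + 0)
                    print d, old[d] + 0, "->", cur[d]
        }' "$1" "$2" | sort
}

# Example call (file names are placeholders):
#   dimm_increases /tmp/meminfo.08210700 /tmp/meminfo.08220700
```

Each output line has the form "module old_total -> new_total". A module that appears only in the newer report is shown with an old total of 0.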
Raj D.
Honored Contributor

Re: EMS Event Notification (memory)

Hi Nabil,

Please check whether this error occurs regularly or appeared only once:

# cat /var/adm/syslog/syslog.log | grep "memory/0_5"

If it is occurring every day, raise it with HP to get the memory module replaced before it becomes serious.
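
If you want a quick tally rather than raw grep output, the following is a rough sketch that counts the notifications for one resource by severity. It assumes each EMS notification sits on a single syslog line in the form quoted in the original post; the function name ems_severity_counts is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: count EMS notifications for one resource, grouped
# by severity name. Assumes one notification per syslog line, containing
#   Value: "SEVERITY (n)" for Resource: "<resource path>"
ems_severity_counts() {
    resource=$1
    logfile=$2
    grep "EMS Event Notification" "$logfile" |
    grep "Resource: \"$resource\"" |
    # Keep only the severity name from Value: "NAME (n)".
    sed 's/.*Value: "\([A-Z]*\) ([0-9]*)".*/\1/' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example call:
#   ems_severity_counts /system/events/memory/0_5 /var/adm/syslog/syslog.log
```

The output is one line per severity, each followed by the number of notifications seen, which makes it easier to tell a one-off from a daily pattern.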

Cheers ,

RajD.