
Syslog Message.

 
david_252
Frequent Advisor

Syslog Message.

Hi:

I have attached the syslog message (11.0, N-class) and the corresponding action I took. Please let me know what I should do next.

Syslog Message:
Jan 20 12:49:05 EMS [1827]: ------ EMS Event Notification ------
Value: "MAJORWARNING (3)" for Resource: "/system/events/memory/192"
(Threshold: >= " 3") Execute the following command to obtain
event details: /opt/resmon/bin/resdata -R 119734274 -r
/system/events/memory/192 -n 119734277 -a
-----------------------------------------------
/opt/resmon/bin/resdata -R 119734274 -r /system/events/memory/192
-n 119734277 -a OUTPUT BELOW.....



CURRENT MONITOR DATA:

Event Time..........: Mon Jan 20 12:49:05 2003
Severity............: MAJORWARNING
Monitor.............: dm_memory
Event #.............: 4300
System..............:

Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cab/Cell or Node: 0
MC/EXT: 1
DIMM: 0b

is experiencing correctable single bit errors (SBE) on a single
component.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it may be advisable to
monitor the situation. If an excessive rate of single bit errors occur, an
event with higher severity will be generated.

Additional Event Data:
System IP Address...:
Event Id............: 0x3e2c369100000000
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_memory.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 70
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800
EMS Version.....................: A.03.20
STM Version.....................: A.30.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_memory.htm#4300

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v



Component Data:
Physical Device Path....: 192
Tag 2...................: 20

Thanks in advance,

David.
13 REPLIES
Rick Garland
Honored Contributor

Re: Syslog Message.

This is reporting a single-bit error. Every once in a while you may receive these errors. As long as the errors are not frequent or numerous, you can simply monitor them. If you receive numerous single-bit errors, or they occur frequently, you most likely have memory issues. You can use stm to get more detail, for example which bank is reporting the trouble.

Do keep an eye on it.
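
One way to "keep an eye on it" is to count how often the EMS memory notification shows up in syslog. The following is only a rough sketch: it assumes each notification starts with a header line like the one pasted in the original post (optionally with a hostname after the timestamp) and that the resource is named on a following line. The function name `count_memory_events` and the two regular expressions are made up for illustration and are not part of EMS or STM.

```python
import re
from collections import Counter

# Header line of an EMS notification, as pasted in the original post, e.g.
# "Jan 20 12:49:05 EMS [1827]: ------ EMS Event Notification ------"
# Assumption: a hostname may or may not appear after the timestamp.
HEADER = re.compile(
    r"(\w{3})\s+(\d{1,2}) \d{2}:\d{2}:\d{2} (?:\S+ )?EMS \[\d+\]: "
    r"-+ EMS Event Notification -+"
)
# The resource named in the notification body (memory events only).
RESOURCE = re.compile(r'Resource: "(/system/events/memory/[^"]*)"')


def count_memory_events(lines):
    """Count EMS memory notifications per (month, day, resource)."""
    counts = Counter()
    pending_date = None  # date of the last header not yet paired with a resource
    for line in lines:
        header = HEADER.match(line)
        if header:
            pending_date = (header.group(1), int(header.group(2)))
            continue
        resource = RESOURCE.search(line)
        if resource and pending_date:
            counts[pending_date + (resource.group(1),)] += 1
            pending_date = None
    return counts


# Demo using the two lines quoted in the original post.
sample = [
    "Jan 20 12:49:05 EMS [1827]: ------ EMS Event Notification ------",
    'Value: "MAJORWARNING (3)" for Resource: "/system/events/memory/192"',
]
for key, n in count_memory_events(sample).items():
    print(key, n)
```

On a real system you would pass the open syslog file instead of `sample`. A count that keeps climbing from week to week is the "frequent and/or numerous" case described above.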
Marco Santerre
Honored Contributor

Re: Syslog Message.

At this point, you're getting some memory errors that are being self-corrected. For now, I would just keep watching to see how often this repeats.

You may also want to run stm and run the Information tool against your memory to check your PDT. If you have a couple of entries in there, it may be a good idea to place a call about a possible problem with your memory board.
Cooperation is doing with a smile what you have to do anyhow.
Jeff Schussele
Honored Contributor

Re: Syslog Message.

Hi David,

As Rick has mentioned a sole single-bit error or several over a long period is nothing to become alarmed about - that's what ECC memory is designed to handle.
What I would recommend is to go into stm & check the PDT (Page Deallocation Table) & verify that you don't have a bunch of pages deallocated. That would indicate a DIMM(s) is going bad.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Hai Nguyen_1
Honored Contributor

Re: Syslog Message.

David,

Since the error was self corrected, there is nothing you can do now but keep monitoring the situation based on the following action recommended in your post:

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it may be advisable to
monitor the situation. If an excessive rate of single bit errors occur, an
event with higher severity will be generated.

Hai
fg_1
Trusted Contributor

Re: Syslog Message.

David

We started receiving this error on our V-2600 server. It started out similarly, about one a month or so; then, without warning, we lost a stick of memory and the server deallocated that memory from use. We had to schedule a downtime period to have the CE come in and replace the stick.

My advice to you is to go ahead and have the stick replaced, as long as you have a maintenance agreement on the server. Don't wait for it to turn into a bigger problem.

Gl

Frank G.
Eugeny Brychkov
Honored Contributor

Re: Syslog Message.

I strongly agree with everyone saying that you should replace this memory module. It is already reporting single-bit errors (70 events). These can be corrected with CRC, but if a double-bit error occurs, the server is likely to crash.
Eugeny
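
To illustrate why a single-bit error is survivable while a double-bit error is not, here is a toy single-error-correct, double-error-detect (SECDED) code of the kind ECC memory commonly uses. This is only a sketch on 4 data bits using an extended Hamming code. The thread does not say which code this server's memory controller actually uses, and the function names `encode` and `decode` are made up for the example.

```python
def encode(data_bits):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1).

    Index 0 holds an overall parity bit; indexes 1..7 form a Hamming(7,4)
    codeword with parity bits at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data_bits
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d1, d2, d3, d4
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd parity: treat as one flipped bit. The syndrome is its position
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, cannot repair.
        return "uncorrectable", None
    return status, [code[3], code[5], code[6], code[7]]


word = encode([1, 0, 1, 1])

single = list(word)
single[6] ^= 1                      # one bit flipped
print(decode(single)[0])

double = list(word)
double[2] ^= 1
double[5] ^= 1                      # two bits flipped
print(decode(double)[0])
```

The first case is repaired silently, which is what gets logged as a correctable SBE. The second can only be flagged, which is why a double-bit error on real hardware usually takes the system down.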
david_252
Frequent Advisor

Re: Syslog Message.

Thanks a lot. Since we have a maintenance agreement, I think I can try that. In the meantime, if I have to run stm, can I run it as a non-root user? If so, can it be run at any time of day (so as not to disturb the peak period)? Please also advise me what to look for in that report.

Thanks much for all the responses..

Thanks
David.
Marco Santerre
Honored Contributor

Re: Syslog Message.

Yes, you can run it as a non-root user.

Basically, find the memory and highlight it, then click on Information, then click on Run. This will generate a log, and at the bottom of the log you will find the PDT entries. It will tell you how many are Free, how many are Used, and how many are Available.
Cooperation is doing with a smile what you have to do anyhow.
Marco Santerre
Honored Contributor

Re: Syslog Message.

Sorry, I forgot. Yes, you can run it at any time of the day. It is non-disruptive, especially when you are only gathering information.
Cooperation is doing with a smile what you have to do anyhow.
david_252
Frequent Advisor

Re: Syslog Message.

Hi Again:

I have attached my stm report. Can someone advise accordingly?

Thanks
David.
Jeff Schussele
Honored Contributor
Solution

Re: Syslog Message.

Hi David,

Well... 3 pages (12 MB) deallocated is not that bad. But on the other hand, they're all on the same DIMM. You're not in any imminent danger, but I would think that you should schedule downtime to have that particular DIMM replaced under warranty within the next couple of weeks, just to be safe, and to get that 12 MB of RAM back.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
david_252
Frequent Advisor

Re: Syslog Message.

Hi:

Thanks for all the answers. I just have a couple more questions to clarify/learn:

1. How do we know from the reports (attached above) that it occurs on the same DIMM? (Jeff, please help.)

2. CRC: what is this?

3. Is STM the only way to find this error, or is there an alternate method?

Please advise.

Thanks
David.
Jeff Schussele
Honored Contributor

Re: Syslog Message.

Hi (again) David,

Ok, I'll tackle your questions.

1) From the stm report you'll note:
A) On the error log summary, ALL errors occurred on DIMM EXT1 (memory carrier 1) / 0b (slot b; all memory comes in DIMM pairs, and in this case 0/a & 0/b are a pair). So you can see that ALL errors are on the 0b DIMM.
B) On the PDT, you can match the addresses (0x03dbf38, b38 & fb8) back to the summary to verify that these are all on EXT1/0b, even though the PDT references both 0a & 0b.

2) CRC - Cyclic Redundancy Check. This is the error-checking scheme employed. Basically, it is a checksum used to determine whether a value has changed or not.

3) Although STM is the best way, there are others. The system log (/var/adm/syslog/syslog.log) and the GSP error log are other places where these errors are recorded.
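
As a small illustration of the checksum idea in 2), here is a sketch using the CRC-32 function from Python's standard `zlib` module. It only shows the general principle of detecting a changed value. It is not how the memory hardware does it, and a plain checksum comparison like this one only detects a change; actually repairing a flipped bit is the job of the ECC memory mentioned earlier in the thread. The helper name `crc_matches` and the sample bytes are made up for the example.

```python
import zlib


def crc_matches(payload: bytes, stored_crc: int) -> bool:
    """Return True if the payload still produces the checksum stored earlier."""
    return zlib.crc32(payload) == stored_crc


original = b"example data"
stored = zlib.crc32(original)   # checksum recorded when the data was written

flipped = bytearray(original)
flipped[0] ^= 0x01              # simulate a single bit flipping in the first byte

print(crc_matches(original, stored))        # unchanged data still matches
print(crc_matches(bytes(flipped), stored))  # the flipped bit is detected
```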

HTH,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!