
memory errors

 
SOLVED
Brian Ham
Advisor


I am noticing lots of memory errors in my syslog.log on an N-Class server running HP-UX 11.0.

Here is the message. Any ideas what this means?
Title: dm_memory
Command: /usr/sbin/stm/uut/bin/tools/monitor/dm_memory
Vendor: Hewlett-Packard Company
Version: B.01.00
Monitor PID: 5784

Resources currently monitored:
/system/events/memory/192

Thanks
8 REPLIES
Ceesjan van Hattum
Esteemed Contributor

Re: memory errors

Your description does not look like an error to me, just a notification that some monitoring is active.
Please check the log files that accompany this monitor, or post the actual errors.
Regards,
Ceesjan
S.K. Chan
Honored Contributor
Solution

Re: memory errors

Run cstm to check the memory log in more detail and post the results here. At the cstm prompt, enter "ru logtool", and at the Logtool prompt, enter "vda".

# cstm
cstm > ru logtool
Logtool Utility > vda
Brian Ham
Advisor

Re: memory errors

Sorry, my previous message was the information from the mail log. Here is what shows up in syslog.log.

Mar 14 15:19:34 system EMS [4863]: ----- EMS Monitor Restart -----
Title: dm_memory
Command: /usr/sbin/stm/uut/bin/tools/monitor/dm_memory
Vendor: Hewlett-Packard Company
Version: B.01.00
To obtain a list of currently monitored resources, execute the following:
/opt/resmon/bin/resdata -M 3111254123

Helen French
Honored Contributor

Re: memory errors

Brian Ham
Advisor

Re: memory errors

From cstm> logtool....

-- Logtool Utility: View Memory Report --

System Start Time Sun Jul 1 17:16:55 2001

Last Error Check Time Thu Apr 4 10:30:03 2002

Logging Time Interval 120

Extender Card in Slot EXT0
==========================================================
DIMM Slot: 0b
Error Type: Single/unconfirmed: single-bit error that
could not be confirmed as either soft or hard.
Page Status: Pending: page could not be obtained.
Bit Num: 16
Logged By: Memlogd
First Detected: Wed Apr 3 23:16:08 2002

Last Detected: Thu Apr 4 10:30:03 2002

Error Count: 323
Error Addr: 0xf6b801
==========================================================

Extender Card in Slot EXT0
==========================================================
DIMM Slot: 0b
Error Type: Single/soft: unrepeatable single-bit error.
Page Status: Deallocated: page is no longer in use.
Bit Num: 16
Logged By: Memlogd
First Detected: Wed Apr 3 23:12:07 2002

Last Detected: Wed Apr 3 23:34:15 2002

Error Count: 2
Error Addr: 0x68307ec1
==========================================================

Extender Card in Slot EXT0
==========================================================
DIMM Slot: 0b
Error Type: Single/unconfirmed: single-bit error that
could not be confirmed as either soft or hard.
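
For anyone who wants to total these entries up rather than read them one by one, here is a rough Python sketch that parses the two complete records from the report above and sums the error counts per DIMM slot and error class. The function name parse_records and the assumption that each record starts with an "Extender Card in Slot" line are mine, based only on the output pasted here; other logtool versions may lay the report out differently.

```python
from collections import Counter

# The two complete records from the report above, pasted as text.
SAMPLE = """\
Extender Card in Slot EXT0
==========================================================
DIMM Slot: 0b
Error Type: Single/unconfirmed: single-bit error that
could not be confirmed as either soft or hard.
Page Status: Pending: page could not be obtained.
Bit Num: 16
Logged By: Memlogd
First Detected: Wed Apr 3 23:16:08 2002
Last Detected: Thu Apr 4 10:30:03 2002
Error Count: 323
Error Addr: 0xf6b801
==========================================================

Extender Card in Slot EXT0
==========================================================
DIMM Slot: 0b
Error Type: Single/soft: unrepeatable single-bit error.
Page Status: Deallocated: page is no longer in use.
Bit Num: 16
Logged By: Memlogd
First Detected: Wed Apr 3 23:12:07 2002
Last Detected: Wed Apr 3 23:34:15 2002
Error Count: 2
Error Addr: 0x68307ec1
==========================================================
"""


def parse_records(text):
    """Split the report into one dict per record (hypothetical helper)."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Extender Card in Slot"):
            # A new record begins; store the previous one, if any.
            if current:
                records.append(current)
            current = {"Extender": line.split()[-1]}
        elif ":" in line and not line.startswith("="):
            # "Key: value" line; split on the first colon only.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
        # Separator, blank, and continuation lines are skipped.
    if current:
        records.append(current)
    return records


records = parse_records(SAMPLE)
totals = Counter()
for rec in records:
    error_class = rec["Error Type"].split(":")[0]
    totals[(rec["DIMM Slot"], error_class)] += int(rec["Error Count"])

for (slot, error_class), count in totals.items():
    print(f"slot {slot}  {error_class}: {count}")
```

With the two records shown, every error lands on slot 0b, which is why the rest of the thread focuses on that DIMM.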



S.K. Chan
Honored Contributor

Re: memory errors

In that case, try running:

# /opt/resmon/bin/resdata -M 3111254123

The "311.." value is the "monitor key"; hopefully it will show you more detail about what's going on. I still think your best bet is to look at the cstm log file, as suggested earlier.

Also run

# /opt/resmon/bin/resdata -h

for more detail on the "resdata" syntax.
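
If the restart notice has to be handled in a script, the monitor key can be pulled straight out of the syslog text. This is only a sketch in Python: the helper name monitor_key is made up, and the pattern assumes the key is the run of digits following "resdata -M", as in the message posted above.

```python
import re

# The restart notice from syslog.log, joined into a single line.
SYSLOG_LINE = (
    "Mar 14 15:19:34 system EMS [4863]: ----- EMS Monitor Restart ----- "
    "Title: dm_memory Command: /usr/sbin/stm/uut/bin/tools/monitor/dm_memory "
    "Vendor: Hewlett-Packard Company Version: B.01.00 "
    "To obtain a list of currently monitored resources, execute the following: "
    "/opt/resmon/bin/resdata -M 3111254123"
)


def monitor_key(line):
    """Return the digits that follow 'resdata -M', or None if absent."""
    match = re.search(r"resdata\s+-M\s+(\d+)", line)
    return match.group(1) if match else None


key = monitor_key(SYSLOG_LINE)
# Rebuild the command the message asks you to run.
command = f"/opt/resmon/bin/resdata -M {key}"
print(command)
```

The key is kept as a string so it can be passed to the command unchanged.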


S.K. Chan
Honored Contributor

Re: memory errors

This is a single-bit error on the memory in slot 0b.
The good news is that the page can be deallocated when the error is detected. I've seen single-bit errors where the page could not be deallocated, and the system had to be rebooted to deallocate it. The bad news is that the system cannot determine whether it's a soft or a hard error, which means there is a high possibility that this is a hardware problem (i.e. a memory problem). It's time to call HP for a further course of action. I would not be surprised if the memory has to be replaced.
Ross Martin
Trusted Contributor

Re: memory errors

Hello Brian,

A single-bit error message indicates that an error-count threshold was hit, which is what generated the message. These kinds of errors are always recoverable by the system.

What matters is where in memory the errors occur: user area versus system area. If the error occurs in the system area, it can affect the system, so the system deallocates the affected memory and reassigns it to a location that does not have single-bit errors.

If the error count gets too high, it might be a candidate for a hardware repair, but this is a rarity.

The memory errors to really be concerned about are multi-bit errors. If one occurs, the error-correction algorithm may not be able to correct it, causing an unrecoverable parity error. The hardware would definitely need to be addressed at that point.
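
To put a number on "too high", one option is to turn the count and the First/Last Detected times from the logtool report into an hourly rate. The Python sketch below does that for the 323-error entry posted earlier in the thread. The helper names are hypothetical, and no threshold is applied because none is given in this thread; whether a rate is acceptable is something to confirm with HP.

```python
from datetime import datetime


def parse_stamp(stamp):
    """Parse a logtool timestamp such as 'Wed Apr 3 23:16:08 2002'."""
    # Drop the weekday name; the remaining fields identify the moment.
    without_weekday = " ".join(stamp.split()[1:])
    return datetime.strptime(without_weekday, "%b %d %H:%M:%S %Y")


def errors_per_hour(first, last, count):
    """Average errors per hour between first and last detection."""
    elapsed = (parse_stamp(last) - parse_stamp(first)).total_seconds()
    if elapsed <= 0:
        return None  # a rate is undefined for a zero-length window
    return count * 3600 / elapsed


rate = errors_per_hour(
    "Wed Apr 3 23:16:08 2002",
    "Thu Apr 4 10:30:03 2002",
    323,
)
print(f"{rate:.1f} errors/hour")
```

For that entry the window is a little over 11 hours, so the result works out to roughly 29 errors per hour on the same bit of the same DIMM.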

Hope this helps.

Ross Martin