Operating System - HP-UX

Yogeeraj_1
Honored Contributor

Possible problem with memory module?

Hello experts,

For quite some time, the root user on one of our servers has been receiving a mail message with the subject "L1000: Event Monitor Notification".

The first lines are as follows:
============================================================
L1000 sent Event Monitor notification information:

/system/events/memory/8 is >= 3.
Its current value is CRITICAL(5).

Event data from monitor:

Event Time..........: Wed Jan 15 08:27:47 2003
Severity............: CRITICAL
Monitor.............: dm_memory
Event #.............: 4500
System..............: L1000

Summary:
Memory Event Type : Single bit error (SBE) event. A correctable single
bit error has been detected and logged.


Description of Error:

The memory component:

Cell/Node: 0
MC/EXT: 0
DIMM: 1b

is experiencing an excessive number of single bit errors.

Probable Cause / Recommended Action:

Although the single bit errors are being corrected, it is strongly advisable
to evaluate whether any memory replacement is warranted at this time. This
condition indicates a potential problem.

============================================================
(see attachment for full details)
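
For anyone who wants to keep a history of these mails, here is a minimal Python sketch that pulls the header fields and the DIMM location out of a notification like the one above. The function name `parse_notification` and the regular expressions are illustrative assumptions based only on the layout quoted in this post; other monitors or EMS versions may format their mails differently.

```python
import re
from datetime import datetime

# Excerpt of the notification quoted in this post.
SAMPLE = """\
Event Time..........: Wed Jan 15 08:27:47 2003
Severity............: CRITICAL
Monitor.............: dm_memory
Event #.............: 4500
System..............: L1000

The memory component:

Cell/Node: 0
MC/EXT: 0
DIMM: 1b
"""

# Header lines of the form "Key......: value" (assumed layout).
DOTTED = re.compile(r"^([A-Za-z#][A-Za-z# ]*?)\.+:\s*(.+)$")
# Location lines of the form "Key: value" (assumed layout).
LOCATION = re.compile(r"^(Cell/Node|MC/EXT|DIMM):\s*(\S+)$")


def parse_notification(text):
    """Return a dict of the fields found in one notification mail body."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = DOTTED.match(line) or LOCATION.match(line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    if "Event Time" in fields:
        # Timestamp format as it appears in the mail; assumes English day/month names.
        fields["Event Time"] = datetime.strptime(
            fields["Event Time"], "%a %b %d %H:%M:%S %Y"
        )
    return fields


if __name__ == "__main__":
    info = parse_notification(SAMPLE)
    print(info["Event Time"], info["Severity"], "DIMM", info["DIMM"])
```

Feeding each saved mail through this and counting events per DIMM gives a rough picture of how often a given slot is being reported.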

We have restarted the server several times and the problem persists.

The memory modules that we are using are non-HP (Dataram).

There are no errors at the OS level so far (/var/adm/syslog/syslog.log).

How do I troubleshoot further?

Thank you in advance for the replies.

Best Regards
Yogeeraj
No person was ever honoured for what he received. Honour has been the reward for what he gave (Calvin Coolidge)
9 REPLIES
Stefan Farrelly
Honored Contributor

Re: Possible problem with memory module?

I've seen this problem many times before. The solution is:

1. Install the latest diagnostics and see if it recurs. This sometimes fixes it, since the cause can be a bug in the diags.

2. Reseat the memory. Shut down, remove the modules and simply re-insert them, then clear the PDT (Page Deallocation Table) on reboot (in the PDC). This almost always fixes it.

In my experience it's very unlikely (<1%) that it's an actual problem with the physical memory.
I'm from Palmerston North, New Zealand, but somehow ended up in London...
Yogeeraj_1
Honored Contributor

Re: Possible problem with memory module?

Hi Stefan,

Thank you for the reply.

One last question: how do I "clear the PDT table"?
Or where can I get additional information on that?

Thank you for a reply

Best regards
Yogeeraj
Stefan Farrelly
Honored Contributor

Re: Possible problem with memory module?


On an L-class you do it when the server boots: interrupt the autoboot sequence, then go to (I think) the configuration menu, where there is an option to clear the PDT. You can confirm this in your L-class user's or operations manual, or search for it on docs.hp.com.
Yogeeraj_1
Honored Contributor

Re: Possible problem with memory module?

OK.

I guess the L-class boot menu interface is similar to that of a V-Class:
===================================================================================
Command                           Description
-------                           -----------
AUto [BOot|SEArch|Force ON|OFF]   Display or set the specified flag
BOot [PRI|ALT| ]                  Boot from a specified path
BootTimer [time]                  Display or set boot delay time
CLEARPIM                          Clear PIM storage
CPUconfig [] [ON|OFF|SHOW]        (De)Configure/Show Processor
DEfault                           Set the system to defined values
DIsplay                           Display this menu
ForthMode                         Switch to the Forth OBP interface
IO                                List the I/O devices in the system
LS [|flash]                       List the boot or flash volume
PASSword                          Set the Forth password
PAth [PRI|ALT|CON] []             Display or modify a path
PDT [CLEAR|DEBUG]                 Display/clear Non-Volatile PDT state
PIM_info [cpu#] [HPMC|TOC|LPMC]   Display PIM of current or any CPU
RemoteCommand node# command       Execute command on a remote node
RESET [hard|debug]                Force a reset of the system
RESTrict [ON|OFF]                 Display/Select restricted access to Forth
SCSI [INIT|RATE] [bus slot val]   List/Set SCSI controller parms
SEArch []                         Search for boot devices
SECure [ON|OFF]                   Display or set secure boot mode
TIme [cn:yr:mo:dy:hr:mn[:ss]]     Display or set the real-time clock
VErsion                           Display the firmware versions
[0] Command:
===================================================================================

What are the risks of something going wrong after clearing the Non-volatile PDT state? ;)
This is a production server.

We have already planned for a downtime early Friday morning.

Regards
Yogeeraj
Stefan Farrelly
Honored Contributor

Re: Possible problem with memory module?

There is no danger in clearing the PDT. But before you do so, I would install the latest diags and/or reseat the memory, so that you can see whether your single-bit errors recur after the reboot and PDT clear.
Yogeeraj_1
Honored Contributor

Re: Possible problem with memory module?

Thank you Stefan,

Most probably, we will just reseat the memory modules and clear the PDT. After that, we will monitor whether the problem recurs.

I will update this thread again later.

Thank you again for your time and precious guidance.

Best Regards
Yogeeraj
Eugeny Brychkov
Honored Contributor

Re: Possible problem with memory module?

The reported problem points to the DIMM in slot 1b on memory carrier 0. As long as it experiences only single-bit errors, the data can be recovered. If a double-bit error occurs, the server may crash.
Until you replace the suspect DIMM, I recommend that you do not clear the PDT (from the service menu in the PDC), but simply dump the STM memory information output into a file for future reference. If you see that this DIMM has reported ~10 single-bit errors over 2 years of operation, this may be OK, but if you see 100-200 single-bit errors, replace it.
Anyway, if you have an HP contract, call them and have the DIMM replaced. Then clear the PDT.
Keep in mind that if a double-bit error occurs, the system will crash. So it is up to you to decide whether to replace it or not.
Eugeny
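
The rule of thumb above (about 10 single-bit errors over two years may be OK, 100-200 means replace) can be written down as a small Python sketch. The function name `assess_sbe_count`, the default thresholds, and the "keep monitoring" verdict for counts in between are assumptions added for illustration; the post itself does not say what to do in the middle range, and the count would have to be read manually from the saved STM memory information.

```python
def assess_sbe_count(count, ok_max=10, replace_min=100):
    """Classify a DIMM's logged single-bit error count.

    ok_max and replace_min follow the rough figures given in this post;
    the middle verdict is an assumption, not part of the original advice.
    """
    if count < 0:
        raise ValueError("error count cannot be negative")
    if count >= replace_min:
        return "replace"
    if count <= ok_max:
        return "probably ok"
    return "keep monitoring"


if __name__ == "__main__":
    for errors in (8, 40, 150):
        print(errors, "->", assess_sbe_count(errors))
```

Whatever the count, the advice in this post still applies: single-bit errors are corrected, but a double-bit error on the same DIMM can bring the system down.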
Steven E. Protter
Exalted Contributor

Re: Possible problem with memory module?

Is dmesg of any help in these circumstances?

I've never had a bad memory module (stay away, evil eye!) and was wondering if this might help.

Steve
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Yogeeraj_1
Honored Contributor

Re: Possible problem with memory module?

Closing thread.
It turned out to be a defective third-party memory module, which we replaced.
Thanks again
Yogeeraj