
Re: memory error question

 
Gary Glick
Frequent Advisor

memory error question

I've attached the stm output for my system memory. I've noticed several entries with multi-bit errors, but the error count is zero. The fact that multi-bit errors are even mentioned has me nervous. Could someone with a bit more experience than I have check the attached file and let me know what they think?

The server is a D370 running HP-UX 11.0 and Openmail. Openmail is failing to start after a power failure.

Thank you

Gary
Christian Tremblay
Trusted Contributor

Re: memory error question

I suggest that you open a hardware call with HP support and have them take a look at your stm output, as double-bit memory errors usually mean bad memory hardware.
Torsten.
Acclaimed Contributor

Re: memory error question

A multi-bit error will cause a crash (HPMC) immediately. Did this happen in the last few weeks?

The errors may be caused by bad DIMMs or even a bad memory controller. Some sources also mention "electronic smog" as a cause.

The DIMM 3A has a high count of errors.
Some areas of your memory are already marked bad and no longer used (PDT). Normally a replacement of the DIMMs is needed.

Is this system really up since 2003 without reboot? Please check "uptime".


Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

If you feel this was helpful please click the KUDOS! thumb below!   
Torsten.
Acclaimed Contributor

Re: memory error question

I converted your attachment; perhaps this is more readable for others.

Hope this helps!
Regards
Torsten.

Jeff Schussele
Honored Contributor

Re: memory error question

Hi Gary,

Two things I notice about the report:

1) The counts on *all* the multi-bit are zero.
2) None of the addresses for those errors are in the PDT.

This could indicate that they're old errors and that those DIMMs have been replaced and the PDT reset.
Or those are bogus messages.
I suspect the former, since the Memory Error Log History contains no dates for those errors.
Remember that *any* multi-bit error while the system is up & running will cause a panic. If a multi-bit is detected during POST then the address will be added to the PDT & bootup will proceed.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Torsten.
Acclaimed Contributor

Re: memory error question

The latest firmware for your system (42.11) was created in 2002, but can you check your version anyway?

Try

# echo "selclass qualifier cpu;info;wait;il"|cstm|grep "PDC Firmware"

Hope this helps!
Regards
Torsten.

Gary Glick
Frequent Advisor

Re: memory error question

Firmware revision:

PDC Firmware Revision: 42.11 IODC Revision: 0
PDC Firmware Revision: 42.11 IODC Revision: 0


I've reseated the memory that was causing the errors and my problem remains. I'm going to remove the DIMMs that were throwing the errors and see if that helps.

Something else that may be of issue is this message in the syslog.log file:

May 31 13:47:57 gomail vmunix: SCSI: Unexpected Disconnect -- lbolt: 23946, dev: cb05f002, io_id: 500002d

There are dozens of these. I know it's a SCSI device, but how do I find out which one is causing the messages? This is a D-class server with 3 SCSI disks, and it's connected to a VA7410 disk array. All of the lbolts read the same. I have noticed that one of the disks, 8/4.4.0, has had a number of retries as listed by cstm in the information display.
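
One way to see whether all of those messages point at the same device is to count them per dev value. This is only a rough sketch: count_disconnects is a made-up helper name, and it assumes each message sits on a single line in the log (the wrap in the paste above looks like a copy artifact).

```sh
# Count "SCSI: Unexpected Disconnect" messages per dev value.
# $1 is the log file to scan; output is "<dev> <count>", one per line.
count_disconnects() {
    grep 'SCSI: Unexpected Disconnect' "$1" |
        sed -n 's/.*dev: *\([0-9a-fA-F][0-9a-fA-F]*\).*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It would be run as count_disconnects /var/adm/syslog/syslog.log. If only one dev value shows up, all of the disconnects are being reported against the same device.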
Torsten.
Acclaimed Contributor

Re: memory error question

If you still have a support contract for this machine (which is possible, IMHO), let HP exchange the DIMMs. If you don't have one, you can still ask HP, but it can be a little expensive. It's up to you.

Regarding the SCSI errors, looks like a bad disk, but a closer look (more information) is needed.

Hope this helps!
Regards
Torsten.

Michael Steele_2
Honored Contributor

Re: memory error question

To the best of my knowledge, D-class servers are no longer supported by HP with new firmware or patch releases, and 42.11 is the last version.
Support Fatherhood - Stop Family Law
Gary Glick
Frequent Advisor

Re: memory error question

I'm going to yank the memory in question to see if that helps.

With regard to the lbolt messages, any ideas on what to look for? Here's what I've tried:
1. All disks come up clean in ioscan.
2. Ran echo 2400?20X | adb /dev/dsk/c0txd0 against the drives and they return normally.
3. Ran stm info against the disks and they all come back clean, except that the disk at 8/4.4.0 displays a total of 66 retries.
I'm not sure what else to check.

While the system is down I'll reseat the disks and double-check the cabling.
Torsten.
Acclaimed Contributor
Solution

Re: memory error question

The disk is internal I guess. So there should be no problem with cabling or termination, if the other disks are working without problems. You can run a

# dd if=/dev/rdsk/cxtydz of=/dev/null

against it. If you get I/O errors or see errors in stm, consider replacing this drive.
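
If you want to run that read test over more than one disk, it could be wrapped roughly like this. It is just a sketch: read_check is a made-up name, and it only looks at the exit status of dd, so syslog and the stm logs still need to be checked afterwards.

```sh
# Read a device (or file) from start to end and report whether dd
# finished without an error. $1 is the path to read.
read_check() {
    if dd if="$1" of=/dev/null bs=1024k 2>/dev/null; then
        echo "$1: read OK"
    else
        echo "$1: READ ERRORS"
    fi
}
```

Call it as read_check /dev/rdsk/cxtydz with the real device file filled in. A full read of a disk takes a while, and it adds I/O load, so it is best done while the system is quiet.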

If you remove a pair of DIMMs, re-sort the others. There must not be any gaps: slots 0, 1, 2 ... have to be filled.

I would clear the PDT in the service menu in BCH after re-sorting the DIMMs and run several loops of the exercise test from stm. Be aware that this can crash your system if the DIMMs are really bad, so do it only when no production software is running. After the tests, have a look at the stm logs.

Hope this helps!
Regards
Torsten.

Andrew Rutter
Honored Contributor

Re: memory error question

Gary,

Have you tested the memory with STM and run the exerciser tests? This would be a good place to start. You could also clear the PDT, then run the exercise tests in STM, and then the logtool. See what comes up then.

Also, to see where the lbolt error has come from, check with
# ll /dev | grep 05f002

see what this comes back with. It looks more likely a controller timing out.
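
The 05f002 in that grep is the low part of the dev value (cb05f002) from the syslog message. Here is a rough sketch of how the value could be split up. It assumes the usual HP-UX 11.x layout of an 8-bit major and a 24-bit minor number, with the minor holding card instance, target and LUN in the same way as the cXtYdZ device names; decode_dev is a made-up helper name, and the result should be checked against ioscan and lsdev rather than taken on faith.

```sh
# Split a hex dev value from a syslog message into major/minor and an
# assumed cXtYdZ address. $1 is the hex value without a 0x prefix.
decode_dev() {
    val=$((0x$1))
    major=$(( (val >> 24) & 0xff ))    # top 8 bits (assumed major)
    minor=$(( val & 0xffffff ))        # low 24 bits (assumed minor)
    inst=$(( (minor >> 16) & 0xff ))   # card instance, the "c" number
    tgt=$(( (minor >> 12) & 0xf ))     # SCSI target, the "t" number
    lun=$(( (minor >> 8) & 0xf ))      # LUN, the "d" number
    printf 'major=%d minor=0x%06x c%dt%dd%d\n' \
        "$major" "$minor" "$inst" "$tgt" "$lun"
}

decode_dev cb05f002
```

Under those assumptions the value decodes to instance 5, target 15, LUN 0, which can then be compared with the hardware paths in the ioscan output.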

Andy
Albert_31
Trusted Contributor

Re: memory error question

Hello Gary,

I don't find any issue with the memory, as seen from the logs collected:

System start: Thu Nov 13 17:52:25 2003.
Last error check: Wed May 31 11:40:44 2006.
Logging interval: 3600 seconds.
15 address(es) with errors logged by memory logging daemon

If you notice, the last error check was May 31, but the logging interval is still 3600 seconds, which is the default. In situations where there really is a memory fault, you will notice the logging interval decreasing, for example to 15 seconds.
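
If the stm memory information has been saved to a text file, the interval can be pulled out and compared with the default. This is only a sketch: logging_interval and interval_status are made-up names, the 3600-second default is the value mentioned above, and the line format is assumed to match the excerpt quoted in this post.

```sh
# Print the number of seconds from the "Logging interval:" line of a
# saved stm memory log. $1 is the saved file.
logging_interval() {
    sed -n 's/^Logging interval: *\([0-9][0-9]*\) seconds.*/\1/p' "$1"
}

# Say whether the interval is still at the assumed 3600 s default.
interval_status() {
    secs=$(logging_interval "$1")
    if [ -z "$secs" ]; then
        echo "interval not found"
    elif [ "$secs" -lt 3600 ]; then
        echo "interval reduced to ${secs}s"
    else
        echo "interval at default (${secs}s)"
    fi
}
```

For the excerpt above, interval_status on the saved file would report the interval as still being at the default.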

Not every memory error means a h/w fault; it could also be induced by an application.

As Jeff has already confirmed, the PDT entries don't match the addresses that are reporting errors, which implies that none of those memory pages are marked bad. That in itself is an indication that it is safe to leave the machine as it is and clear the PDT at the next available chance.

regards

Albert
Gary Glick
Frequent Advisor

Re: memory error question

Everyone was correct that the memory was not the issue, but I had it replaced anyway.

The original problem was that Openmail appeared to be unresponsive after startup following a power failure, and I was trying to find the cause. As it turns out, leaving it alone overnight while awaiting a support tech resolved the problem.

So it appears that there was no problem to begin with, other than that the system was trying to get caught up with its processing. This seems to have been a recurring theme of my questions lately. :-)

Thanks to everyone for their help

Gary