
Possible memory hw-error

 
john korterman
Honored Contributor

Possible memory hw-error

Hi everyone,

on an ia64 HP server, model rx8620, running B.11.23, we often see Sybase crash.

Sybase suggests that defective hw, especially memory, may be the cause.

However, there are no indications in syslog/dmesg of any hw-problem.

In another thread I saw Bill Hassel recommend the command below, which I executed on the machine:
# echo "selclass qualifier memory;info;wait;infolog" | cstm

and found this in the output:

Memory Error Log Summary

DIMM Location           Error Address       Error Type  Page             Count
----------------------  ------------------  ----------  ---------------  -----
Cab 0 Cell 3 DIMM 1A    0x2063a3380         Single-Bit  0x2063a3         21
Cab 0 Cell 2 DIMM 3B    0xea5e80700         Single-Bit  0xea5e80         3
Cab 0 Cell 0 DIMM ├Ф┬╖   0x81883cc78000a7b8  Multi-Bit   0x81883cc78000a  N/A
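
For anyone who wants to pick the multi-bit entries out of such a summary automatically, here is a minimal Python sketch. It assumes the rows look like the ones above (location, error address, error type, page, count); the names `MemError`, `ROW` and `parse_summary` are made up for illustration and are not part of cstm. The DIMM label of the third row was unreadable in the output, so it is written as `??` in the sample.

```python
import re
from typing import List, NamedTuple, Optional


class MemError(NamedTuple):
    location: str
    address: int
    kind: str
    page: int
    count: Optional[int]  # None when the log shows N/A


# Assumed row layout: "Cab n Cell n DIMM xx  <addr>  <type>  <page>  <count>"
ROW = re.compile(
    r"^\s*(?P<loc>Cab\s+\d+\s+Cell\s+\d+\s+DIMM\s+\S+)\s+"
    r"(?P<addr>0x[0-9a-fA-F]+)\s+"
    r"(?P<kind>Single-Bit|Multi-Bit)\s+"
    r"(?P<page>0x[0-9a-fA-F]+)\s+"
    r"(?P<count>\d+|N/A)\s*$"
)


def parse_summary(text: str) -> List[MemError]:
    """Return one MemError per data row; headers and blank lines are skipped."""
    errors = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue
        count = None if m["count"] == "N/A" else int(m["count"])
        errors.append(
            MemError(
                location=" ".join(m["loc"].split()),  # normalise spacing
                address=int(m["addr"], 16),
                kind=m["kind"],
                page=int(m["page"], 16),
                count=count,
            )
        )
    return errors


# Rows copied from the output above (third DIMM label unreadable, shown as ??)
SAMPLE = """\
Cab 0 Cell 3 DIMM 1A 0x2063a3380 Single-Bit 0x2063a3 21
Cab 0 Cell 2 DIMM 3B 0xea5e80700 Single-Bit 0xea5e80 3
Cab 0 Cell 0 DIMM ?? 0x81883cc78000a7b8 Multi-Bit 0x81883cc78000a N/A
"""

if __name__ == "__main__":
    for err in parse_summary(SAMPLE):
        if err.kind == "Multi-Bit":
            print("multi-bit error at", err.location)
```

In these three rows the Page value is simply the error address shifted right by 12 bits, which a script can use as a sanity check on the parsed values.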


Would you judge the above as hw-errors in need of repairing?

regards,
John K.
it would be nice if you always got a second chance
8 REPLIES 8
Steven E. Protter
Exalted Contributor

Re: Possible memory hw-error

Shalom,

Test memory with the exercise function of cstm, mstm, or xstm.

Also check dmesg output.

With these tools you will find the problem.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
A. Clay Stephenson
Acclaimed Contributor
Solution

Re: Possible memory hw-error

The single-bit errors are no cause for concern because the ECC mechanisms can correct these "on the fly" with no impact on applications or the OS. The multi-bit error is more serious; I would definitely have that DIMM replaced. As long as the single-bit errors are infrequent (~1 per week or so), I wouldn't worry about them. SBEs can actually be caused by things such as background radiation, so it is unrealistic to expect zero SBEs over long periods of time, but they should be very rare.
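
The rule of thumb above can be written down as a small Python sketch. It is only an illustration: the function name `triage`, the result labels and the `days_observed` parameter are assumptions (the cstm summary does not say over what period the counts accumulated), and the default threshold of one single-bit error per week is just the rough figure mentioned here.

```python
from typing import Optional


def triage(kind: str, count: Optional[int], days_observed: float,
           sbe_per_week_ok: float = 1.0) -> str:
    """Classify one memory-log entry as 'replace', 'ok', 'watch' or 'unknown'.

    kind            -- "Single-Bit" or "Multi-Bit", as shown in the log
    count           -- logged error count, or None if the log shows N/A
    days_observed   -- assumed length of the observation window, in days
    sbe_per_week_ok -- tolerated single-bit errors per week (rule of thumb)
    """
    if kind == "Multi-Bit":
        # Multi-bit errors are the serious case: have the DIMM replaced.
        return "replace"
    if count is None or days_observed <= 0:
        # Without a count or a window there is no rate to judge.
        return "unknown"
    per_week = count * 7.0 / days_observed
    return "ok" if per_week <= sbe_per_week_ok else "watch"
```

For example, 3 single-bit errors over an assumed 30 days is about 0.7 per week and comes out as "ok", while 21 over the same window is about 4.9 per week and comes out as "watch".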
If it ain't broke, I can fix that.
tkc
Esteemed Contributor

Re: Possible memory hw-error

Check the file /var/opt/resmon/log/event.log: how frequently are memory events logged?

All memory events are described at:
http://docs.hp.com/en/diag/ems/memory_ia64.htm

Also check the PDT (Page Deallocation Table) and see what percentage of it is in use.

If there are too many SBEs, the system may be slow because it is correcting memory errors too often.
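
To get a feel for how often memory events are logged, something like the Python sketch below could count them per day. Note the assumptions: it supposes that every record in event.log starts with a line of the form `Notification Time: Wed Mar  3 10:15:23 2004` and that memory-related records contain the word "memory" somewhere in their text. Check this against your own log before trusting the numbers. The function name `memory_events_per_day` is made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed record header, e.g. "Notification Time: Wed Mar  3 10:15:23 2004"
STAMP = re.compile(r"^Notification Time:\s*(.+?)\s*$")


def memory_events_per_day(log_text: str) -> Counter:
    """Count records that mention 'memory', grouped by calendar day."""
    counts = Counter()
    day = None        # date of the record currently being read
    counted = False   # True once the current record has been counted
    for line in log_text.splitlines():
        m = STAMP.match(line)
        if m:
            stamp = " ".join(m.group(1).split())  # collapse double spaces
            try:
                day = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y").date()
            except ValueError:
                day = None  # unexpected timestamp format: skip this record
            counted = False
        elif day is not None and not counted and "memory" in line.lower():
            counts[day] += 1
            counted = True
    return counts


if __name__ == "__main__":
    # Path as given above; copy the file elsewhere if Python is not on the server.
    with open("/var/opt/resmon/log/event.log", errors="replace") as fh:
        for day, n in sorted(memory_events_per_day(fh.read()).items()):
            print(day, n)
```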
tkc
Esteemed Contributor

Re: Possible memory hw-error

Oh, and I noticed you have a multi-bit error too. Hasn't the system crashed because of it? Has there been any system panic or reboot that you didn't notice?
tkc
Esteemed Contributor

Re: Possible memory hw-error

If, as I suspect, your system crashed because of the memory multi-bit error, I suggest you collect the MCA files in the /var/tombstones directory, pick the one dated on the day the system crashed, and send it to HP for analysis. HP should be able to tell you the cause of the crash.
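
Picking out the files from the day of a crash can be scripted. The Python sketch below lists files in a directory whose name contains "mca" and whose modification date matches a given day. The name filter is an assumption about how the files are named, and `mca_files_for_day` is a made-up helper name, so adjust both to what you actually see in the directory.

```python
import os
from datetime import date, datetime
from typing import List


def mca_files_for_day(directory: str, day: date) -> List[str]:
    """Return paths of files named like *mca* that were last modified on `day`."""
    hits = []
    for entry in os.scandir(directory):
        # Assumption: MCA dumps have "mca" somewhere in the file name.
        if not entry.is_file() or "mca" not in entry.name.lower():
            continue
        modified = datetime.fromtimestamp(entry.stat().st_mtime).date()
        if modified == day:
            hits.append(entry.path)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical crash date, for illustration only.
    for path in mca_files_for_day("/var/tombstones", date(2006, 5, 10)):
        print(path)
```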
Torsten.
Acclaimed Contributor

Re: Possible memory hw-error

Even a multi-bit error may not lead to a crash, because of the chip spare/chip kill technology in sx1000 systems like yours. This has been improved further in systems using the sx2000 chipset.

Read about it (manual is for different systems, but it applies more or less):

http://h71028.www7.hp.com/ERC/downloads/rx4640_2600_1600_wp_FINAL_4-14-04.pdf?jumpid=reg_R1002_USEN

page 27
"ECC and chip spare memory"

more for sx2000 systems:

http://h71028.www7.hp.com/ERC/downloads/c00767235.pdf?jumpid=reg_R1002_USEN

Anyway, a single bit error may not be a problem. Regarding the multi bit error:
Let HP support analyze your logs. They will help you further.

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

john korterman
Honored Contributor

Re: Possible memory hw-error

Thank you very much for your quick responses!

As mentioned, there are no indications of any hw-problems in syslog or the dmesg output.
And the OS has not crashed a single time, but Sybase has, on numerous occasions, without any suspicious messages in syslog/dmesg.
I shall be making arrangements for having the DIMM with the multi-bit error replaced, in order to be able to rule out hw-errors.
If that does not solve the problem, I will return in another thread!

Thanks again,

John K.
it would be nice if you always got a second chance
john korterman
Honored Contributor

Re: Possible memory hw-error

thanks!
it would be nice if you always got a second chance