
memory single bit errors

 
Chandra Sekhar_5
Occasional Advisor

Hi
We are running an Oracle 9 database on one of our HP servers (rp4440), and recently we have been getting many failures in database operations. Please find the error trace below.
Errors in file /u01/app/oracle/admin/PSFFA/udump/psffa_ora_29571.trc:

ORA-00600: internal error code, arguments: [17114], [0x800003FB800670D0], [], [], [], [], [], []

Thu Feb 2 06:41:02 2006

Errors in file /u01/app/oracle/admin/PSFFA/udump/psffa_ora_29571.trc:

ORA-00600: internal error code, arguments: [17147], [0x800003FB800670D0], [], [], [], [], [], []

Thu Feb 2 06:41:08 2006

Errors in file /u01/app/oracle/admin/PSFFA/udump/psffa_ora_29571.trc:

ORA-00600: internal error code, arguments: [17147], [0x800003FB800670D0], [], [], [], [], [], []

ORA-00600: internal error code, arguments: [17147], [0x800003FB800670D0], [], [], [], [], [], []

Thu Feb 2 06:41:09 2006

Errors in file /u01/app/oracle/admin/PSFFA/udump/psffa_ora_29559.trc:

ORA-00600: internal error code, arguments: [17147], [0x800003FB800673F8], [], [], [], [], [], []

Thu Feb 2 06:41:09 2006

Errors in file /u01/app/oracle/admin/PSFFA/udump/psffa_ora_29565.trc:

ORA-00600: internal error code, arguments: [17147], [0x800003FB00010668], [], [], [], [], [], []

The Oracle support team debugged these errors and told us that a "single-bit memory error" is causing this problem. But our CSTM logs are not showing any memory errors. Is there any other way to find out about memory-related issues?

Thanks in advance,
Chandra
Torsten.
Acclaimed Contributor
Solution

Re: memory single bit errors

Hi,

Run the STM (cstm, mstm, xstm) info tool on the memory item. This will report information about single-bit errors. If you get no info about a single-bit error, you probably have no error!

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

If you feel this was helpful please click the KUDOS! thumb below!   
Ivan Ferreira
Honored Contributor

Re: memory single bit errors

You can also configure the Event Monitoring System (EMS) with monconfig.

The hardware usually has a service processor that can show logs about hardware events.

Single-bit errors can also be caused by inadequate cooling of the system.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
A. Clay Stephenson
Acclaimed Contributor

Re: memory single bit errors

Get a baseball bat (a Cricket bat will do in a pinch) and apply vigorously to the cranial areas of your Oracle support team; if CSTM is not reporting single-bit memory errors then you don't have them; moreover, single-bit errors with error-correcting memory should be invisible to an application --- that's why you have ECC memory; the application wouldn't have a clue that the parity bit even exists.
If it ain't broke, I can fix that.
A. Clay Stephenson
Acclaimed Contributor

Re: memory single bit errors

The good news is that the baseball bat will correct single-bit errors in the Oracle Support Team. Multiple treatments may be required.
If it ain't broke, I can fix that.
Torsten.
Acclaimed Contributor

Re: memory single bit errors

As always - trust A. Clay Stephenson and follow his recommendations!
;-)

Hope this helps!
Regards
Torsten.

Chandra Sekhar_5
Occasional Advisor

Re: memory single bit errors


I agree with Stephenson. I don't think there is any memory issue, as we don't see any event from CSTM. But just to be on the safe side before discussing with the Oracle team, I want to confirm the CSTM version too. I am running CSTM A.47.00. Does this have any limitation in reporting memory errors? Do I need to upgrade?
Torsten.
Acclaimed Contributor

Re: memory single bit errors

A.47.00 is the December 2004 version for HP-UX 11.11; A.49.10 (Sep 05) is current.

See http://docs.hp.com/en/diag/stm/stm_upd.htm

To be sure, you should check your firmware (PDC) level! This is the only important point regarding single-bit errors.

Hope this helps!
Regards
Torsten.

Chandra Sekhar_5
Occasional Advisor

Re: memory single bit errors

Hi
My PDC version is 45.11. Is that fine, or does it require an upgrade?

Tom Danzig
Honored Contributor

Re: memory single bit errors

Chandra,

Clay is not only very funny, but correct. There is no issue with your machine's memory. If you had any single bit memory errors they would be in your syslog (as well as reported by STM) such as:

Feb 3 11:09:28 servername vmunix: LPMC type : SEDC (ECC-corrected single-bit error)
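
A minimal sketch of how one might scan a syslog for lines like the sample above. The pattern is taken only from that sample line, the function names are hypothetical, and the default path is an assumption (the usual HP-UX syslog location). Depending on the diagnostics version, individual corrected errors may not be written to syslog at all, so an empty result here is not conclusive by itself; the STM Memory Info Tool remains the reference.

```python
import re

# Pattern derived from the sample syslog line quoted in this post.
SBE_PATTERN = re.compile(r"LPMC type\s*:\s*SEDC")


def find_sbe_lines(lines):
    """Return the lines that look like ECC-corrected single-bit error reports."""
    return [line.rstrip("\n") for line in lines if SBE_PATTERN.search(line)]


def scan_file(path="/var/adm/syslog/syslog.log"):
    """Scan one syslog file; the default path is an assumption."""
    with open(path, errors="replace") as handle:
        return find_sbe_lines(handle)
```

Calling `scan_file()` on the server returns a (possibly empty) list of matching lines.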

Get a bat ;^)
Andrew Merritt_2
Honored Contributor

Re: memory single bit errors

Hi Chandra,
Just to concur with the above comments, it is extremely unlikely that SBEs are causing the problem with Oracle.

To expand on the OnlineDiags behaviour, and correct a couple of the comments above:

A.47.00 is not the latest version of OnlineDiags, but it is supported, and there aren't any relevant known problems with it. PHSS_33673 is the latest patch for it, and you should have that installed. You can tell by running STM: the version will be shown as A.47.15 (I think).

If there have been SBEs detected, you should see these listed when you run the Memory Info Tool in STM.

What you won't see is events in syslog, or in /var/opt/resmon/log/event.log for individual Single Bit Errors. Since these are normal events, at a low frequency, and since the error correction takes care of them, they no longer generate EMS events. This is because some customers have been unnecessarily alarmed when they see them.

If there is a significant number of SBEs in a short time, then you will see EMS events being generated warning of this and indicating the hardware needs to be replaced.

Andrew
Steven E. Protter
Exalted Contributor

Re: memory single bit errors

ORA-00600:

Oracle knows something is wrong and is not going to work any more. Oracle has NO IDEA what is wrong.

Don't blame the hardware; it's not causing this.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
A. Clay Stephenson
Acclaimed Contributor

Re: memory single bit errors

The whole point is that you were given a completely bogus answer by your Oracle guys and that you should communicate that back to them very strongly. Ask them very simple questions like "Where are your data to support this claim?" Single-bit errors under ECC -- which you have -- are corrected "on the fly" and are completely invisible to an application. A SBE would never make itself known to an application.
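
To illustrate the "corrected on the fly" point, here is a toy sketch using a Hamming(7,4) code. This is an assumption for illustration only: the thread does not say which code the rp4440 memory controller uses, and real server ECC works on much wider words with an extra bit for double-error detection. The function names are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 unused so indices match bit positions
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome); a nonzero syndrome is the corrected position."""
    code = [0] + list(bits)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    if syndrome:
        code[syndrome] ^= 1  # flip the faulty bit back
    data = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(data))
    return nibble, syndrome


stored = hamming74_encode(0b1011)
stored[4] ^= 1  # simulate a single-bit memory fault at position 5
value, fault_position = hamming74_decode(stored)
print(value == 0b1011, fault_position)
```

The reader of the data gets the original value back; only the decoder (the memory hardware, in the real case) knows that a bit was ever wrong.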
If it ain't broke, I can fix that.
Chandra Sekhar_5
Occasional Advisor

Re: memory single bit errors


Please find the Oracle support response below.

It appears the error is a single bit swap and does not line up with a known Oracle bug. Based on this the AR team is asking that hardware diagnostics be run on the server.

- - - -
From the error in trace file psffa_ora_29567.trc

Chunk 800003fb0003f938 sz=1074156928 ERROR, BAD MAGIC NUMBER (800003FAC0065580)
The lower 8 hex digits of (800003FAC0065580), C0065580, indicate the size of the chunk.

The sz=1074156928 in hex is sz=0x40065580.

Looking in the dump of addr=0x800003FB0003F938

Dump of memory from 0x800003FB0003F8F8 to 0x800003FB0003FA38
800003FB0003F8F0 00000000 00000000 [........]
800003FB0003F900 00000000 00000000 00000000 00000000 [................]
Repeat 2 times
800003FB0003F930 00000000 00000000 800003FA C0065580 [..............U.] <---------Here
800003FB0003F940 800003FB 0003F480 40000000 0053C9F8 [........@....S..]
800003FB0003F950 00000FA0 00000000 00000000 00000001 [................]

We see the value is C0065580 instead of 40065580
C in binary is 1100.
4 in binary is 0100.

As you can see, it seems there is a one bit failure. Instead of 4 we have C.
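
The arithmetic in the analysis above can be checked with a short sketch. The two values come from the quoted trace; the function name is hypothetical. Note that this only confirms the values differ by one bit; it says nothing about what flipped it.

```python
def differing_bits(expected, observed):
    """Return the bit positions (0 = least significant) where two values differ."""
    diff = expected ^ observed
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


expected = 1074156928   # sz reported in the trace
observed = 0xC0065580   # lower 8 hex digits found in the chunk header

print(hex(expected))                       # 0x40065580
print(differing_bits(expected, observed))  # [31]
```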
A. Clay Stephenson
Acclaimed Contributor

Re: memory single bit errors

I agree that the data differs by one bit. That does not imply that the hardware is responsible; it only means that the data have changed by one bit over some time interval --- probably by a software instruction --- even if unintentional. However, if this were memory induced, ECC would correct this SBE "on the fly" and the dump would never see it. Moreover, a message would be sent to syslog indicating that the error was detected and corrected.

Run the diagnostics on your box and report the null results to the Oracle guys and then have them find the real problem.
If it ain't broke, I can fix that.
Andrew Merritt_2
Honored Contributor

Re: memory single bit errors

It's a big jump from saying the data is not what they expect to see to saying it must be a hardware problem; there are all sorts of software reasons that could cause this.

If the STM memory tool is not showing any errors, then I think you are entitled to tell Oracle that you have checked the hardware and that the ball is back in their court to do some serious troubleshooting rather than passing the buck.

Do you have the latest patches applied for Oracle?

What settings do you have for the Oracle parameters "db_block_checking" and "db_block_checksum"? Setting these to 'true' has been seen to reduce the occurrence of these errors with Oracle 8 and Superdome systems in the past.

Andrew
Andrew Merritt_2
Honored Contributor

Re: memory single bit errors

> Moreover, a message would be sent to syslog
> indicating that the error was detected and
> corrected.

Unless I'm getting confused with IA behaviour, I don't think that part is true. Individual SBEs shouldn't now be logged to syslog, nor to event.log, though you should see them in the logtool output in STM, and the corresponding entries should be present in the PDT, viewable with the STM Memory Info tool.

Andrew
HGN
Honored Contributor

Re: memory single bit errors

Hi

With so many people helping on this issue, if it is resolved I think you should assign points to the people who have been spending their valuable time helping you out. You have not assigned a single point to anyone.

Rgds

HGN
Chandra Sekhar_5
Occasional Advisor

Re: memory single bit errors


Now the ball is in Oracle's court. They have eliminated physical memory as the cause.

Thanks a lot for your quick suggestions, and
sorry that I didn't give points on my last questions. This group is really amazing, and thanks to each individual who has provided input on this. I will definitely provide points to each responder.

Thanks again.
-Chandra