Re: ORA-01578: ORACLE data block corrupted

Ralph Grothe · ‎10-19-2004

Hello,

DBAs catch the Oracle error given in the subject line when performing certain tasks.
They insist that this error relates to some HW defect.
However, I cannot identify such.
As they obtain another Oracle error which generously points to the assumed corrupt file,
(viz. ORA-01110: data file 1: '/export/oracle/dbs1/inwb01/sys/sys01inwb01.dbf')
I checked the volume for stale extends,
but everything looks ok.

Neither do stale LEs appear

# lvdisplay -v $(vgdisplay -v vg02|awk '/LV Name/{print$NF}')|grep -ci stale
0

nor stale PEs

# pvdisplay -v $(vgdisplay -v vg02|awk '/PV Name/{print$3}')|grep -i stale
Stale PE 0
Stale PE 0
Stale PE 0
Stale PE 0

The disks come from a A6218A HP RAID Box.
Unfortunately, I'm not too familiar with the plethora of admin commands for this thing.
From "man armdsp" I discovered that it supplies an -a switch which shows the state of fairly all components and ingredients.

When I perform this statement

armdsp -a $(armdsp -i $(uname -n)|awk '/nique/{print$NF}')

I get what I attached as an upload (sorry, for the waste of bandwidth)
I'm not knowledgable enough to understand all of that, but the gist looks pretty ok to me (especially the seperate per disk lines)

Is there anything else I could check to dismiss a HW problem?

Is a filesystem check sensible (in view of my other evidence), and can it only be performed on an unmounted filesystem, although this box is MCOE (i.e. contains stuff like OnlineJFS)?

Rgds.
Ralph

Madness, thy name is system administration

Sanjay_6 · ‎10-19-2004

Hi Ralph,

If this is different from your other post, then check the syslog and the EMS log to see if any hardware error is logged over there.

If this is on the same system for which you have another post, that problem could be the cause of this error.

Hope this helps.

regds

Simon Hargrave · ‎10-19-2004

Do you have EMS running? If so this should have highlighted any hardware issues. Check /var/opt/resmon/log/event.log. There should also be ESM messages in syslog.log.

Check also dmesg output for SCSI errors, etc.

You can also run stm and get an Information on the disk in question. This should show you if there were any disk errors recorded. (select the disk in question, tools menu, information, run information). You'll see read errors/ write errors etc here.

If not, then try (at a quiet time) this simple test: -

dd if=/export/oracle/dbs1/inwb01/sys/sys01inwb01.dbf of=/dev/null

this will read through the whole file, and bin the results. The point is, if this succeeds, the disk itself isn't faulty and the problem lies with the data itself.

Ralph Grothe · ‎10-19-2004

EMS agents are running.
I checked with the (c) option in
/etc/opt/resmon/lbin/monconfig

The /var/opt/resmon/logs/event.log
has as its latest record an entry from Mar 7 2004.

No, this is totally unrelated to my other thread.
In fact these boxes are fairly recent rp7410 clustered irons, well equipped, opposed to the K-class case in the other thread.

Yes, I usually read from the disk device to /dev/null to verify that at least can be read from a suspicious disk.

Madness, thy name is system administration

Volker Borowski · ‎10-19-2004

Hi Ralph,

if the databse runs archivelog mode, I'd first recommend to not shut down this database until this one is fixed.
Stay with online backups!
Stop overwriting tapes which contain old redologs and backups at once !

Second: is this DB a copy of another one ?
If yes, did you do an online copy or an offline copy from a shutdown aborted source database ?
In this case there is a good chance that the 1578 is related to some recovery problems for segments that have been build with the nologging clause -> Details @ Metalink.

Let the DBA do a dbverify check on the file in charge. It should give you some additional info about the error.
If it does not find any error, we need to check how this error is reproduced.

Restore the datafile in charge from a backup to a seperate location and do a dbverify check on the restored file.
If the error is not in the restored file, and you have all redologs since then, you are able to recover this.
If the corruption is even in the oldest backed up file from where you have redologs until now you are in trouble.

Next step would be to try to export the database (if not too big).

What Oracle release are you on ?

Regards
Volker

Steven E. Protter · ‎10-19-2004

Hi Ralph,

Nice to see you, sorry you are having trouble.

Unfortuneately, you may still have had a hardware problem.

Fire up xstm csm or mstm and run disk excersize on all involved disks. Even it its a LUN presented from a disk array, it needsd to test out properly.

We've had some glitches with our disk array itself, that corruped that data in oracle. We were forced to restore. Ours involved a firmware upgrade disrupting disk operation. The disks tested out fine after the disruption but the data on them was totally useless.

If a momentary disruption of your disk array occurred, there may be no record of it. dmesg or /var/adm/syslog/syslog.log may or may not show something. The disk array log might be a good place to look.

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Volker Borowski · ‎10-19-2004

... forgot: The "bible" note for blockcorruption in Oracle is Metalink-Note:28814.1

V.

Ralph Grothe · ‎10-19-2004

Things are getting worse.
Clients now complain about bad performance.
Having looked with glance at it I can see that the servers are instead even underloaded.
I.e. no CPU usage, very little disk I/O, low network packet rates.
When I look at the wait stats of certain Oracle procs I can see that they are blocked on STRMS.
Looks like the Oracle ICP is impaired.
All four listeners on both cluster nodes are running.
If however, the bad data block that Oracle reports, refers to a table in the data directory that is responsible for keeping persistance of Oracle/SQLNet IPC it sounds plausible to me.
I also called for help from network and firewall admins to look at their hops if something crashed, resetted, overflew etc.
because the NIC settings (modes link speeds etc.) and routing look ok.

Volker,

how can I query the instance if it is running in archive logging mode?
I cannot identify the current parameter file.
There are too many, looks DBAs have done quite a bit of parameter testing.
Albeit, I can connect to an instance (n.b. sqlplus tells it's release 9.2.0.4.0) /as sysdba.
But I don't know the view from the data directory that I have to query.
Could you tell me (e.g. like v$instance)?
Btw. if I connect locally to an instance does Oracle use a Unix socket for IPC (i.e. do I require the listener to be running?)?
I will pass your valuable suggestions to our DBA, but I cannot reach her today anymore.
So please stay tuned to this thread.
DB backup/recovery seems a binary business, as your final statement suggests.
Either got everything, or nothing.

SEP,

first thing I tried was to look at the disks through stm.
But in the map there I cannot see the single disks of the RAID box.
Doesn't look like stm supports it.
Apart from that I can only run informational requests, since I lack knowledge of passwords for execrcising or other tasks.

Madness, thy name is system administration

Ralph Grothe · ‎10-19-2004

Volker,

what is a Metalink in Oracle lingo?
Anything comparable to a Unix (hard)link?

Madness, thy name is system administration

Sanjay_6 · ‎10-19-2004

Hi Ralph,

Metalink for oracle is like ITRC is for hp.

Hope this helps.

Regds

Jean-Luc Oudart · ‎10-19-2004

Ralph,

to check if the database is in archive mode, check the v$database view :
select * from v$database;
check LOG_MODE

metalink is the Oracle support site to search for problem, report problems and also
forums.

Regards
Jean-Luc

fiat lux

Brian Crabtree · ‎10-19-2004

Ralph,

Ok, first, what version of Oracle are we running?

Second, have the DBA's used the Oracle data verification tool ($ORACLE_HOME/bin/dbv) on this datafile, and all the other datafiles.

Corruption in the SYSTEM tablespace is very bad, as it will almost guarentee that backups are invalid and will not be usable.

The best/only option at this point is to have the DBA's do a full export of the database, destroy everything, recreate it, and reload the database.

Now, to answer your thoughts. This does not necessarily mean that it is a hardware failure. Depending on the version of Oracle, and what is happening, it is possible that it is logical corruption of the database, rather than physical corruption. Checking all of the files on the database for corruption is the easiest way to test for this, as physical corruption tends to spread (short of an intermittent RAID failure, which is a whole other way we can go).

Unless this is test/development, I would work on getting the database put into a stable place first, and move from there.

Let me know if you have any questions.

Thanks, Brian

Jean-Luc Oudart · ‎10-19-2004

Ralph,
to obtain oracle product version :
select * from v$version

Regards
Jean-Luc

fiat lux

Ralph Grothe · ‎10-19-2004

Hello again,

there are seven Oracle instances running per cluster node (two-node active/active MC/SG cluster).
The servers are, though furnished with 6 gigs of RAM, a bit under memory pressure right now.
vmstat as well as glance report intensive paging activity @ 100% mem usage and less than 100mb free memory.
However, in glance yet I cannot detect actual swap outs (aka deactivations).
I don't know how big the so called SGA (is it shared mem segs + mmapped pages?) per instance is.
But there also must exist a data dict. view that yields this entity.
I think if I use the ipcs command I only get segments that were claimed by procs through the shmget() syscall, but it won't display those pages allocated through mmap() syscalls, right?

# ipcs -mob|awk '$5=="ora920"{s+=$NF};END{printf"%10.2f MB\n",s/2**20}'
1514.28 MB

Unfortunately the manpage of ipcs doesn't reveal if the SEGSZ value is given in bytes.
I guess one would have to refer to some system header file to make sure.

The instance seem to be running in archive logging mode.

$ sqlplus

SQL*Plus: Release 9.2.0.4.0 - Production on Mi Okt 20 11:44:19 2004

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

Enter user-name: /as sysdba

Connected to:
Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production
JServer Release 9.2.0.4.0 - Production

SQL> select log_mode from v$database;

LOG_MODE
------------
ARCHIVELOG

The broken data block problem I've been told was fixed meanwhile by the DBAs.

So there indeed doesn't seem to be a latent lurking HW defect.

Thus I gather that all performance and latency issues reported now from clients relate to shortage of memory rather than anything else.
I advised them to either get rid of a couple of processes (something they never would aggree to), or buy more RAM.

The highwater mark of device swap used since its initialization reads some 1.4 gigs, which comes close to the shm segs I summed up above.
So I guess many pages of it not being used were paged out to it once in a while.

# swapinfo -tam
Mb Mb Mb PCT START/ Mb
TYPE AVAIL USED FREE USED LIMIT RESERVE PRI NAME
dev 4096 1481 2615 36% 0 - 1 /dev/vg00/lvol2
dev 6664 0 6664 0% 0 - 2 /dev/vg00/lvol11
reserve - 6581 -6581
memory 4757 1007 3750 21%
total 15517 9069 6448 58% - 0 -

Do you have better suggestions what could be done to "increase performance" with the HW given (e.g. what not vital processes could be stopped)

Madness, thy name is system administration

Ralph Grothe · ‎10-19-2004

Oops, forgot to supply the Oracle banner (for the version)

$ echo "select banner from v\$version;"|ORACLE_SID=inwb01 sqlplus /as\ sysdba

SQL*Plus: Release 9.2.0.4.0 - Production on Mi Okt 20 12:21:36 2004

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

Connected to:
Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production
JServer Release 9.2.0.4.0 - Production

SQL>
BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production
PL/SQL Release 9.2.0.4.0 - Production
CORE 9.2.0.3.0 Production
TNS for HPUX: Version 9.2.0.4.0 - Production
NLSRTL Version 9.2.0.4.0 - Production

SQL> Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production
JServer Release 9.2.0.4.0 - Production

Madness, thy name is system administration

Yogeeraj_1 · ‎10-19-2004

hi ralph,

You will need to take your last backup of system01.dbf, restore it and roll it forward. Hopefully you are in archivelog mode and have been doing backups.

We can easily recover from this -- assumming you have backups. In general, in order to perform a block level recovery YOU MUST have used rman and be using rman, otherwise this feature is not available.

normally, if you cannot correct it via standard recovery techniques (from backups) please contact support for assistance.

let us know more about how it goes on.

regards
Yogeeraj

No person was ever honoured for what he received. Honour has been the reward for what he gave (clavin coolidge)

Volker Borowski · ‎10-20-2004

Hi Ralph,

well Metalink is THE information/software/support - source for Oracle. You surely should ask your DBAs to get an account. You can check out notes and guides for OS-specific implementation issues liek kernel-parameters, patches and else.

OK, the block-corruption is fixed !
Did they tell you how ?
Did they do export/import ?
Did the re-create indexes/temp-tables to do the recovery of "nologging"-actions ?

Both actions could have fixed the block-corruption but also did delete the statistics on all/some indexes involved.
May be the lack of performance now comes from the fact the the newly rebuild indexes have no statistics now, which would be a good reason, why the fix of the (assumed) index-corruption goes with decrease of performance.

Ask the DBAs if they recalculated the statistcs.

In addition, ask them, if in your database there is DB-activity with NOLOGGING involved ! In this case, even a clusterswitch may lead to those 1578s, because if the instance crashes or goes down with a shutdown abort. In this case, a 1578 on one of these objects is extremly likely.
You may need to adjust your cluster environment in this case.

Option one: Stop using NOLOGGING -> this is costy in terms of performance

Option two: Check your startup-procedures in case of a cluster-switch. It might not be sufficiant just to startup the database, but may be you need to run additional scripts to clear the objects created with nologging, before your application can continue to be productive.

Hope this helps
Volker

Brian Crabtree · ‎10-20-2004

Ralph,

Time to get new DBA's. :)

Ok, ways to get information on the SGA.

select * from v$sga;
select name,value from v$parameter where name in ('db_cache_size','pga_aggregate_target','sort_area_size','shared_pool_size','sga_max_size','processes','db_block_size','db_block_buffers');

Ok, this will give you a quick run-down of what is going on with the system. If they are using PGA_AGGREGATE_TARGET (non-zero number above) then you can add that to the SGA_MAX_SIZE and get a rough estimate of how much memory is being used on each database. If they are not using PGA_AGGREGATE_TARGET, then a really rough estimate of memory can be gotten by multiplying SORT_AREA_SIZE by PROCESSES.

You should also check the dbc_max_pct and dbc_min_pct UX kernel parameters. These should be set around 5-7 or so for a straight DB server, as oracle does its own buffering.

Again (can't stress this enough), the database with the corrupted SYSTEM datafile needs to be fixed first, before trying to tune performance. I would be suprised if the database was working with that, let alone performance problems with it.

Let me know if you have any questions.

Thanks, Brian

Mel_12 · ‎10-20-2004

It appears that Ralph is playing DBA here. Unless this instance is not production by nature, I strongly suggest the company get hold of a DBA. You don't know what Metalink stands for, you sound like you have not heard about SGA and perhaps only mundane knowledge of shared pool and RMAN.

Threads like this are better served if the person in need is encouraged to get their DBA to respond to questions. Specialized knowledge areas should be respected and not toyed with. There are unemployed DBAs out there.....

Regards,
Mel

Ralph Grothe · ‎10-20-2004

Hi all,

many thanks for your invaluable hints how to cope with Oracle inconsitancies.
Because, as Mel rightly stressed, I lack knowledge of even Oracle basics I asked our DBAs to have a look at this thread.
I cannot tell how they fixed things, and if everything state of the art that should be done in such circumstances has been excercised.
I however feel an urgent need to undergo Oracle trainig myself because ever too often DB related problems are burdened upon us Unix sysadmins from our clients.
Be it that our customers' DBAs lack detailed knowledge, be it they aren't available when disaster strikes, or be it they are simply reluctant to get their hands dirty or to take up responsibility.
The involved systems are productive and no playground at all.
And if something is getting screwed up I bet they will point at us in search for the culprit, although their sites indeed employ DBAs.
I find this all very unsatisfactory but we have to face reality, and educate ourselves.
That's why I'm so grateful for the existence of such a forum that experts like you fuel.
There is however a big drawback when you voluntarily take over tasks of "deviating trades" that it sticks with you in the end.
But only the curses (viz. expectations, demands, responsibility), and not the blessings (viz. acknowledgement, payment).

Madness, thy name is system administration

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Re: ORA-01578: ORACLE data block corrupted

ORA-01578: ORACLE data block corrupted