topic Re: OS vs Oracle on failing drive in Operating System - HP-UX

OS vs Oracle on failing drive

Elena Leontieva — Mon, 25 Oct 2004 08:22:01 GMT

Hi,

I just want to share this story and get your comments.

A couple of days ago, a production Oracle DB halted twice with the following errors in alert.log:
ARC0: Beginning to archive log# 5 seq# 47
ARC0: Failed to archive log# 5 seq# 47
Thu Oct 21 12:50:24 2004
Log corruption near block 224 change time
All Archive destinations made inactive
ARC1: Failed to archive log# 5 seq# 47
ARCH: Archival stopped, error occurred. Will continue retrying
Thu Oct 21 12:50:24 2004
ORACLE Instance PLIN - Archival Error
ARCH: Connecting to console port...
Thu Oct 21 12:50:24 2004
ORA-16038: log 5 sequence# 47 cannot be archived
ORA-00354: corrupt redo log block header
ORA-00312: online log 5 thread 1: '/u03/data/oradata/PLIN/log5.log'

Basically, the redo logs became corrupt. This kind of error pointed to hardware, i.e. a bad disk drive, so the DBA moved all redo logs off the local drives and onto SAN storage.

The local drives were in vg01: 8 drives, 4 of them mirrored onto the other 4. We ran diskinfo and dd on all drives - no errors at all. We did not see any errors in syslog.log either.
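For anyone repeating that check, here is a minimal sketch of what a dd read pass amounts to, written in Python on the assumption that an interpreter is available on the host (dd itself does the same job). The function name `read_scan` and the block size are made up for illustration. The point it shows is that a sequential read only catches errors the OS actually reports; a block that reads back cleanly but holds wrong data passes unnoticed.

```python
def read_scan(path, block_size=1024 * 1024):
    """Read `path` from start to end, roughly like `dd if=path of=/dev/null`.

    Returns (bytes_read, error). `error` is None when the whole file or
    device was read without the OS reporting a problem, otherwise the
    OSError raised by the failing read.
    """
    bytes_read = 0
    # buffering=0 gives raw, unbuffered reads straight from the file object
    with open(path, "rb", buffering=0) as dev:
        while True:
            try:
                chunk = dev.read(block_size)
            except OSError as exc:
                # The OS reported an I/O error at this offset
                return bytes_read, exc
            if not chunk:
                # End of file or device reached with no reported error
                return bytes_read, None
            bytes_read += len(chunk)
```

Usage would be something like `read_scan("/dev/rdsk/c1t2d0")`, where the device path is a hypothetical example. A result of `(size, None)` only says the reads succeeded, not that the contents are correct.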

Over the weekend, when rebooting this server (a K460 running HP-UX 11.00), I found that one drive had failed, and it was replaced.

Now I feel kind of uneasy... It looks like Oracle detected a bad drive before the OS started to report errors. Moreover, even though the drive was mirrored, it really did not help at all! For some reason one failing drive in a mirrored pair caused a production problem. Is there anything I can do about this other than move data off the local drives?

Your opinions are appreciated.
Elena.

Re: OS vs Oracle on failing drive

Steven E. Protter — Mon, 25 Oct 2004 08:32:10 GMT

I've had this happen to me in the past.

1) Shut down the database and get a gold backup.
2) Use cstm, mstm or xstm (X Windows) and run the exercise command on every disk in the system.
3) Check dmesg or view /var/adm/syslog/syslog.log

If you find a bad disk, arrange a replacement.

These systems generally have large numbers of small disks. It's unlikely, though entirely possible, that Oracle and the boot disk are on the same disk.

I hope you have make_tape_recovery tapes handy.

It's always a good idea to keep vg00 separate from your Oracle data.

SEP

Re: OS vs Oracle on failing drive

Prashant Zanwar_4 — Mon, 25 Oct 2004 08:37:08 GMT

Hi,

In my opinion, the OS should be patched to the latest level, and the latest diagnostics and monitoring tools should be installed.

Besides this, the redo logs should have multiple copies on the server; I believe there should be at least 3 copies of the latest logs. If possible, keep a contingency copy as well. You should also run a script that copies only the logs from some time back.

Hope this helps..

Prashant

Re: OS vs Oracle on failing drive

Hein van den Heuvel — Mon, 25 Oct 2004 09:02:58 GMT

>> even though the drive was mirrored, it really did not help at all! For some reason one failing drive in a mirrored pair caused a production problem

Well, Oracle does do basic sanity checking on the data. Thus it can report data problems even when no I/O errors were returned.
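To make that concrete, here is a minimal sketch of application-level block checking. It is not Oracle's actual redo block format; the 4-byte CRC32 header and the names `pack_block` and `block_is_valid` are assumptions chosen purely for illustration. It shows how a reader can flag a damaged block although the read call itself succeeded.

```python
import struct
import zlib

# Hypothetical layout: 4-byte big-endian CRC32 of the payload, then the payload
HEADER = struct.Struct(">I")

def pack_block(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so a reader can verify it later."""
    return HEADER.pack(zlib.crc32(payload)) + payload

def block_is_valid(block: bytes) -> bool:
    """Return True only if the stored CRC32 matches the payload as read."""
    if len(block) < HEADER.size:
        return False
    (stored,) = HEADER.unpack_from(block)
    return stored == zlib.crc32(block[HEADER.size:])
```

A block that comes back from disk with even one flipped bit fails `block_is_valid`, and that is the kind of condition an application can report while the OS logs nothing.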

The mirroring may actually hinder finding a problem. Just imagine the HBA or cable injects bad bits on the write to one of the members, or one of the members does not faithfully write through. Now it is pot luck as to whether you see good data or bad data. You may be reading from a good disk most of the time, but under heavier load you may get data from the other member, over a problem path.
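A toy simulation of that scenario, as a sketch only: the `MirrorPair` class is hypothetical, and the strict round-robin read is an assumption made to keep the example deterministic, not a statement about how LVM schedules mirror reads. One member silently stores a flipped bit, no error is ever raised, and whether a read returns good or bad data depends only on which member serves it.

```python
class MirrorPair:
    """Toy two-way mirror: writes go to both members, reads come from one."""

    def __init__(self, corrupt_member=None):
        self.members = [{}, {}]               # block number -> bytes, per member
        self.corrupt_member = corrupt_member  # index of the member with a bad write path
        self._next = 0                        # which member serves the next read

    def write(self, block_no, data):
        for idx, member in enumerate(self.members):
            stored = data
            if idx == self.corrupt_member and data:
                # One bit is flipped on the way to this member; nothing is reported
                stored = bytes([data[0] ^ 0x01]) + data[1:]
            member[block_no] = stored

    def read(self, block_no):
        # Alternate between members, standing in for load-dependent read selection
        idx = self._next
        self._next = (self._next + 1) % 2
        return self.members[idx][block_no]

    def mismatched_blocks(self):
        """Block numbers where the two members disagree."""
        a, b = self.members
        return sorted(n for n in a if a[n] != b[n])
```

With `MirrorPair(corrupt_member=1)`, consecutive reads of the same block alternate between good and bad data, and only a member-by-member comparison such as `mismatched_blocks()` shows that the copies have diverged.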

>> Besides this, the redo logs should have multiple copies on the server; I believe there should be at least 3 copies of the latest logs. If possible, keep a contingency copy as well. You should also run a script that copies only the logs from some time back.

So you have multiple redo groups in Oracle with multiple members within each group, each of those members being LVM mirrored (for 4+ data copies). Here I would have expected Oracle to do the right thing when one member delivers doubtful data.

fwiw,
Hein.