Re: Oracle/RDB page corruption. Complete restore required.

Thomas Ritter · ‎07-19-2006

We had to restore the production Oracle Rdb database because of the following error. Incremental recoveries failed, so a complete delete and restore was performed. In my 10 years as system manager at this site, this is the third time such a problem has occurred: first in 1996, then in 2002, and now in 2006.

From OPCOM we have

%%%%%%%%%%% OPCOM 19-JUL-2006 16:02:06.15 %%%%%%%%%%%
Message from user API_PROD on WIZ22
Oracle Rdb V7.1-441 Event Notification for Database
DSA30:[WIZ_CMPRD.DATA.RDB30]WIZARD_DATA.RDB;1
Requested page 15:330926, received page 0:0; retrying disk read

%%%%%%%%%%% OPCOM 19-JUL-2006 16:02:06.16 %%%%%%%%%%%
Message from user API_PROD on WIZ22
Oracle Rdb V7.1-441 Event Notification for Database
DSA30:[WIZ_CMPRD.DATA.RDB30]WIZARD_DATA.RDB;1
Process 2BE11FE2 generating bugcheck dump file
DISK071:[RDMBUGCHK]RDSBUGCHK.DMP;
Exception at 13792BD4 : PIOFETCH$VALIDATE_PAGE + 000003C4
%RDMS-F-CANTREADDBS, error reading pages 15:330926-330926
-RDMS-F-BADPAGRED, read requesting physical page 15:330926 returned page 0:0

Output from one of the bugcheck dumps indicating page corruption

Alpha OpenVMS 7.3-2
Oracle Rdb Server 7.1.4.4.1
Got a RDSBUGCHK.DMP
RDMS-F-CANTREADDBS, error reading pages 15:330926-330926
RDMS-F-BADPAGRED, read requesting physical page 15:330926 returned page 0:0
Exception occurred at PIOFETCH$VALIDATE_PAGE + 000003C4
Called from PIOFETCH$FETCH + 00000AD4
Called from PIO$FETCH + 00000904
Called from PIO$UPDATE_FIB + 0000029C
Bugcheck accessing storage area CUSTOMER_DESCRIP_SA, area id 15
TSNBLK COMMIT_TSN higher than next TSN (0:429813568)
Line TSN higher than next TSN (0:429813568)
Running image JAVA$JAVA.EXE
Dump created: 19-JUL-2006 16:41:38.09
Database root: WIZ_CMPRD_DB:[000000]WIZARD_DATA
This bugcheck may have been caused by a corrupt TSN block.
The database should be verified to check for such corruption.
Suggested command: RMU/VERIFY WIZ_CMPRD_DB:[000000]WIZARD_DATA

Output from verify command
%RMU-W-SPAMFRELN, area CUSTOMER_DESCRIP_SA, page 330926
error in space management page's free space length
expected: 2996, found: 0
%RMU-W-PAGERRORS, 3 page errors encountered
2 page header format errors
0 page tail format errors
0 area bitmap format errors
0 area inventory format errors
0 line index format errors
0 segment format errors
1 space management page format error
0 differences in space management of data pages
%RMU-I-ESGPGLARE, completed verification of WIZ_CUSTOMER_DESCRIP logical area
as part of CUSTOMER_DESCRIP_SA storage area
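The verify summary above can be checked mechanically when many areas are verified at once. The following is a minimal Python sketch, assuming the summary always has the shape shown in this post (a total line followed by "count description" lines); the function name parse_page_errors is made up for illustration and is not part of RMU.

```python
import re

def parse_page_errors(verify_text):
    """Return (total, {category: count}) from an RMU-W-PAGERRORS summary block."""
    total = None
    breakdown = {}
    for raw in verify_text.splitlines():
        line = raw.strip()
        m = re.search(r"(\d+) page errors encountered", line)
        if m:
            total = int(m.group(1))
            continue
        # Category lines look like "<count> <description>" and follow the total.
        m = re.fullmatch(r"(\d+) (.+)", line)
        if total is not None and m:
            breakdown[m.group(2)] = int(m.group(1))
    return total, breakdown

# Summary lines copied from the verify output quoted in this thread.
sample = """\
%RMU-W-PAGERRORS, 3 page errors encountered
2 page header format errors
0 page tail format errors
0 area bitmap format errors
0 area inventory format errors
0 line index format errors
0 segment format errors
1 space management page format error
0 differences in space management of data pages
"""

total, breakdown = parse_page_errors(sample)
nonzero = {name: count for name, count in breakdown.items() if count}
print(total, nonzero)
```

For the output quoted here, the non-zero categories (2 page header errors and 1 space management page error) add up to the reported total of 3.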

Output from rmu/show corrupt

WIZ22_CMPRD $ rmu/show corrupt wizard_data
*------------------------------------------------------------------------------
* Oracle Rdb V7.1-441 19-JUL-2006 17:56:29.62
*
* Dump of Corrupt Page Table
* Database: WIZ_CMPRD_DB:[000000]WIZARD_DATA.RDB;
*
*------------------------------------------------------------------------------
Entries for storage area CUSTOMER_DESCRIP_SA
--------------------------------------------
Page 330926
- AIJ recovery sequence number is -1
- Live area ID number is 15
- Consistency transaction sequence number is 0:0
- State of page is: corrupt
*------------------------------------------------------------------------------
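For sites with several databases, the corrupt page table dump above could be reduced to a short list of (area, page, state) entries. This is only a sketch in Python, assuming the dump text keeps the layout shown in this post; parse_corrupt_pages is a hypothetical helper name, not an RMU feature.

```python
import re

def parse_corrupt_pages(cpt_text):
    """Return a list of (storage_area, page_number, state) from the dump text."""
    entries = []
    area = None
    page = None
    for raw in cpt_text.splitlines():
        line = raw.strip()
        m = re.match(r"Entries for storage area (\S+)", line)
        if m:
            area = m.group(1)
            continue
        m = re.match(r"Page (\d+)$", line)
        if m:
            page = int(m.group(1))
            continue
        m = re.match(r"- State of page is: (\w+)", line)
        if m and area is not None and page is not None:
            entries.append((area, page, m.group(1)))
            page = None
    return entries

# Lines copied from the rmu/show corrupt output quoted in this thread.
sample = """\
Entries for storage area CUSTOMER_DESCRIP_SA
--------------------------------------------
Page 330926
- AIJ recovery sequence number is -1
- Live area ID number is 15
- Consistency transaction sequence number is 0:0
- State of page is: corrupt
"""

print(parse_corrupt_pages(sample))
# prints [('CUSTOMER_DESCRIP_SA', 330926, 'corrupt')]
```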

A page in one of the storage areas was "corrupted". An initial diagnosis using RMU/VERIFY did not indicate a problem. We closed and reopened the database, and RMU/VERIFY then reported a corrupt storage area.

Oracle's general response is that this problem was caused by hardware: memory, disk, or both.

A patched version of Oracle/RDB was installed this week.

$ rmu/show ver
Executing RMU for Oracle Rdb V7.1-441

The previous version reported: Executing RMU for Oracle Rdb V7.1-401

We are running VMS 7.3-2.

The initial Oracle response shifts the onus onto the system managers to prove that hardware was not the underlying cause. No hardware exceptions have been raised. Our production environments host 9 Oracle Rdb databases. Only one was broken.

What's your view?

John Gillings · ‎07-19-2006

Thomas,

A disk error on a shadow set? How many members? If it's two or more members, a disk error seems unlikely. Are there any error log entries from the time window? I'd expect something for any type of memory or disk hardware error.

I don't see any condition codes for the read error. Perhaps there's something in the bugcheck dump file?

A crucible of informative mistakes

Volker Halle · ‎07-19-2006

Thomas,

in a case like this, I would like to see an OpenVMS DUMP output of the database page(s) involved - but it may be too late to ask for that ;-(

That would allow you to 'see' the on-disk contents and may allow you to spot any unusual patterns. It should at least allow Oracle to diagnose the extent of the 'corruption' (just a byte/word/longword etc.). From the extent of the corruption, you could then speculate about possible reasons for this to have happened.
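To make the "extent of the corruption" idea concrete, here is a minimal Python sketch that compares a suspect page image with a known-good copy and reports each differing run by offset and size. Everything in it is an assumption for illustration: the two images are simply byte strings of equal length (for example taken from DUMP output of the live file and of a restored copy), and the 512-byte test data is synthetic, not real Rdb page content.

```python
def diff_runs(good, bad):
    """Return a list of (offset, length) runs where two equal-sized images differ."""
    if len(good) != len(bad):
        raise ValueError("page images must be the same size")
    runs = []
    start = None
    for i, (a, b) in enumerate(zip(good, bad)):
        if a != b:
            if start is None:
                start = i          # a new differing run begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # run extends to the end of the image
        runs.append((start, len(good) - start))
    return runs

def describe(length):
    """Name a run length using the usual VMS data sizes."""
    names = {1: "byte", 2: "word", 4: "longword", 8: "quadword"}
    return names.get(length, f"{length} bytes")

# Synthetic 512-byte "good" image and a damaged copy (hypothetical data).
good = bytes(range(16)) * 32
bad = bytearray(good)
bad[8:12] = b"\xff\xff\xff\xff"    # one damaged longword
bad[100] ^= 0x01                   # one flipped bit in a single byte

for offset, length in diff_runs(good, bytes(bad)):
    print(f"offset {offset}: {describe(length)}")
# prints "offset 8: longword" then "offset 100: byte"
```

A single flipped bit, a damaged longword, and a whole zeroed block each point toward different suspects, which is the kind of speculation the dump would support.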

Volker.

Thomas Ritter · ‎07-19-2006

All disks are two-member shadow sets. errlog.sys on all nodes looks good.
Rdb was patched on the morning of the 18th, from V7.1-401 to V7.1-441.

Jean-François Piéronne · ‎07-19-2006

I have seen this problem a few times.

Sometimes, on old versions of Rdb, this was a memory-only corruption in the global buffer structure, but that problem has been fixed for a long time. The workaround was simply to close and reopen the database.

I have also encountered this problem at another site, where the culprit was the firmware of some disks... But each time, doing a rmu/restore/just and then a rmu/recover/just fixed the problem, and this can be done online :-)
I strongly suggest you add journals (AIJ) to your database.

You may also check revision of firmware of your disks and controllers.

JF

Thomas Ritter · ‎07-19-2006

Jean, we tried online recovery, but it failed. Rdb reported a warning. The full restore was our last resort. Restoring from tape has its own risks. All of our disks are shadowed. I just hope the patch is not the problem.

Jean-François Piéronne · ‎07-20-2006

And the warning was (just for my information)?

You may get some warnings, for example RMU-W-NOTRANAPP or RMU-W-USERECCOM.
These are not fatal; they are just warnings.

JF

Jefferson Humber · ‎07-20-2006

On all three occasions when you had to restore the DB, were you getting the same error?

I too have had Rdb corruption in the past, and got a similar response from Oracle.

Obviously over the years you have upgraded the OS & Rdb versions (7.3-2 & 7.1-441), but has the hardware always remained the same?

Are you running the latest F/W on your I/O subsystem? What about ECO patches for 7.3-2 (FIBRE_SCSI in particular)?

Jeff

I like a clean bowl & Never go with the zero

Thomas Ritter · ‎07-21-2006

JF, the error messages were different each time. I meant to write that we have had three Rdb failures that required a complete restore. The failures were all different. The HW is up to date, as are the VMS patches. We are fortunate in that testing our recovery procedures is an ongoing process.

Jeffrey Goodwin · ‎07-21-2006

Many years ago, we started receiving the 'retrying' Rdb OPCOM message. The retries must have always worked, because we never received the second (bugcheck) OPCOM message.

By moving the preferred paths around for the disks that contained our Rdb database, I was able to determine that an HSJ controller was the common link. We replaced the HSJ controller and never saw the message again. I added the OPCOM message to my ConsoleWorks scan profile to ensure I'd pick it up again if it ever returned.
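For anyone without a console manager, the same kind of scan can be sketched in a few lines of Python over saved operator log text. This is an illustration only: the pattern is taken from the message text quoted earlier in this thread, the function name find_retry_events is made up, and how the log text is obtained is left open.

```python
import re

# Pattern for the Rdb "retrying disk read" event text, as quoted in this thread.
RETRY_RE = re.compile(
    r"Requested page (\d+):(\d+), received page (\d+):(\d+); retrying disk read"
)

def find_retry_events(log_text):
    """Return (req_area, req_page, got_area, got_page) tuples found in log_text."""
    return [tuple(int(g) for g in m.groups()) for m in RETRY_RE.finditer(log_text)]

# OPCOM message copied from the first post in this thread.
sample = """\
%%%%%%%%%%% OPCOM 19-JUL-2006 16:02:06.15 %%%%%%%%%%%
Message from user API_PROD on WIZ22
Oracle Rdb V7.1-441 Event Notification for Database
DSA30:[WIZ_CMPRD.DATA.RDB30]WIZARD_DATA.RDB;1
Requested page 15:330926, received page 0:0; retrying disk read
"""

for event in find_retry_events(sample):
    print(event)
# prints (15, 330926, 0, 0)
```

Catching the retry message early matters because, as described above, the retries can succeed silently for a long time before a read finally fails.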

At no point did the HSJ controller/disks log any type of error message.

Based on my experience, I would agree with Oracle that you likely have a hardware issue.

-Jeff
