Re: Repairing Rdb corrupt pages

Jeremy Begg · ‎02-09-2012

Hi,

I inherited responsibility for a system a little over 12 months ago. It's an AlphaServer DS20 running VMS 7.2-1 with Rdb 7.0-5 and an application called DECfin. I don't have much Rdb experience but this system is slated for retirement Real Soon Now so the customer has not wanted to devote a lot of time to it.

Today they alerted me to an error in the RMU backup job:

$       rmug -
                /backup -
                /online -
                /lock_timeout=3600 -
                /log -
                $1$dra3:[decfin22.data]catch22.rdb -
                $1$dra4:[decfin_rmu]ca22.rbf
%RMU-I-QUIETPT, waiting for database quiet point
%RMU-E-CORPAGPRES, Corrupt or inconsistent pages are present in area $1$DRA3:[DECFIN22.DATA]CATCH22_AREA02.RDA;1
%RMU-F-FATALERR, fatal error on BACKUP
%RMU-F-FTL_BCK, Fatal error for BACKUP operation at 9-FEB-2012 22:30:02.55

They casually mentioned this had been happening "for a few days", and a review of the previous log files for this job indicates it started on 2nd February, i.e. a week ago. I also found this in OPERATOR.LOG:

%%%%%%%%%%% OPCOM   2-FEB-2012 22:05:44.23 %%%%%%%%%%%
Message from user SYSTEM on ORFF
Oracle Rdb V7.0-5 Database DISK$DECFIN22:[DATA]CATCH22.RDB;1 Event Notification
Page 3:743742 checksum error - computed 4684D055, page contained 525468E9; retrying disk read

%%%%%%%%%%% OPCOM   2-FEB-2012 22:05:48.65 %%%%%%%%%%%
Message from user SYSTEM on ORFF
Oracle Rdb V7.0-5 Database DISK$DECFIN22:[DATA]CATCH22.RDB;1 Event Notification
Page 3:743742 checksum error - computed 4684D055, page contained 525468E9; retrying disk read

%%%%%%%%%%% OPCOM   2-FEB-2012 22:06:19.06 %%%%%%%%%%%
Message from user SYSTEM on ORFF
Oracle Rdb V7.0-5 Database DISK$DECFIN22:[DATA]CATCH22.RDB;1 Event Notification
Page 3:743742 checksum error - computed 4684D055, page contained 525468E9; retrying disk read

%%%%%%%%%%% OPCOM   2-FEB-2012 22:06:32.47 %%%%%%%%%%%
Message from user SYSTEM on ORFF
Oracle Rdb V7.0-5 Database DISK$DECFIN22:[DATA]CATCH22.RDB;1 Event Notification
Page 3:743742 checksum error - computed 4684D055, page contained 525468E9; retrying disk read

The RMU backup starts at 22:30 each night so the timestamps above are consistent with the backup job first failing on 2-Feb-2012. There are no other Rdb messages in OPERATOR.LOG (the system has been up since 20-Nov-2011).
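For anyone wanting to confirm from a longer OPERATOR.LOG that every checksum event refers to the same page, here is a minimal Python sketch. It assumes the message text is exactly as in the excerpt above, and that the first number in "Page 3:743742" is the storage area ID (which matches the "Live area ID number is 3" shown by RMU/SHOW CORRUPT). The function name and the sample string are illustrative only.

```python
import re
from collections import Counter

# Matches the Rdb checksum event text as it appears in the OPERATOR.LOG excerpt.
CHECKSUM_RE = re.compile(
    r"Page (\d+):(\d+) checksum error - computed ([0-9A-F]{8}), "
    r"page contained ([0-9A-F]{8})"
)

def summarize_checksum_errors(log_text):
    """Count checksum events keyed by (area_id, page, computed, stored)."""
    hits = Counter()
    for m in CHECKSUM_RE.finditer(log_text):
        area, page, computed, stored = m.groups()
        hits[(int(area), int(page), computed, stored)] += 1
    return hits

# Two of the events from the log above, used as sample input.
sample = """\
Page 3:743742 checksum error - computed 4684D055, page contained 525468E9; retrying disk read
Page 3:743742 checksum error - computed 4684D055, page contained 525468E9; retrying disk read
"""

for (area, page, computed, stored), count in summarize_checksum_errors(sample).items():
    print(f"area {area} page {page}: {count} events, computed {computed}, stored {stored}")
```

A single key in the result means all events point at one page with the same pair of checksums, which is what the four messages above show.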

RMU/SHOW CORRUPT indicates a single page is corrupt:

$ rmug/show corrupt DRA3:[DECFIN22.DATA]CATCH22
*------------------------------------------------------------------------------
* Oracle Rdb V7.0-5                                     10-FEB-2012 13:57:48.26
*
* Dump of Corrupt Page Table
*     Database: DRA3:[DECFIN22.DATA]CATCH22.RDB;
*
*------------------------------------------------------------------------------

Entries for storage area AREA02
-------------------------------

    Page 743742
        - AIJ recovery sequence number is -1
        - Live area ID number is 3
        - Consistency transaction sequence number is 0:0
        - State of page is: corrupt

*------------------------------------------------------------------------------
* Oracle Rdb V7.0-5                                     10-FEB-2012 13:57:48.27
*
* Dump of Storage Area State Information
*     Database: DRA3:[DECFIN22.DATA]CATCH22.RDB;
*
*------------------------------------------------------------------------------

All storage areas are consistent.

$

My limited understanding of Rdb is that this can be fixed fairly easily using the command

$ RMUG/RECOVER/JUST_CORRUPT/ONLINE/LOG DRA3:[DECFIN22.DATA]CATCH22

provided I have a backup from before the problem occurred, which would be the one from 1st Feb (if that's still available on tape) or 27th Jan (the last "weekly" backup, which is kept for longer).

However, this database does not have AIJ enabled, so does that mean the page in question will remain "stale" and possibly inconsistent with the rest of the database? (The previous system manager, who was reasonably proficient with Rdb, disabled the AIJs in mid-2010 for reasons unknown.)

As an aside, I found Jean-François Piéronne's comments on this subject in another thread from 2005, in which he suggests that simply closing and opening the database may fix it. So I'll try that first.

Thanks,

Jeremy Begg

Jean-François Piéronne · ‎02-09-2012

Hi Jeremy,

First check what this page contains: data or index.

You can fix the incorrect checksum using rmu/repair (with the database closed).

Then run rmu/verify/all on your database.

Using rmu/restore/just will restore the corrupt page, which will then become inconsistent.

You can clear the inconsistent flag (using rmu/set corrupt, or using rmu/alter), then verify the database.

If it is an index, just drop and recreate the index.

If it's a data page you may have lost some updates...

JF

Brad McCusker · ‎02-10-2012

Jeremy,

The lack of AIJs definitely complicates matters.

The first thing to do is to run a full verification on your database – this will give you an idea as to the scope of the problem (is it limited to only the one page, or is it more widespread, and is the nature of the corruption only “checksums” or page formats, AIP issues, etc.)

Once the db has been verified (and you are ready to move to the “repair” phase), we strongly recommend that you shut down the database and do a VMS backup of all the files that make up the database. This way, you will be able to get back to the point you are at right now. :) You can’t buy this afterwards.

If it turns out that it is just the checksum on the one page, then you are in luck (sort of). You can use rmu/alter to put the correct checksum on the page (if the page is indeed messed up, it won't fix that problem). You can then do a "verify" of the page from RMU alter to sanity check this page. Afterwards, you will need to use RMU/SET CORRUPT to clear the corrupt page table entry in the root file (otherwise, you will not be able to perform backups).

If it turns out that the page is really messed up, then it becomes more interesting. You could try to recover the specific page from the last backup (clearly, changes since then will be missing – which could result in inconsistencies between the data that was supposed to be on this page and other structures). After you do this, you would need to use rmu/set corrupt <root>/area=<area>/consistent to make the area appear “consistent” (again, changes made since the backup will be lost – and logical and physical inconsistencies could still exist).

RMU/Dump of the page before and after a restore (of the corrupt page) might be useful in determining if there were modifications to the page since the last backup. Also dumping the page would provide the timestamp of when the page was last updated (provided that the timestamp isn't part of the corruption).

Good luck - let me know if we can be of further help,

Brad McCusker

Software Concepts International

www.sciinc.com

Jeremy Begg · ‎02-11-2012

Hi JFP and Brad, thanks for your input.

FWIW, I have attached the output of RMUG/DUMP for this database.

JFP said to first check what the page contains, data or index ... how do I do that? (Yes, I really know very little about Rdb.)

I shut down the applications and did RMUG/CLOSE on all databases. RMUG/SHOW CORRUPT returned the same result as previously. I then ran RMUG/VERIFY/ALL on the database and it returned without reporting any output. I then tried RMUG/VERIFY/AREA which reported quite a few messages like this:

%RMU-W-PGSPAMENT, area RDB$SYSTEM, page 68697
the fullness value for this data page does not match
the threshold value in the space management page
expected: 3, computed: 0

...

%RMU-W-PGSPAMENT, area AREA01, page 216985
the fullness value for this data page does not match
the threshold value in the space management page
expected: 3, computed: 0

...

%RMU-W-PGSPAMENT, area AREA02, page 1679618
the fullness value for this data page does not match
the threshold value in the space management page
expected: 0, computed: 3
%RMU-E-CORRUPTPG, Page 743742 in area AREA02 is marked as corrupt.

$

but didn't give any other information about the nature of the corruption.
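When the verify output runs to many pages, it can help to tally the messages by severity before reading them one at a time. The Python sketch below is a rough aid, not part of RMU: it relies only on the standard OpenVMS message layout (%FACILITY-S-IDENT, text, where S is I, S, W, E or F), and the function names and sample text are made up for illustration.

```python
import re
from collections import Counter

# OpenVMS message header: %FACILITY-S-IDENT, text (S = I, S, W, E or F).
MESSAGE_RE = re.compile(r"^%(\w+)-([ISWEF])-(\w+), (.*)$", re.MULTILINE)

def summarize_messages(output):
    """Count messages by (severity, ident); continuation lines are ignored."""
    return Counter(
        (sev, ident) for _fac, sev, ident, _text in MESSAGE_RE.findall(output)
    )

def serious_idents(counts):
    """Return the idents reported with severity E (error) or F (fatal)."""
    return sorted({ident for (sev, ident) in counts if sev in ("E", "F")})

# Sample input copied from the verify output quoted in this post.
sample = """\
%RMU-W-PGSPAMENT, area RDB$SYSTEM, page 68697
the fullness value for this data page does not match
the threshold value in the space management page
expected: 3, computed: 0
%RMU-W-PGSPAMENT, area AREA01, page 216985
the fullness value for this data page does not match
the threshold value in the space management page
expected: 3, computed: 0
%RMU-W-PGSPAMENT, area AREA02, page 1679618
the fullness value for this data page does not match
the threshold value in the space management page
expected: 0, computed: 3
%RMU-E-CORRUPTPG, Page 743742 in area AREA02 is marked as corrupt.
"""

counts = summarize_messages(sample)
for (sev, ident), n in sorted(counts.items()):
    print(f"{sev} {ident}: {n}")
print("errors or worse:", serious_idents(counts))
```

On the sample this separates the PGSPAMENT warnings from the one E-level CORRUPTPG message, which is the only entry that concerns the corrupt page itself.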

The second attachment is a dump of the "corrupt" page as well as the page immediately before and after it. To this untrained eye it doesn't give any clue as to what's wrong.

The "corrupt" page has a timestamp of 31-JAN-2012 12:43:45.29 but I'm not sure I can locate a valid backup made between then and when the corruption was noted (2-Feb 22:06). This probably means if I do have to restore from backup it's going to be loading stale data :-(

Thanks,

Jeremy Begg

Chris Barratt · ‎02-14-2012

Hi Jeremy,

To restore just the page, you would want the first valid backup back from before the 2-Feb time.
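The selection rule can be sketched in Python as follows. This is only an illustration: the two backup times are assumptions built from the dates mentioned earlier in the thread (27-Jan and 1-Feb) plus the 22:30 job schedule, and the function names are invented. It picks the newest backup taken before the first checksum error and reports whether that backup postdates the page's own timestamp.

```python
from datetime import datetime

# OpenVMS absolute time format, e.g. 31-JAN-2012 12:43:45.29
VMS_FORMAT = "%d-%b-%Y %H:%M:%S.%f"

def parse_vms_time(text):
    """Parse an OpenVMS-style timestamp (month name matching is case-insensitive)."""
    return datetime.strptime(text, VMS_FORMAT)

def choose_backup(backup_times, first_error, page_last_updated):
    """Return (backup_time, is_current).

    backup_time is the newest backup taken before first_error, or None.
    is_current is True when that backup was taken after the page was last
    updated, i.e. restoring the page from it should not lose changes.
    """
    candidates = [t for t in backup_times if t < first_error]
    if not candidates:
        return None, False
    best = max(candidates)
    return best, best > page_last_updated

page_last_updated = parse_vms_time("31-JAN-2012 12:43:45.29")  # from the page dump
first_error = parse_vms_time("2-FEB-2012 22:05:44.23")         # first OPCOM event

# Assumed backup times: dates from the thread, 22:30 from the job schedule.
backups = [
    parse_vms_time("27-JAN-2012 22:30:00.00"),
    parse_vms_time("1-FEB-2012 22:30:00.00"),
]

best, is_current = choose_backup(backups, first_error, page_last_updated)
print(best, is_current)
```

The check says nothing about whether the chosen backup is itself valid, and it trusts the page timestamp, which may be part of the corruption.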

Looking at the page dump, it contains some nodes of an index, so you may have a choice even if you can't restore the page - you may be able to drop the index and recreate it. (Of course, the DROP may fail due to the corrupt page, but it would be worth a try).

If you are able to restore the page, it may be worth recreating the index anyway, just to be sure it is up to date and includes any changes since 2nd Feb - though I suspect if there had been any attempts to update they would have crashed due to the corrupt page.

Which index is it ?

Well the bit that says "index: set 198" on each b-tree node tells us that the logical area number of the index is 198.

So do,

$ rmu/dump/larea=rdb$aip/out=a.lis database_name

then

$search/win=8 a.lis "area 198"

That will show you the area name, which should correlate to an index name.
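If the listing has been copied off the VMS box, the same lookup can be done with a few lines of Python. This is a sketch under one assumption about the listing format: each AIP entry has a line containing "logical area N, physical area M", followed within a few lines by a line containing area name '...' (the layout seen in the dump excerpt posted later in this thread). The function name and sample text are illustrative.

```python
import re

def find_area_name(listing, logical_area):
    """Return the area name recorded for logical_area in an AIP dump, or None."""
    lines = listing.splitlines()
    # The trailing comma stops "logical area 198" from matching e.g. 1980.
    target = re.compile(rf"\blogical area {logical_area},")
    name_re = re.compile(r"area name '([^']*)'")
    for i, line in enumerate(lines):
        if target.search(line):
            # The quoted name is expected within the next few lines.
            for later in lines[i + 1 : i + 4]:
                m = name_re.search(later)
                if m:
                    # The dump pads unprintable bytes with dots; drop them.
                    return m.group(1).rstrip(".")
    return None

# Sample entry in the assumed format.
sample = """\
0003 00C6 014F logical area 198, physical area 3
0D 0153 area name length 13 bytes
00000058444E5F424F4A5F52545F4343 0154 area name 'CC_TR_JOB_NDX...'
000000000000000000000000000000 0164 area name '...............'
"""

print(find_area_name(sample, 198))
```

Unlike a plain text search for "area 198", this matches only the logical area number, so a physical area 198 or a logical area 1980 would not produce a false hit.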

Hope that helps,

Cheers,

chris

Jeremy Begg · ‎02-16-2012

Hi Chris,

I think we're getting close now ... here's what I found after dumping the RDB$AIP larea ...

entry #5
00020A7D 014B first area bitmap page 133757
0003 00C6 014F logical area 198, physical area 3
0D 0153 area name length 13 bytes
00000058444E5F424F4A5F52545F4343 0154 area name 'CC_TR_JOB_NDX...'
000000000000000000000000000000 0164 area name '...............'
000000DA 0173 snaps enabled TSN 218
00D7 0177 record length 215 bytes
00000000 0179 MBZ '....'
01 017D entry is in use
0000 017E MBZ '..'
000000 0180 thresholds are (0,0,0)
1A 0183 ** junk **
00 0184 record type unknown
00 0185 MBZ '.'

Looks to me like index CC_TR_JOB_NDX is the one to recreate.
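As a cross-check on reading the dump, the hex columns can be decoded by hand. The Python sketch below assumes the usual VMS dump convention that the hex bytes are printed in reverse order (highest address on the left), so the string has to be reversed byte by byte before decoding. The helper name is made up for illustration.

```python
def decode_dump_text(hex_field):
    """Decode a hex field from the dump into text.

    Assumes bytes are printed highest-address first, so they are
    reversed before decoding; trailing NUL padding is removed.
    """
    raw = bytes.fromhex(hex_field)[::-1]
    return raw.rstrip(b"\x00").decode("ascii")

# "area name" field from entry #5 above.
name = decode_dump_text("00000058444E5F424F4A5F52545F4343")
print(name)       # CC_TR_JOB_NDX
print(len(name))  # 13, matching "area name length 13 bytes" (hex 0D)

# The numeric fields are plain hexadecimal.
print(int("00C6", 16))      # 198, the logical area
print(int("0003", 16))      # 3, the physical area
print(int("00020A7D", 16))  # 133757, the first area bitmap page
```

The decoded name and length agree with the annotations printed on the right of the dump, which supports reading entry #5 as the index CC_TR_JOB_NDX in physical area 3.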

I'll post an update when I've been able to try it.

Thanks!

Jeremy

Jeremy Begg · ‎02-26-2012

I think we're up and running again!

With the help of a friendly local (thanks Mark!) the index in question was dropped, the page removed from the "corrupt pages" table, and the index rebuilt. I ran RMU/VERIFY/ALL and the only complaints were like this:

%RMU-W-PGSPAMENT, area AREA01, page 218375
the fullness value for this data page does not match
the threshold value in the space management page
expected: 3, computed: 0

which I don't think is a serious error.

RMUG/BACKUP now runs instead of immediately halting because of the corrupt page.

Thanks to JFP and Brad for your helpful comments, to Chris for telling me how to identify which index was broken, and to Mark Hurcombe for hands-on assistance on a Sunday morning.

Regards,

Jeremy Begg
