Re: RDB poor Recovery or Rollback performance.

Thomas Ritter · ‎02-12-2006

We have a 4 node Disaster Tolerant Cluster running VMS 7.3-2 and Oracle Rdb V7.1-401. Each ES45 is configured with 24GB of RAM at 50% utilisation, and has 4 CPU installed. Peaks VMS locks are about 3,000,000 million. We run RDB with Global Buffers enabled. A typical user allocated about 900 locks. There are about 600 users attached to the main database.

When a number of users disconnect from the Database, typically because of LAN or Server related issues, RDB has to generate Rollbacks using Recovery Unit Journals (RUJ) files. Typically we may have 100 rollbacks in progress. Sometimes one of these takes up to 15 minutes, which in affect makes the database unusable. We are certain that I/O throughput has little to do with the problem. Accounting records indicates little I/O. The recovery work seems to be struggle with what we believe maybe "queuing" related or related to RDB obtaining "freeze lock" .

This is not a new problem and has troubled our systems for years. The customer is very unhappy given that they have invested millions in Wide Are Cluster, yet downtime is still being experience because of RDB and VMS workings.

Would anyone be able to suggest a course of study which will enable us to understand the factors involved in improving RDB Recovery performance ?

Volker Halle · ‎02-12-2006

Thomas,

are you running some performance measuremet tools (eg. ECP or even better T4) ? As these tools collect performance relevant data continously, they can be quite useful for 'after-the-fact' analysis.

T4 would allow you to analyse the system load from an OpenVMS perspective, so there are chances that you could identify a bottleneck, if it should be inside OpenVMS.

Volker.

Wim Van den Wyngaert · ‎02-12-2006

http://www.sciinc.com/remotedba/techinfo/articles/ti4q2.asp ?

Wim

Wim

Jeroen Hartgers_3 · ‎02-13-2006

I think you need a consultand. Because rdb performance is difficult stuff, but there is a lot of performance profit to make.

maybe the question should be why are there so many rollbacks.

may be you should contact www.vxcompany.com they know everything about vms, rdb and oracle

Wim Van den Wyngaert · ‎02-13-2006

May be it is simply the rollback that takes a long time. E.g. if doing the transaction takes 10s to complete, rollback will take by average 5s (depending on what you exactly do).

And since all rollbacks are serialized, you get a block of about 500 s.

But you talk about 15 minutes. So may be there are transactions of 15 minutes to complete. Try to make them smaller/commit faster.

Wim

Wim

Jean-François Piéronne · ‎02-13-2006

How many recovery buffers?

If the value is small and you plenty of free buffers increase this can give a significant boost.

you can use:
$ pipe rmu/dump your_database | sea tt: "recovery buffer

And increase the parameter using
sql> alter database filename your_database
number of recovery buffers is xxx;

where xxx is the number of recovery buffer.

Jean-FranÃ§ois

Thomas Ritter · ‎02-13-2006

Jean-FranÃ Â§ois,

Our recovery buffer values are

Default recovery buffer count is 500

I need to understand what 500 means and what the valid range of values is.
Maybe this is the default value.

Thomas

John Donovan_4 · ‎02-14-2006

Thomas - we have a small 3GB database and our recovery buffer setting is 1700. If you have a large DB & many users, 500 is way too small. See section 10.2 in 7.0 "Guide to Database Maintenance" guide. I got fantastic help from Oracle when I set this up many moons ago.

jd

"Difficult to see, always in motion is the future..."

John Donovan_4 · ‎02-14-2006

Follow up from Oracle page:
Subject: RDBPROD: Impact of NUMBER OF RECOVERY BUFFERS on Row Cache DBR Performance
Doc ID: Note:111080.1 Type: WHITE PAPER
Last Revision Date: 18-JUL-2005 Status: PUBLISHED
Title: Impact of NUMBER OF RECOVERY BUFFERS on Row Cache DBR Performance
Product: Oracle Rdb
Op/Sys: OpenVMS

The database "NUMBER OF RECOVERY BUFFERS" parameter specifies the number database buffers that recovery (DBR) process will use while performing recovery operations. As with user processes, the number of buffers for a DBR process can have a dramatic effect on performance. Generally, increasing the number of buffers for the DBR process will have a positive effect on performance when the DBR process is performing recovery for large transactions (either for rollback of aborted transactions or for recovery (REDO) of many or large transaction when the database FAST COMMIT feature is enabled).

The performance of the database recovery (DBR) process can be especially significant when the Row Cache feature is enabled. After a database or node failure (system crash, for example) for a database with Row Cache enabled, when the database is re-opened, the monitor will create a DBR process to recover the database. This DBR process performs the following steps:

1. All row caches are recovered from the backing store (.RDC) files

2. The oldest checkpoint (of either the "fast commit" or the Record Cache Server (RCS) checkpoint) for any database user is determined

3. All transactions starting at the oldest checkpoint found are (re)applied to the database

4. Each active transaction is rolled back

Because of the potential database I/O required to perform steps 3 and 4, a larger number of buffers can help reduce the duration of the database recovery process.

Testing demonstrates how significant the number of DBR buffers can be on the performance of re-opening a database after node failure. A test case involving 25 users performing 1,000 transactions each (a total of 25,000 transactions) was interrupted by a simulated system crash. After the system was restarted, the database was opened. A specially instrumented database recovery (DBR) process was used to measure the amount of CPU time consumed along with the number of I/Os performed for various portions of the recovery.

With the default of 20 recovery buffers for the DBR process, a total of 53,542 disk I/Os were performed during the REDO portion of recovery. At a rate of 100 I/Os per second, this step would require about 9 minutes. Increasing the buffer count to 1,000 reduced the number of disk I/Os for this step to 4,342. At the same rate of 100 I/Os per second, the REDO step would now take less than 1 minute. However, for this particular test, increasing the number of buffers for the DBR process to 2,000 only reduced the I/O count to 4,262 and further increases of the buffer count resulted in no additional I/O decrease.

Generally, if global buffers are not enabled and there is sufficient memory on the system, a large number of recovery buffers for the DBR process is beneficial. It is important, however, to avoid specifying so many buffers that the DBR process page faults excessively.

When global buffers are enabled, the number of buffers for the database recovery process is limited to the global buffer USER LIMIT quota. In order to increase the number of buffers for the DBR process, both the "USER LIMIT" and "NUMBER OF RECOVERY BUFFERS" parameters may need to be increased.

For node failure recovery when the ROW CACHE feature is enabled, a value of 5000 or 10000 for buffers for the DBR process may be reasonable. The optimum number of buffers will vary greatly depending on the number and complexity of transactions and processes to be recovered and is thus quite application and site specific.
Copyright Â© 2000 by Oracle Corporation. All Rights Reserved.

"Difficult to see, always in motion is the future..."

Thomas Ritter · ‎02-14-2006

Thanks very much for the good advice. However, we are sure we do not have an I/O related problem. The problem seems to be with RDB. Mostly likely something internal. Configuraton maybe.

Chris Barratt · ‎02-14-2006

Thomas,

One thing you might want to look at is the maximum number of processes that can be created on your node. When you get a mass disconnect from your database, a recovery process is created for each attach.

If you have say 300 people attached, and only 200 free process slots in VMS, things can get a bit tangled.

To get around this, you can increase the VMS sysgen limit - maxprocesscnt I think, or alternatively there is an rdb logical which controls how many recovery processes get created at once per database - it is rdm$bind_max_dbr_count.

In our setting, where processes may be connected to up to 10 dbs at once, I have set this to 10, and we have about 350 processes on a system with a maxprocesscnt of around 1500.

This may be your problem, particularly if dbs are open on multiple nodes.

Cheers,
Chris

P.S. In rmu/show stat, the "DIJ" keystroke combination takes you to a DBR activity screen which may help you see what is going on.

Thomas Ritter · ‎02-14-2006

Chris, we have first hand experience what happens when balance set slots are exhausted. The Database will close down with a useful error message why.
We have balance set slots set to about 2048 and average 800 processes per node. We also have Global Buffers enabled on most of our databases, which means that the values of the DBR do not apply to those databases.

From

7.1.2 RDM$BIND_MAX_DBR_COUNT Documentation Clarification Bugs 1495227 and 3916606 The Rdb7 Guide to Database Performance and Tuning Manual, Volume 2, page A-18, incorrectly describes the use of the RDM$BIND_MAX_DBR_COUNT logical. Following is an updated description.

The RDM$BIND_MAX_DBR_COUNT logical name and the RDB_ BIND_MAX_DBR_COUNT configuration parameter define the maximum number of database recovery (DBR) processes to be simultaneously invoked by the database monitor for each database during a "node failure" recovery. This logical name and configuration parameter apply only to databases that do not have global buffers enabled. Databases that utilize global buffers have only one recovery process started at a time during a "node failure" recovery. In a node failure recovery situation with the Row Cache feature enabled (regardless of the global buffer state), the database monitor will start a single database recovery (DBR) process to recover the Row Cache Server (RCS) process and all user processes from the oldest active checkpoint in the database.

Thomas

Jean-François Piéronne · ‎02-14-2006

It would be interesting to look use "RMU/show statistics " and look at
"Recovery Statistics Screen"
(Journaling Information -> Recovery Statistics)
and when recovery are in progress the "DBR Activity Screen"

JF

Jean-François Piéronne · ‎02-14-2006

Another question:
Is fast commit enable on your database?

If yes which parameters are used (checkpoint interval and others)

JF

John Donovan_4 · ‎02-15-2006

"the database monitor will start a single database recovery (DBR) process to recover the Row Cache Server (RCS) process and all user processes from the oldest active checkpoint in the database."

And when this happens I believe the monitor will use the DB recovery buffers. (someone correct me if I'm wrong) 500 is too small. Have you tried incresing this parameter in the DB? Do you have a test system where you can reproduce this or is it all on production?

"Difficult to see, always in motion is the future..."

John Donovan_4 · ‎02-15-2006

Oh I almost forgot the performance issues we had on 7.1 which Oracle helped us with:
1.) Disable FIB (fast incremantal backup)
"alter database filename db_name no incremental backup scan optimization;"

2.) RMU/COLLECT OPTIMIZER/STATISTICS=(CARD,STORAGE) dbroot
or
rmu/collect optimizer db_name/system_relations

I know these helped our performance overall but not sure it will help you w/ DBR. IS ALS running?

"Difficult to see, always in motion is the future..."

Jean-François Piéronne · ‎02-15-2006

John,

Rdb will start only one DBR process at the startup of the database when the database was shutdown and when fast commit is enabled.

When many process failed due to a network problem, for example, Rdb start as many DBR as there are failed process (except if you have defined RDM$BIND_MAX_DBR_COUNT), but the DBR run serially.

Disabled FIB reduce contention on SPAM when many user update (insert, delete) row in the same range of pages

RMU/COLLECT help the optimizer but has no effect on DBR.
DBR only use dbkeys to access rows

JF

John Donovan_4 · ‎02-15-2006

Well I thought it was worth a shot... :))

"Difficult to see, always in motion is the future..."

Thomas Ritter · ‎02-16-2006

We came of this logical name, which we intend to explore.

RDM$BIND_DBR_LOG_FILE.
This logical generates a recovery log file for each recovery process which can give additional useful information for troubleshooting.
The recovery log file includes:
System information.
Active processes.
Specific process information for the recovery process.
Defined and undefined Rdb-related logicals.
DBR event tracking with associated timestamps.

$ DEFINE/SYSTEM device:[dir]DBR_PID.LOG

Thomas

Richard J Maher · ‎02-18-2006

Hi Thomas

I'm sure you already have but, I suggest getting on touch with Oracle Rdb Support on this. 15 mins is an awfully long time! The good news is that you have enough time to find out what the process currently holding the FREEZE lock is actually doing. (I couldn't find if you said that you had Fast Commit enabled?)

Do you see a lot of these messages in the Rdb Monitor log file -

- Dead process transaction 0:0 was not active

Anyway FWIW, I've attached an old discussion from the JCC Listserver (also worth a shot for assistance)

When you get a chance, please post the results as to whether it was a matter of trying to tune/speed up the DBR process, or if there was some other blocking issue.

Cheers Richard

----- Original Message -----
From: "Richard Maher"
To:
Sent: Saturday, March 16, 2002 9:57 PM
Subject: Re: Freeze!

> Hi Federico,
>
> Did you contact Oracle support and if so what was the answer?
>
> Why does the DBR hang on to the freeze lock for so long?
>
> Does it have to hang on to it until recovery is complete?
>
> I used to think it was only held long enough to kick-off and initialize the
> recovery process. (But up until last week, when I looked at some abnornal
> terminations in more detail, I also thought it was the monitor that took out
> the freeze lock, so what do I know)
>
> I took John Creed's advice and turned on the DBR log file and could see
> evidence of pregnant pauses in the timestamps but nothing to say why. From
> RMU/SHOW LOCKS the recovery process clearly has the freeze lock in PR mode
> and nothing, like a long verb, is holding it up but why does it hang on to
> it for so long?
>
> The flip side of this problem is why do we have abnormal terminations? In my
> case it is users clicking the |X| while in a DCL server option and not
> exiting gracefully and they should be persuaded not to, but what can you do?
> Having said that, if I was defending them in a court of law I'd have to ask
> "Why is Rdb starting a recovery process and freezing the database when
> there's clearly nothing to recover?" Do you get alot of these?
>
> - Dead process transaction 0:0 was not active
>
> I tried with pre-started transaction off and although there was no
> transaction to rollback, an RUJ file still seemed to be created (which is
> probably a good idea for performance).
>
> $rmu/dump user
> Active user with process ID 00006848
> Stream ID is 1
> Monitor ID is 1
> Transaction ID is 339
> Recovery journal filename is "MF_PERSONNEL$00019FEAE8D3.RUJ;1"
> No transaction in progress
> Last Process quiet-point was AIJ sequence 2
>
> *The monitor log
> 6-MAR-2002 15:38:32.84 - received recovery status from 00004B78:1
> - process name RDM_RB_1, user SYSTEM
> - for database "MF_PERSONNEL.RDB;1"
> - Dead process transaction 0:0 was not active
>
> So the story so far is:-
>
> . It is a given that abnormal terminations should be eliminated wherever
> possible and users should be reined in. (Especially in the light of
> undo/redo and Fast Commit)
> . Rdb's behaviour (and prestarted transactions in particular) is all geared
> up for performance and this section of the code isn't gonna be changed for a
> couple of dodgy users.
>
> But is there a third option with executive mode rundown procedures? The
> attached code doesn't work and even of it did I don't think the DSRI Rdb$
> routines are expecting to be called from EXEC mode, but couldn't Rdb itself
> at process rundown, see that there was no transaction (or only a pre-started
> transaction) active and do enough to tidy up and tell the monitor this was a
> normal termination? If there was a real txn active then just fall over as
> usual.
>
> I don't know when or where in the death of a process these rundown routines
> are called and maybe the status of Rdb at such times may be too undefined,
> but is it worth a look?
>
> Did you know when using SQLPRE with rdms$keep_prep_files on you only get an
> .MOB object file and not the usual .MAR file that you get with RDBPRE?
>
> Should I have tried to map sql$transaction_ptr instead for SQL?
>
> Depending on when you kill (ctrl-y,stop) any Rdb program that has been
> linked to the following, you can get RDB-E-OPEN_TRANS or COSI-F-ACCVIO and
> so on. If you have a recent copy of the DSRI handbook or you know what the
> MACRO for a SQL>DISCONNECT ALL; would look like then maybe you could have a
> go.
>
> Regards Richard Maher
>
> PS. %AMAC-I-RUNTIMSTK Is the best, the smartest and just the most
> superlative message I have ever seen!
>
>
> $ on warning then exit
> $ if .not. f$privilege("cmkrnl,sysprv") then goto no_priv
> $ if f$getsyi("arch_name") .nes. "Alpha" then goto no_vax
> $!
> $ create rdb_shut.mar
>
> .title Rdb_shut
> .ident "V1.0"
>
> .library "sys$library:lib.mlb"
>
> $plvdef
> $ssdef
>
> msg_vec:
> .long 1
> msg_sts:
> .long ss$_abort
>
> my_rdb_vector:
> .long 0
> rdb$lu_status:
> .long 0
> .blkl 18
>
> .psect
> rdb$transaction_handle,nopic,ovr,rel,gbl,noshr,noexe,rd,wrt,novec,quad
> my_txn_handle:
> .long 0
> .blkb 4 ; Round up
>
> .psect pers,nopic,ovr,rel,gbl,noshr,noexe,rd,wrt,novec,quad
> my_db_handle:
> .long 0
> .long 12
> .ascii /MF_PERSONNEL/
>
> .psect _maher$code,pic,con,rel,lcl,shr,exe,rd,nowrt,quad
>
>
> exec_rundown: .jsb_entry ; Entry point for rundown
>
> pushl #0 ; Caller PC ????
> pushal my_txn_handle
> tstl @(sp) ; Transaction active?
> beql 1$
> pushal my_rdb_vector
>
> calls #03,g^rdb$rollback_transaction ; Why is the standard
> commit?
> blbs r0,2$
> $putmsg_s -
> msgvec=my_rdb_vector
> brb 2$
> 1$:
> $putmsg_s -
> msgvec=msg_vec
>
> addl2 #8, sp
> 2$:
> pushl #0 ; Caller PC
> pushal my_db_handle ; DBhandle
> tstl @(sp) ; Is it already detached
> beql 3$ ; Skip this one if so
> pushal my_rdb_vector
> calls #03,g^rdb$detach_database
> blbs r0,4$
> $putmsg_s -
> msgvec=my_rdb_vector
> brb 4$
> 3$:
> movzwl #ss$_normal,msg_sts
> $putmsg_s -
> msgvec=msg_vec
> addl2 #8, sp
>
> 4$: rsb
>
>
> .PAGE
> .SBTTL Privileged Library Vector
>
> .psect dickie$services,page,vec,pic,nowrt,exe
>
> .long plv$c_typ_cmod ; Set type of vector to change
> ; mode dispatcher
> .long 0 ; Reserved
> .long 0 ; # of Kernel mode routines
> .long 0 ; # of Executive mode routines
> .long 0 ; Kernel routine list
> .long 0 ; Exec routine list
> .long 0 ; Kernel rundown handler
> .address exec_rundown ; Exec rundown handler
> .long 0 ; RMS Dispatcher
> .long 0 ; Kernel routine flags
> .long 0 ; Exec routine flags
>
> .end
>
> $!
> $ macro/list/enable=quad rdb_shut.mar
> $!
> $ link /share=rdb_shut.exe -
> /sysexe -
> /map -
> /cross -
> /full -
> /notrace -
> rdb_shut.obj, -
> sys$input:/options
>
> gsmatch=lequal,2,0
>
> symbol_vector = ( -
> pers=psect, -
> rdb$transaction_handle=psect -
> )
>
> protect=yes
> collect=safe,_maher$data
> protect=no
> collect=user_rw,pers,rdb$transaction_handle
>
> $!
> $copy/log rdb_shut.exe sys$common:[syslib]
> $!
> $if f$file_attributes("sys$share:rdb_shut.exe","KNOWN")
> $then
> $ installx replace sys$share:rdb_shut.exe
> $else
> $ installx add sys$share:rdb_shut.exe /open/header/share/protect
> $endif
> $!
> $purge sys$share:rdb_shut.exe
> $exit
> $!
> $no_priv:
> $ write sys$output "Insufficient privilege. You need (CMKRNL,SYSPRV)"
> $ exit 44
> $no_vax:
> $ write sys$output "This code only works on alpha"
> $ exit 44
>
>
> ----- Original Message -----
> From: Paul Mead
> To:
> Sent: Monday, March 04, 2002 11:36 PM
> Subject: Re: Freeze!
>
>
> > I have to agree that disabling Fast Commit is a sure way to reduce the
> > ability of your application to get work done quickly while possibly not
> > having any affect on the problem you are trying to solve. Rather than
> using
> > the "shotgun" approach to try and resolve this problem why don't you just
> > call Oracle support and get someone who understands, or can find someone
> who
> > understands, DBRs to help you with the problem?
> >
> > ----- Original Message -----
> > From: "Federico Monteverde"
> > To:
> > Sent: Monday, March 04, 2002 3:50 PM
> > Subject: RE: Freeze!
> >
> >
> > > Ok, but our problem is we are already having long recovery times. Very
> > > frequently our database is being freezed for about 20 seconds and more.
> > >
> > > Regards,
> > > Federico
> > >
> > > -----Mensaje original-----
> > > De: Ian Smith [mailto:Ian.E.Smith@oracle.com]
> > > Enviado el: lunes 4 de marzo de 2002 19:06
> > > Para: oraclerdb@jcc.com
> > > Asunto: Re: Freeze!
> > >
> > >
> > > Ok, so ROLLBACK may take longer, is that really an issue? Since most
> > > applications do vastly more COMMIT actions than ROLLBACK then this
> should
> > be
> > > a
> > > reason to *use* FAST COMMIT.
> > >
> > > I believe that savings using FAST COMMIT make any occassional ROLLBACK
> > > overhead
> > > insignificant. It also enables a lot of very important Rdb
> functionality.
> > >
> > > Ian
> > >
> > > Federico Monteverde wrote:
> > >
> > > > Fast commit is disabled. According to documentation it could increase
> > > > rollback times.
> > > >
> > >
> > > --
> > > Ian Smith Read the Technical Corner column
> > > Oracle Rdb Engineering Group in the Rdb Web Journal
> > > email: Ian.E.Smith@Oracle.com http://www.oracle.com/rdb
> > > (Standard disclaimer: The statements and opinions expressed here are my
> > own
> > > and do not necessarily represent those of Oracle Corporation)
> > >
>
>

Volker Halle · ‎02-19-2006

Thomas,

if you weren't running any performance data collector to observe the system load at the time of db recovery, you've certainly looked at accounting.

What do you see: lots of DBR processes with a long elapsed time and only few IOs ?

Reading this thread (and some other info obtained by googling), it seems to indicate that db recovery is single-threaded if global buffers are enabled on the database. The DBR process(es ?) seem to hold the FREEZE lock in PR mode while recovering. Any other processes need this lock in CW to do any database transactions: CW is incompatible with PR, so they have to wait.

Volker.

Thomas Ritter · ‎02-20-2006

The logging is working.
Directory DSA115:[RUJ_LOGS]

DBR_0000E745.LOG;1 105KB/108KB 21/02/06 02:01:16.31
DBR_00016B66.LOG;1 109KB/112KB 20/02/06 16:27:26.63
DBR_0001836F.LOG;1 109KB/112KB 20/02/06 16:32:27.99
DBR_00018380.LOG;1 109KB/112KB 20/02/06 16:40:44.50
DBR_00018B6E.LOG;1 109KB/112KB 20/02/06 16:32:27.99
DBR_00019750.LOG;1 102KB/103KB 21/02/06 02:01:20.59
DBR_00019B4F.LOG;1 102KB/103KB 21/02/06 02:01:20.71
DBR_00019F4C.LOG;1 105KB/108KB 21/02/06 02:01:16.25
DBR_0001A351.LOG;1 103KB/103KB 21/02/06 02:01:19.33
DBR_0001A726.LOG;1 105KB/108KB 21/02/06 02:00:20.20
DBR_0001A753.LOG;1 102KB/103KB 21/02/06 02:01:21.77
DBR_0001AB47.LOG;1 105KB/108KB 21/02/06 02:01:16.12
DBR_0001AB4E.LOG;1 103KB/103KB 21/02/06 02:01:20.39
DBR_0001AF33.LOG;1 105KB/108KB 21/02/06 02:01:16.30
DBR_0001AF4D.LOG;1 103KB/103KB 21/02/06 02:01:19.91
DBR_0001AF55.LOG;1 101KB/103KB 21/02/06 02:01:22.62
DBR_0001AF56.LOG;1 101KB/103KB 21/02/06 02:01:22.65
DBR_0001B342.LOG;1 105KB/108KB 21/02/06 02:01:16.29
DBR_0001B732.LOG;1 104KB/108KB 21/02/06 02:01:16.32
DBR_0001B746.LOG;1 105KB/108KB 21/02/06 02:01:16.29
DBR_0001BB43.LOG;1 105KB/108KB 21/02/06 02:01:16.28
DBR_0001BB52.LOG;1 102KB/103KB 21/02/06 02:01:21.53
DBR_0001BB54.LOG;1 101KB/103KB 21/02/06 02:01:22.46
DBR_0001BF35.LOG;1 105KB/108KB 21/02/06 02:01:16.30
DBR_0001C334.LOG;1 105KB/108KB 21/02/06 02:01:16.30

Total of 25 files, 2.55MB/2.61MB

Just looking at the tail of one file gives

20-FEB-2006 16:27:26.75 - Recovering dead process 0001A35F:1
20-FEB-2006 16:27:26.76 - Database "$1$DGA6:[WIZ_CMTST.DATA.RDB]WIZARD_DATA.RDB;1"
20-FEB-2006 16:27:26.77 - Node failure = 00
20-FEB-2006 16:27:26.77 - Rcache Node failure = 00
20-FEB-2006 16:27:26.77 - OPT Node failure = 00
20-FEB-2006 16:27:26.77 - Recover all = 00
20-FEB-2006 16:27:26.78 - ===== Setting process context to 0001A35F:1 =====
20-FEB-2006 16:27:26.78 - Inherited RTUPB slot = 15
20-FEB-2006 16:27:26.78 - Recovering USER process
20-FEB-2006 16:27:26.78 - TID = 299
20-FEB-2006 16:27:26.78 - RUJ filename = "COMMON:[RDM$RUJ]WIZARD_DATA$0001018B1892.RUJ;1"
20-FEB-2006 16:27:26.80 - Fast commit = 00
20-FEB-2006 16:27:26.80 - Commit-to-Journal = 00
20-FEB-2006 16:27:26.80 - Updating dashboard information
20-FEB-2006 16:27:26.80 - Recovering AIJ information
20-FEB-2006 16:27:26.80 - Initializing AIJ EOF to 7423:41990
20-FEB-2006 16:27:26.81 - AIJ recovery ELAPSED: 0 00:00:00.00 CPU: 0:00:00.00 BUFIO: 0 DIRIO: 0 FAULTS: 1
20-FEB-2006 16:27:26.81 - Waking up hibering processes
20-FEB-2006 16:27:26.86 - Recovering recoverable latches
20-FEB-2006 16:27:26.86 - Default checkpoint location: -1:-2
20-FEB-2006 16:27:26.86 - REDO not necessary; fast commit disabled

20-FEB-2006 16:27:26.86 - Starting transaction UNDO for TSN 0:342868205
20-FEB-2006 16:27:26.88 - COMMIT TSN=0:0 at TSNBLK [1, 15] for TID 299
20-FEB-2006 16:27:26.88 - UPB$TSN=0:342868205, AIJ completed TSN=0:0, AIJ Active TSN=0:0
20-FEB-2006 16:27:26.88 - Appending AIJ entry TID=299 TSN=0:342868205 TYP=R
20-FEB-2006 16:27:26.89 - UNDO recovery ELAPSED: 0 00:00:00.02 CPU: 0:00:00.00 BUFIO: 4 DIRIO: 4 FAULTS: 5
20-FEB-2006 16:27:26.89 - TSN 0:342868205 was rolled back
20-FEB-2006 16:27:26.94 - Total recovery duration 0.09 seconds
20-FEB-2006 16:27:26.94 - Waking up hibering processes
20-FEB-2006 16:27:26.96 - Recovery of process 0001A35F:1 complete

We are not running Fast Commit.

Richard J Maher · ‎02-20-2006

Hi Thomas,

0.20secs elapsed time doesn't seem to add up to 15mins. I think we can safely say that this particular DBR was not the cause of your problem :-)

Are some of the logs bigger than others? Is the interval between create/modified up to 15 mins on some/one of them? I'm sure you know how to do a $rmu/sh sys and edit the monitor log and match the PID of the abnormal terminations?

Anymore clues?

It looks like your DBRs actually have something to rollback? Do you have transactions open over terminal I/O? Do you have a runaway endless loop that writes rubbish to the DB and when the user gets fed up waiting (or it crashes) the DBR takes 15mins to undo the changes?

Clutching at straws, sorry. Without logging on, or getting Rdb support involved (they need the calls :-) I don't think there's much can be done.

Cheers Richard

Jean-François Piéronne · ‎02-21-2006

You can search "UNDO recovery ELAPSED:"
in all recovery log files, to find those which have long elapsed.

As you have not enable fast commit, you DBR have only to do undo, no redo.

Remember DBR process run sequentially.

JF

Thomas Ritter · ‎02-21-2006

Richard, these logs is my way of showing that we now have RUJ logging enabled. The logging will be enabled in Production this Friday. Notice most of the times were about 02:00 am. That's when we logoff users to commence housekeeping. A $forces does not disconnect the user, but the $delprc does.

There maybe application issues.

We have global buffers enabled, but Fast Commit disabled. Fast Commit was aborted about 5 years ago because of poor recovery peformance.

The DBAs have now logged a call with Oracle.

When we have the next big mess in production, I will post some of the RUJ logs. If I find out that this logical RDM$BIND_DBR_LOG_FILE is not something recent, I will not be happy.

Thomas

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Re: RDB poor Recovery or Rollback performance.

RDB poor Recovery or Rollback performance.