Re: RDB poor Recovery or Rollback performance.

Volker Halle · ‎02-19-2006

Thomas,

if you weren't running any performance data collector to observe the system load at the time of db recovery, you've certainly looked at accounting.

What do you see: lots of DBR processes with a long elapsed time and only a few I/Os?

This thread (and some other info obtained by googling) seems to indicate that db recovery is single-threaded if global buffers are enabled on the database. The DBR process(es?) seem to hold the FREEZE lock in PR mode while recovering. Any other process needs this lock in CW mode to do any database transaction. CW is incompatible with PR, so those processes have to wait.
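
To illustrate why a CW request stalls behind a PR holder, here is a minimal Python sketch of the standard OpenVMS lock manager compatibility matrix (modes NL, CR, CW, PR, PW, EX). The helper name `can_grant` is made up for illustration. This only models mode compatibility; it says nothing about how Rdb actually takes or releases the FREEZE lock.

```python
# OpenVMS lock modes: NL (null), CR (concurrent read), CW (concurrent write),
# PR (protected read), PW (protected write), EX (exclusive).
# COMPATIBLE[held] is the set of modes that can be granted alongside `held`.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def can_grant(requested, held_modes):
    """Return True if `requested` is compatible with every mode already held.

    `can_grant` is a hypothetical helper, not part of any OpenVMS or Rdb API.
    """
    return all(requested in COMPATIBLE[held] for held in held_modes)


if __name__ == "__main__":
    # A DBR holding the lock in PR blocks any user process asking for CW.
    print("CW while PR is held:", can_grant("CW", ["PR"]))
    # Several CW holders can coexist, which is the normal running state.
    print("CW while CW is held:", can_grant("CW", ["CW", "CW"]))
```

As long as one DBR holds the lock in PR, every CW requester queues behind it, so even short recoveries run one after another add up for the waiting users.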

Volker.

Thomas Ritter · ‎02-20-2006

The logging is working.
Directory DSA115:[RUJ_LOGS]

DBR_0000E745.LOG;1 105KB/108KB 21/02/06 02:01:16.31
DBR_00016B66.LOG;1 109KB/112KB 20/02/06 16:27:26.63
DBR_0001836F.LOG;1 109KB/112KB 20/02/06 16:32:27.99
DBR_00018380.LOG;1 109KB/112KB 20/02/06 16:40:44.50
DBR_00018B6E.LOG;1 109KB/112KB 20/02/06 16:32:27.99
DBR_00019750.LOG;1 102KB/103KB 21/02/06 02:01:20.59
DBR_00019B4F.LOG;1 102KB/103KB 21/02/06 02:01:20.71
DBR_00019F4C.LOG;1 105KB/108KB 21/02/06 02:01:16.25
DBR_0001A351.LOG;1 103KB/103KB 21/02/06 02:01:19.33
DBR_0001A726.LOG;1 105KB/108KB 21/02/06 02:00:20.20
DBR_0001A753.LOG;1 102KB/103KB 21/02/06 02:01:21.77
DBR_0001AB47.LOG;1 105KB/108KB 21/02/06 02:01:16.12
DBR_0001AB4E.LOG;1 103KB/103KB 21/02/06 02:01:20.39
DBR_0001AF33.LOG;1 105KB/108KB 21/02/06 02:01:16.30
DBR_0001AF4D.LOG;1 103KB/103KB 21/02/06 02:01:19.91
DBR_0001AF55.LOG;1 101KB/103KB 21/02/06 02:01:22.62
DBR_0001AF56.LOG;1 101KB/103KB 21/02/06 02:01:22.65
DBR_0001B342.LOG;1 105KB/108KB 21/02/06 02:01:16.29
DBR_0001B732.LOG;1 104KB/108KB 21/02/06 02:01:16.32
DBR_0001B746.LOG;1 105KB/108KB 21/02/06 02:01:16.29
DBR_0001BB43.LOG;1 105KB/108KB 21/02/06 02:01:16.28
DBR_0001BB52.LOG;1 102KB/103KB 21/02/06 02:01:21.53
DBR_0001BB54.LOG;1 101KB/103KB 21/02/06 02:01:22.46
DBR_0001BF35.LOG;1 105KB/108KB 21/02/06 02:01:16.30
DBR_0001C334.LOG;1 105KB/108KB 21/02/06 02:01:16.30

Total of 25 files, 2.55MB/2.61MB

Just looking at the tail of one file gives

20-FEB-2006 16:27:26.75 - Recovering dead process 0001A35F:1
20-FEB-2006 16:27:26.76 - Database "$1$DGA6:[WIZ_CMTST.DATA.RDB]WIZARD_DATA.RDB;1"
20-FEB-2006 16:27:26.77 - Node failure = 00
20-FEB-2006 16:27:26.77 - Rcache Node failure = 00
20-FEB-2006 16:27:26.77 - OPT Node failure = 00
20-FEB-2006 16:27:26.77 - Recover all = 00
20-FEB-2006 16:27:26.78 - ===== Setting process context to 0001A35F:1 =====
20-FEB-2006 16:27:26.78 - Inherited RTUPB slot = 15
20-FEB-2006 16:27:26.78 - Recovering USER process
20-FEB-2006 16:27:26.78 - TID = 299
20-FEB-2006 16:27:26.78 - RUJ filename = "COMMON:[RDM$RUJ]WIZARD_DATA$0001018B1892.RUJ;1"
20-FEB-2006 16:27:26.80 - Fast commit = 00
20-FEB-2006 16:27:26.80 - Commit-to-Journal = 00
20-FEB-2006 16:27:26.80 - Updating dashboard information
20-FEB-2006 16:27:26.80 - Recovering AIJ information
20-FEB-2006 16:27:26.80 - Initializing AIJ EOF to 7423:41990
20-FEB-2006 16:27:26.81 - AIJ recovery ELAPSED: 0 00:00:00.00 CPU: 0:00:00.00 BUFIO: 0 DIRIO: 0 FAULTS: 1
20-FEB-2006 16:27:26.81 - Waking up hibering processes
20-FEB-2006 16:27:26.86 - Recovering recoverable latches
20-FEB-2006 16:27:26.86 - Default checkpoint location: -1:-2
20-FEB-2006 16:27:26.86 - REDO not necessary; fast commit disabled

20-FEB-2006 16:27:26.86 - Starting transaction UNDO for TSN 0:342868205
20-FEB-2006 16:27:26.88 - COMMIT TSN=0:0 at TSNBLK [1, 15] for TID 299
20-FEB-2006 16:27:26.88 - UPB$TSN=0:342868205, AIJ completed TSN=0:0, AIJ Active TSN=0:0
20-FEB-2006 16:27:26.88 - Appending AIJ entry TID=299 TSN=0:342868205 TYP=R
20-FEB-2006 16:27:26.89 - UNDO recovery ELAPSED: 0 00:00:00.02 CPU: 0:00:00.00 BUFIO: 4 DIRIO: 4 FAULTS: 5
20-FEB-2006 16:27:26.89 - TSN 0:342868205 was rolled back
20-FEB-2006 16:27:26.94 - Total recovery duration 0.09 seconds
20-FEB-2006 16:27:26.94 - Waking up hibering processes
20-FEB-2006 16:27:26.96 - Recovery of process 0001A35F:1 complete

We are not running Fast Commit.

Richard J Maher · ‎02-20-2006

Hi Thomas,

0.20secs elapsed time doesn't seem to add up to 15mins. I think we can safely say that this particular DBR was not the cause of your problem :-)

Are some of the logs bigger than others? Is the interval between create/modified up to 15 mins on some/one of them? I'm sure you know how to do a $rmu/sh sys and edit the monitor log and match the PID of the abnormal terminations?

Any more clues?

It looks like your DBRs actually have something to roll back? Do you have transactions open over terminal I/O? Do you have a runaway endless loop that writes rubbish to the DB and when the user gets fed up waiting (or it crashes) the DBR takes 15 mins to undo the changes?

Clutching at straws, sorry. Without logging on, or getting Rdb support involved (they need the calls :-) I don't think there's much that can be done.

Cheers Richard

Jean-François Piéronne · ‎02-21-2006

You can search "UNDO recovery ELAPSED:"
in all recovery log files, to find those which have long elapsed.
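
On the VMS side a plain DCL SEARCH of the log directory for that string will list the lines. If the logs are copied off to another machine, something like the following Python sketch could rank them by UNDO time. It assumes the line format shown in the log excerpt earlier in this thread (days, then hh:mm:ss.cc); the function names, the `DBR_*.LOG*` file pattern and the one-second threshold are my own assumptions.

```python
import re
from pathlib import Path

# Matches e.g. "UNDO recovery ELAPSED: 0 00:00:00.02 CPU: ..."
UNDO_RE = re.compile(
    r"UNDO recovery ELAPSED:\s+(\d+)\s+(\d+):(\d+):(\d+(?:\.\d+)?)"
)


def undo_elapsed_seconds(line):
    """Return the UNDO elapsed time in seconds, or None if the line has none."""
    match = UNDO_RE.search(line)
    if match is None:
        return None
    days, hours, minutes, seconds = match.groups()
    return (int(days) * 86400 + int(hours) * 3600
            + int(minutes) * 60 + float(seconds))


def slow_undo(log_dir, threshold=1.0, pattern="DBR_*.LOG*"):
    """List (seconds, file name) for UNDO phases at or above `threshold`.

    The glob pattern is an assumption about how the copied files are named.
    """
    hits = []
    for path in sorted(Path(log_dir).glob(pattern)):
        for line in path.read_text(errors="replace").splitlines():
            secs = undo_elapsed_seconds(line)
            if secs is not None and secs >= threshold:
                hits.append((secs, path.name))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    for secs, name in slow_undo("."):
        print(f"{secs:10.2f}s  {name}")
```

The longest entries at the top of the list are the DBR logs worth reading in full.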

As you have not enabled fast commit, your DBRs only have to do undo, no redo.

Remember, DBR processes run sequentially.

JF

Thomas Ritter · ‎02-21-2006

Richard, these logs are my way of showing that we now have RUJ logging enabled. The logging will be enabled in Production this Friday. Notice most of the times were about 02:00 am. That's when we log off users to commence housekeeping. A $forcex does not disconnect the user, but the $delprc does.

There may be application issues.

We have global buffers enabled, but Fast Commit disabled. Fast Commit was abandoned about 5 years ago because of poor recovery performance.

The DBAs have now logged a call with Oracle.

When we have the next big mess in production, I will post some of the RUJ logs. If I find out that this logical RDM$BIND_DBR_LOG_FILE is not something recent, I will not be happy.

Thomas

Jean-François Piéronne · ‎02-22-2006

Thomas,
"A $forcex does not disconnect the user, but the $delprc does."

What you can do is a $forcex, wait a little, then a $delprc, so the process has time to gracefully roll back its open transaction. (It would be better to monitor the process and do the $delprc only when it has finished the rollback.)

This is, for example, what WASD does to stop script processes.
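
The pattern itself, independent of VMS, can be sketched in Python as below. The three callables are placeholders: on OpenVMS `request_exit` would wrap $FORCEX, `force_delete` would wrap $DELPRC, and `is_alive` would be whatever check you use to see that the process is gone. None of these wrappers exist here; the names, the 30-second grace period and the return values are assumptions for illustration.

```python
import time


def stop_gracefully(request_exit, is_alive, force_delete,
                    grace_seconds=30.0, poll_seconds=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Ask a process to exit, poll until it is gone, force it only on timeout.

    All three callables are hypothetical hooks supplied by the caller.
    Returns "graceful" if the process went away by itself, else "forced".
    """
    request_exit()                      # polite request (the $FORCEX step)
    deadline = clock() + grace_seconds
    while is_alive():
        if clock() >= deadline:
            force_delete()              # last resort (the $DELPRC step)
            return "forced"
        sleep(poll_seconds)
    return "graceful"


if __name__ == "__main__":
    # Stand-in process that disappears after three polls.
    remaining = {"polls": 3}

    def fake_alive():
        remaining["polls"] -= 1
        return remaining["polls"] > 0

    outcome = stop_gracefully(lambda: None, fake_alive, lambda: None,
                              poll_seconds=0.01)
    print(outcome)
```

Polling instead of a fixed wait means a process that rolls back quickly is not delayed, and one that needs longer is only deleted once the grace period has really expired.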

Jean-François

David Hammett · ‎06-28-2007

I realise the thread is a little aged, but can anybody advise me where I should go to get T4 for RDB setup details?

Hein van den Heuvel · ‎06-28-2007

David wrote>> I realise the thread is a little aged, but can anybody advise me where I should go to get T4 for RDB setup details?

Thomas, may be time to lock up this topic huh?

David, why not just start your own topic? The price is right!

Anyway, just google: openvms rdb t4 kit

Check out: http://www.jcc.com/JCC%20Presentations/Tracking%20Rdb%20Performance%20Using%20T4.pdf

It mentions: PerfT4.C from Metalink Note 282894.1

hth,
Hein.

Art Wiens · ‎06-30-2007

Thomas, I don't recall following this thread last year, but reading it now, I want to know how the story ended!

Was RDB support/HP able to help and "retain trust"?

Or did Management give up and order IT to run Oracle on an "alternative" platform?

Quite a cliffhanger! ;-)

Cheers,
Art

Robert Gezelter · ‎06-30-2007

Thomas,

Reading this thread, I will offer a couple of comments.

First, I heartily agree with the recommendation to run T4 and collect utilization statistics. I would also seriously consider implementing application-specific data collectors to monitor application (and possibly RDB) metrics. Having these data collectors running on an ongoing basis allows retrospective analysis of what was happening to the system during the slowdown period. (Disclaimer: My firm does this type of project in our consulting practice.)

Once the statistics are in hand, the analysis of the problem, whether application, database configuration, RDB (or some combination) can be done in a scientific manner.

For analysis, I would also consider whether it is worth constructing a test system that can reproduce the problem at will, so that changes can be tested. With the ongoing statistics monitoring mentioned earlier, there is also a way of objectively evaluating the changes.

- Bob Gezelter, http://www.rlgsc.com
