Operating System - OpenVMS
Diagnosing a performance bottleneck in BACKUP/LIST

 
Trusted Contributor

Diagnosing a performance bottleneck in BACKUP/LIST

Hi,

This one has got me very puzzled.

I've inherited responsibility for an AlphaServer with this configuration:

AlphaServer DS20 500MHz (single CPU)
1GB RAM
Mylex DAC960 backplane RAID
TZ88 tape
OpenVMS V7.2-1

The system volume is a RAID-1 set containing two 9GB drives. There are four other logical volumes held on a RAID-5 set on the same controller. The disks are spread over three SCSI busses.

The problem I've been asked to investigate is why the evening tape backup takes so long.

I've determined that the backup to tape takes 4-5 hours, which is acceptable. The job then uses this command to create a listing of the tape contents:

$ backup/list=$1$dra1:[kits.backup_list]backup_list.lis tape:*.*

and that command takes up to eight hours to run!

I've used MONITOR to examine I/O and found this:

I/O Request Queue Length          CUR       AVE       MIN       MAX

$1$DRA1: (ORFF) ORFF_1        8559.00   8557.52   8559.00   8559.00

which is very suspicious, to say the least. Not only is the figure very high, but the CUR, MIN and MAX values are identical while the AVE is lower, and none of them ever change. And if I use SDA to examine the device, it says the I/O request queue is empty.

The MONITOR display for disk I/O rate is much more sensible; the rate varies from 0 to 100 or so for this disk (and average is under 20); the total AVE across all disks is under 50.

So I'm at a loss to understand why it takes twice as long to read the tape as it did to write it; the system is not heavily loaded by user activity.

What else can I look at here? I know the disks are badly fragmented but I wouldn't have thought that would make a big difference to the time it takes to write a saveset listing.

Thanks,
Jeremy Begg

26 REPLIES
Trusted Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Hi.

In this case, listing the tape saveset should not take 8 hours, in my view. How old is the tape? Are you rewinding the tape before listing the saveset? If so, what is the command, and how long does the rewind alone take? Is the saveset written at the beginning of the tape or at the end? What is the capacity of the tape?

Regards,
Ketan
Trusted Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Hi,

One more thing, outside the scope of BACKUP: how long does it take to mount the tape and do a DIRECTORY on it?

Regards,
Ketan
Trusted Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Hi Ketan,

The procedure uses multiple BACKUP commands to write the tape then uses DISMOUNT/NOUNLOAD followed by MOUNT/FOREIGN to rewind the tape. Using SET MAGTAPE/REWIND might be slightly faster but not enough to be significant. The backup listing's file creation time is within a few minutes of the last disk backup command completing.

I'll try to organise a DIR listing of the tape to see how long that takes. However it might have to wait until the weekend because if it takes more than a few hours it will interfere with the next backup job.

Thanks,
Jeremy Begg
Respected Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Jeremy, let's talk configuration. How is the TZ88 connected to the system? Does it share a shelf with the disk drives, or does it have its own SCSI controller? Does it generate any errors, either during the BACKUP or during the pass to list the tape? Is the listing done from the same account that writes the tape? What are the account quotas?

Have you observed the tape during the listing process? Does it seem like there are a lot of pauses where the drive does nothing? Have you checked the TZ88 while listing with $ SHOW DEVICE and $ SHOW DEV/FULL?

Try mounting the tape and performing the BACKUP with the following qualifiers added to both commands (if they're not already present):

$ mount/for/media_format=compression/cache=tape_data

$ backup/blah/blah input: output:/media=compression/block=61440

The "Schooner" (DLT) drives work best when you write with a blocking factor that is a multiple of 4096. So not only can you get a scooch more data on the tape with the larger blocking factor, you also write more efficiently when using a multiple of 4096. I would expect it to read better too.
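As a quick back-of-the-envelope check (plain arithmetic, nothing VMS-specific), 61440 is exactly the largest multiple of 4096 that fits under the 65535-byte maximum you can give /BLOCK:

```python
# Largest BACKUP block size that is a whole multiple of the
# DLT-friendly 4096-byte unit, under the 65535-byte /BLOCK limit.
DLT_UNIT = 4096
MAX_BLOCK = 65535

best_block = (MAX_BLOCK // DLT_UNIT) * DLT_UNIT
print(best_block)  # 61440, i.e. 15 * 4096
```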

Just some semi-random thoughts that might help. On that OpenVMS version you're also somewhat dependent on the account quotas for BACKUP performance, so those might be giving you grief too, although I'll grant that shouldn't make much difference when listing multiple savesets. Do you have a "special" BACKUP account, or are you just using something set up for general use?

bob
Honored Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Jeremy,

check the fragmentation of the listing file (DUMP/HEADER/BLOCK=COUNT:0) and maybe use MONITOR FILE,FCP to see how active the XQP is.

How big is the listing file ?

Try using SET RMS/EXTEND=65535 to increase the default extend size, in case fragmentation is the problem.

Volker.
Trusted Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Hi,

The tape is on its own SCSI controller. (You can't put a tape drive on a DAC960, AFAIK.) There's only one tape error and it's been '1' for several days.

I am several thousand KM from the tape drive so no chance of observing it unfortunately.

The account used for backup is the SYSTEM account and its quotas seem pretty generous. Like I said, the time taken to write the backup is acceptable -- it's the time taken to list the backup tape which is the problem.

The backup procedure was set to use /BLOCK=65535 which resulted in the backup tapes being written with a block size of 65024 (the value I would have put, had I written this procedure). I'll try /BLOCK=61440 tonight and see what happens.
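Incidentally, the 65024 figure is consistent with BACKUP rounding the requested size down to a whole number of 512-byte units (my assumption about the rounding, but the arithmetic matches the observed block size):

```python
# Round the requested /BLOCK=65535 down to a whole number of
# 512-byte units -- consistent with the 65024-byte blocks observed.
requested = 65535
rounded = (requested // 512) * 512
print(rounded)  # 65024, i.e. 127 * 512
```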

The listing file is 28436 blocks and has hundreds of extents so the disks are badly fragmented (or at least that one is) but I would have expected that to impact the write time more than the listing time. I tried setting the default RMS extension to 2000 last night but it didn't make any difference (the listing file still has hundreds of extents).

Regards,
Jeremy Begg
Honored Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Jeremy,

to rule out a performance bottleneck during READING the tape, try the following:

$ BACKUP/LIST=NLA0: tape:*.*

BACKUP will still have to read and process all the blocks from the tape, and it will still do the I/Os to the 'listing file', only now they complete in zero time.

Regarding fragmentation: I just cut the time for a backup to a saveset on disk in half simply by using SET RMS/EXTEND=65535 instead of the default. So don't underestimate the effect of writing to a fragmented disk/file. This may also be preventing the tape from streaming!

Volker.
Trusted Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Hi,

Does the error count of the tape drive increase when any operation or BACKUP is performed on the tape device? Please check the status of the tape drive with SHOW DEVICE while listing the saveset. Are there any hardware-related errors or events logged in ERRLOG.SYS while listing the saveset on tape?

Regards,
Ketan
Honored Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

You've got to think BACKUP is failing to keep the tape streaming. That'll kill performance. But why would that happen? Because it didn't turn around quickly enough to issue the next read. But why? Busy with CPU work? Check T4. Waiting for listing-file I/O? Unlikely; that's probably done with RMS $PUT to a file with WBH (write behind), which is asynchronous.

I should try it when I get a moment. I might even start with Jur's MDdriver first, just to see (with Trace!) the kinds of I/Os happening.

Rather than look at queue depth, I'd sooner see I/Os per second. But why not have it all and analyze T4 output during the listing window?
Be sure to have T4 running before the next try!

I would also be interested in a few samples of SDA PROCIO output (under ANALYZE/SYSTEM) for the list process, just to check.
While there I would use SDA> SHOW PROC/CHAN and SHOW PROC/RMS=(FAB,RAB,BDBSUM) to see how the I/O is being done.

>>> The listing file is 28436 blocks and has hundreds of extents so the disks are badly fragmented

That's not good, but it doesn't explain the slowdown.
8 hours is 28,800 seconds, and even at 1,000 extents that is one extent roughly every 29 seconds. I think your system can handle that.
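Putting numbers on that (plain arithmetic, assuming the listing file really does end up with about a thousand extents):

```python
# If the whole 8-hour run involved ~1000 extent allocations,
# how often would the file system be asked to extend the file?
run_seconds = 8 * 3600            # 28800 seconds
extents = 1000                    # assumed extent count
interval = run_seconds / extents
print(interval)                   # 28.8 seconds between extends
```

At one extend every half-minute, extent allocation alone clearly cannot account for the 8-hour runtime.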
Still, might as well try output to NL: or, better still, to a RAM disk:

$ MCR SYSMAN IO CONNECT MDA1/NOADAPTER/DRIVER=SYS$MDDRIVER
$ INIT /SIZE=200000 MDA1 RAM ! 100 MB
$ MOUN /SYST MDA1 RAM RAM
$ CREA /DIRE RAM:[TEMP]/PROT=(WORLD:RWED)


>> I tried setting the default RMS extension to 2000 last night but it didn't make any difference (the listing file still has hundreds of extents).

The larger default extend reduces the number of times the file system is asked to grow the file, and increases the chance it will grab a big chunk, but if there are no competing allocations then ultimately the same free space will satisfy the requests.
You should be able to watch this with DIR/SIZE=ALL on the listing file. With the large extent the allocation should 'jump' only a few times and stay put most of the time.

hth,
Hein