Operating System - OpenVMS

Diagnosing a performance bottleneck in BACKUP/LIST

 
Hoff
Honored Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

My laptop runs faster than this AlphaServer, and I'm backing up and shuffling disks and files over WiFi faster, too; probably eight minutes to transfer a four-gigabyte file via WiFi between the laptop and a server.

As for the vintage of this gear, I paid US$1300 for a used AlphaServer DS20e dual several years ago (with a bus full of SCSI controllers), and less than half that for an Itanium box, and I've received DLTs faster than this one - for free.

Just about everything you've listed here is a dozen years old.

The Mylex is slow, RAID-5 is slow (and known to expose itself to catastrophic double spindle failures during its recovery processing), the SCSI bus here is slow, the 9 GB drives are slow, the version of VMS is slow, and, well, you're in a target-rich environment for slow.

In terms of raw performance for archival processing, BACKUP (with proper process quotas, etc.) was getting 90% of the theoretical bandwidth of the slowest component between the source and the target.

Yes, "old and slow" is a theme in this reply.

Here's the various HP process quota recommendations for BACKUP usernames, and it's typically the proportions that are key, not the absolute values of any of the quotas:

http://labs.hoffmanlabs.com/node/49

Prior to its wholesale replacement with newer gear, I'd verify the quotas, and would also ensure compression/compaction is enabled, and I'd also try enabling fast skip on the tape drive ddcu: device:

$ set magtape/fast_skip=always ddcu:

And (failing a wholesale server swap) I'd look to get to faster SCSI devices all around.
Jan van den Ende
Honored Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Jeremy,

This may or may not be of help here, but is there any reason NOT to combine the BACKUP itself with the generation of the listing?

$ BACKUP /LIST=

... only make sure is NOT on !!!! :-)

fwiw

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
GuentherF
Trusted Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

My guess is that the 4-5 hour backup is done from the system disk with RAID-1 and the listing is going to a data disk with RAID-5.

If so, there is a write penalty with RAID-5. I don't know how RAID-5 is implemented on this controller; it might not have enough onboard cache memory to compensate for the write penalty.

Also, BACKUP code uses the default parameters for RMS buffer size and number of buffers. Try with:

$ SET RMS/BLOCK_COUNT=127/BUFFER_COUNT=127

...before the BACKUP/LIST command. This would definitely make better use of the RAID-5 on the output disk, because BACKUP code uses write-behind (asynchronous) and would now use 127 buffers instead of the default 2, and a buffer size (I/O size to disk) of 127 blocks instead of the default 32 (or whatever DCL SHOW RMS shows).
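To put rough numbers on that suggestion, here is a small Python sketch of the arithmetic. It assumes the usual 512-byte OpenVMS disk block and takes the defaults (2 buffers of 32 blocks) from the post above; the function name is made up for illustration and none of this is BACKUP's own code.

```python
BLOCK_BYTES = 512  # assumed OpenVMS disk block size


def rms_footprint(block_count, buffer_count):
    """Return (bytes per disk I/O, total bytes of RMS buffer space)."""
    io_bytes = block_count * BLOCK_BYTES
    return io_bytes, io_bytes * buffer_count


# Defaults as described in the post: 2 buffers of 32 blocks each.
default_io, default_total = rms_footprint(32, 2)
# Suggested setting: 127 buffers of 127 blocks each.
tuned_io, tuned_total = rms_footprint(127, 127)

print(f"default: {default_io} bytes per I/O, {default_total} bytes buffered")
print(f"tuned:   {tuned_io} bytes per I/O, {tuned_total} bytes buffered")
```

The tuned setting makes each write roughly four times larger and allows about 8 MB of output to queue up behind the tape reads, compared with 32 KB at the defaults.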

/Guenther
Art Wiens
Respected Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Just swap the works with a Proliant running CharonAXP and move on. There's no valid reason to keep 9GB drives in a "production" environment anymore (at least just emulate them if they _really_ need to be 9GB).

I'm sure the ROI would be about a year if you're actually paying maintenance on this stuff.

Cheers,
Art
Jeremy Begg
Trusted Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Like I said, I inherited responsibility for this stuff and I know it's antiquated, but it's what I've got. There will be no money for any kind of hardware or software upgrade. I'm told the system only has to run until the end of the year, but if it looks like being longer than that I'll see if I can talk them into a better tape drive.

Last night's backup wrote the tape but then failed during the listing phase with a parity error so I'm going to recommend that if they've got any money at all the new owners buy some new tapes and clean the drive too. (This was the first time I've seen it log errors during backup.)

I'll try increasing the RMS block and multibuffer counts to see if that helps. I'll also try forcing the tape drive to fast_skip=always, but I'm not sure how that will really help because the BACKUP/LIST command needs to read the entire file before moving to the next anyway.

Jan, I had thought about adding /LIST to the backup commands which write the tape and it might be worth trying. The risk is that it will blow out the time required to write the tape, which is unacceptable. Another job for the weekend!

More news later ...

Jeremy Begg
Robert Gezelter
Honored Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Jeremy,

Two thoughts:

- I agree with Volker's suggestion to run a test with the listing going to NLA0:. That will remove all fragmentation and file extension processing from the equation.

- In a somewhat related experiment, I would increase the RMS buffering and blocking factors significantly. This may require resource quota expansion. For experimental purposes, I might very well try very large increments.

Expanding the quotas will reduce the impact of XQP operations acting as blocks on tape processing. The tape will likely process at speed, with the output results backing up in buffers. Note that I am not in my office at the moment, so my ability to experiment on my systems is limited. The preceding presumes that BACKUP is using normal RMS to process the listing file (I KNOW that it uses normal RMS to write/read the save set itself).

- Bob Gezelter, http://www.rlgsc.com
Shriniketan Bhagwat
Trusted Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Hi,

>> Last night's backup wrote the tape but then failed during the listing phase with a parity error

Parity errors are usually caused by tape errors, or by problems with the tape drive or related I/O hardware. Please check the online help on parity: $ help/message parity. Sometimes cleaning the tape drive resolves such issues.

Regards,
Ketan
GuentherF
Trusted Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Ketan,

a parity error on a SCSI tape drive is typically related to a SCSI bus problem: either a missing terminator, an illegal bus/cable length, or a bad cable/connector.

Problems with media are reported as DRVERR. Only the errorlog entry would show the real SCSI error.

The SCSI tape driver (MKDRIVER) has a long mapping table to squeeze the tons of SCSI errors into a few OpenVMS SS$_... status values.

/Guenther
Hein van den Heuvel
Honored Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST


I disagree with any and all suggestions about NL:, RMS buffering and fragmentation, and to RAID or not to RAID for the list file (including my own RAMdisk suggestion).

Folks, the raw numbers just are not there.

>>> listing file is 28436 blocks and has hundreds of extents
>>> that command takes up to eight hours to run!

So even if BACKUP did an I/O and an extend for each block, it would have a full second to do so each time.
Any disk, any fragmentation can do this 10 times per second, if not 200 times.

I tried this on my PC with FreeAXP and the LM driver as tape. Relevant log attached.

You can see how backup does NOT use RMS to make the tape-save set

You can see how backup uses basic RMS $PUT with 2 default (32 block=16KB=0x4000) buffers and write behind.
You can see normal IO counts: 1 IO for 32 blocks of list file... so we are talking less than 1,000 IOs in 8 hours.
Waddayathink... could that be a bottleneck? NO.
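For anyone who wants to check the raw numbers, here is the arithmetic as a short Python sketch. The 28436-block listing size and the eight-hour run time are the figures quoted above; the 32-block I/O size is the default buffer size seen in the log. The variable names are just for illustration.

```python
import math

LISTING_BLOCKS = 28436      # listing file size quoted in the thread
ELAPSED_SECONDS = 8 * 3600  # "up to eight hours"
BLOCKS_PER_IO = 32          # default RMS buffer size seen in the log

# Worst case: one I/O (and one extend) per listing block.
seconds_per_block = ELAPSED_SECONDS / LISTING_BLOCKS

# Observed case: one I/O per 32-block buffer.
listing_ios = math.ceil(LISTING_BLOCKS / BLOCKS_PER_IO)
seconds_per_io = ELAPSED_SECONDS / listing_ios

print(f"{seconds_per_block:.2f} s available per listing block")
print(f"{listing_ios} listing-file I/Os in total")
print(f"{seconds_per_io:.1f} s between listing-file I/Os on average")
```

That works out to about one second per block in the worst case, and 889 listing-file writes spaced roughly half a minute apart in the observed case.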

Now running this emulated AS 400, I _was_ using 100% CPU time.

Jeremy... was there significant CPU time, enough to stop the tape from streaming?

The average could be well under 100%, but if each tape block took more time to process than to read, then BACKUP might not post the next I/O fast enough? Maybe it does not double-buffer on list (versus restore).

Now that we heard about hardware errors... maybe there was some error correction / retry going on taking 'hours' ?

fwiw,
Hein

GuentherF
Trusted Contributor

Re: Diagnosing a performance bottleneck in BACKUP/LIST

Hein, you pointed to an obvious mystery: ca. 900 I/Os to disk in eight hours!?

No matter what OpenVMS and BACKUP code would be doing there is a severe performance bottleneck. Question is: Where? Is it the tape drive read side or the disk write side?

A listing to the NL device would at least tell whether or not it is the disk side. And then go from there.
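The comparison could be written down as a small Python sketch like the one below. The function name, the 10% tolerance and the sample timings are all hypothetical, chosen only to illustrate the reasoning; nobody in the thread has posted these measurements.

```python
def likely_bottleneck(disk_hours, null_hours, tolerance=0.10):
    """Compare a listing written to disk with one written to the NL device.

    Returns "disk write side" if dropping the disk output made the run
    noticeably faster, otherwise "not the disk side".
    """
    if disk_hours <= 0 or null_hours <= 0:
        raise ValueError("elapsed times must be positive")
    # Fraction of the original run time saved by discarding the listing.
    saved = (disk_hours - null_hours) / disk_hours
    return "disk write side" if saved > tolerance else "not the disk side"


# Hypothetical timings, not measurements from the thread.
print(likely_bottleneck(disk_hours=8.0, null_hours=7.8))
print(likely_bottleneck(disk_hours=8.0, null_hours=1.0))
```

If the run to the NL device takes about as long as the run to disk, the disk is ruled out and attention moves to the tape read side, the SCSI bus, or CPU time.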

/Guenther