04-04-2011 05:40 PM
Diagnosing a performance bottleneck in BACKUP/LIST
This one has got me very puzzled.
I've inherited responsibility for an AlphaServer with this configuration:
AlphaServer DS20 500MHz (single CPU)
1GB RAM
Mylex DAC960 backplane RAID
TZ88 tape
OpenVMS V7.2-1
The system volume is a RAID-1 set containing two 9GB drives. There are four other logical volumes held on a RAID-5 set on the same controller. The disks are spread over three SCSI busses.
The problem I've been asked to investigate is why the evening tape backup takes so long.
I've determined that the backup to tape takes 4-5 hours, which is acceptable. The job then uses this command to create a listing of the tape contents:
$ backup/list=$1$dra1:[kits.backup_list]backup_list.lis tape:*.*
and that command takes up to eight hours to run!
I've used MONITOR to examine I/O and found this:
I/O Request Queue Length CUR AVE MIN MAX
$1$DRA1: (ORFF) ORFF_1 8559.00 8557.52 8559.00 8559.00
which is very suspicious, to say the least. Not only is the figure very high, but the CUR, MIN and MAX values are identical while the AVE is slightly lower, and none of them ever change. If I use SDA to examine the device it says the I/O request queue is empty.
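For reference, the checks above correspond roughly to the following (a sketch; the device name comes from the MONITOR display):
$ MONITOR DISK/ITEM=QUEUE_LENGTH    ! per-disk I/O request queue length
$ ANALYZE/SYSTEM
SDA> SHOW DEVICE DRA1               ! SDA reports the I/O request queue as empty
SDA> EXIT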
The MONITOR display for disk I/O rate is much more sensible; the rate varies from 0 to 100 or so for this disk (and average is under 20); the total AVE across all disks is under 50.
So I'm at a loss to understand why it takes twice as long to read the tape as it did to write it; the system is not heavily loaded by user activity.
What else can I look at here? I know the disks are badly fragmented but I wouldn't have thought that would make a big difference to the time it takes to write a saveset listing.
Thanks,
Jeremy Begg
04-04-2011 07:22 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
In this case, listing the tape saveset should not take 8 hours in my view. How old is the tape? Are you rewinding the tape before listing the saveset? If so, what is the command and how long does it take just to rewind the tape? Is the saveset written at the beginning of the tape or the end? What is the capacity of the tape?
Regards,
Ketan
04-04-2011 07:33 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
One more thing, and this is outside the scope of BACKUP: how much time does it take to mount the tape and do a DIR on it?
Regards,
Ketan
04-04-2011 08:09 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
The procedure uses multiple BACKUP commands to write the tape, then uses DISMOUNT/NOUNLOAD followed by MOUNT/FOREIGN to rewind it. Using SET MAGTAPE/REWIND might be slightly faster, but not enough to be significant. The backup listing file's creation time is within a few minutes of the last disk backup command completing.
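For reference, the rewind sequence looks roughly like this (a sketch; the drive name MKA500: is hypothetical, as the actual device name isn't shown in this thread):
$ DISMOUNT/NOUNLOAD MKA500:         ! release the tape but keep it loaded
$ MOUNT/FOREIGN MKA500:             ! remounting repositions the tape to BOT
$ SET MAGTAPE/REWIND MKA500:        ! the alternative: an explicit rewind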
I'll try to organise a DIR listing of the tape to see how long that takes. However it might have to wait until the weekend because if it takes more than a few hours it will interfere with the next backup job.
Thanks,
Jeremy Begg
04-04-2011 09:22 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
Have you observed the tape during the listing process? Does it seem like there are a lot of pauses where the drive does nothing? Have you checked the TZ88 while listing with $ SHOW DEVICE and $ SHOW DEV/FULL?
Try mounting the tape and performing the BACKUP with the following qualifiers added to both (if they're not already present):
$ mount/foreign/media_format=compaction/cache=tape_data
$ backup/blah/blah input: output:/media_format=compaction/block=61440
The "Schooner" (DLT) drives work their best when you write using a blocking factor that is a multiple of 4096. So not only can you get a scooch more data on there with the larger blocking factor you write more efficiently when using a multiple of 4096. I would expect it to read better too.
Just some semi random thoughts that might help. You're also somewhat dependent on the account quotas for BACKUP performance on that OpenVMS version so those might be giving you grief too although I'll grant that it shouldn't make that much difference when listing multiple savesets. Do you have a "special" BACKUP account or are you just using something setup for general use?
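For what it's worth, the relevant quotas can be inspected with AUTHORIZE (a sketch; run from a suitably privileged account, and substitute the actual backup username):
$ SET DEFAULT SYS$SYSTEM
$ RUN AUTHORIZE
UAF> SHOW SYSTEM                    ! check BIOLM, DIOLM, ASTLM, WSQUOTA, etc.
UAF> EXIT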
bob
04-04-2011 10:09 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
Check the fragmentation of the listing file (DUMP/HEAD/BL=COUNT=0) and maybe use MONI FILE,FCP to see how active the XQP is.
How big is the listing file?
If fragmentation could be a problem, try SET RMS/EXTEND=65535 to increase the default extend size.
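Spelled out in full, that would be something like this (the listing file name is taken from the BACKUP command earlier in the thread):
$ DUMP/HEADER/BLOCKS=COUNT=0 $1$DRA1:[KITS.BACKUP_LIST]BACKUP_LIST.LIS
$ MONITOR FILE_SYSTEM_CACHE,FCP     ! watch XQP and cache activity
$ SET RMS_DEFAULT/EXTEND_QUANTITY=65535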
Volker.
04-04-2011 10:46 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
The tape is on its own SCSI controller. (You can't put a tape drive on a DAC960, AFAIK.) There's only one tape error and it's been '1' for several days.
I am several thousand KM from the tape drive so no chance of observing it unfortunately.
The account used for backup is the SYSTEM account and its quotas seem pretty generous. Like I said, the time taken to write the backup is acceptable -- it's the time taken to list the backup tape which is the problem.
The backup procedure was set to use /BLOCK=65535, which resulted in the backup tapes being written with a block size of 65024 (the value I would have used, had I written this procedure). I'll try /BLOCK=61440 tonight and see what happens.
The listing file is 28436 blocks and has hundreds of extents so the disks are badly fragmented (or at least that one is) but I would have expected that to impact the write time more than the listing time. I tried setting the default RMS extension to 2000 last night but it didn't make any difference (the listing file still has hundreds of extents).
Regards,
Jeremy Begg
04-04-2011 10:56 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
To rule out a performance bottleneck while READING the tape, try the following:
$ BACKUP/LIST=NLA0: tape:*.*
Backup will still have to read and process all the blocks from the tape, and it will still do the IOs to the 'listing file', only now they complete in ZERO time.
Regarding fragmentation: I just cut the time of a backup to a disk save-set in half simply by using SET RMS/EXTEND=65535 instead of the default. So don't underestimate the effect of writing to a fragmented disk/file. It may also be preventing the tape from streaming!
Volker.
04-04-2011 11:21 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
Does the error count of the tape drive increase when any operation or BACKUP is performed on the tape device? Please check the status of the tape drive with the SHOW DEV command while listing the saveset. Are there any hardware-related errors or events logged in ERRLOG.SYS while listing the saveset on tape?
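A sketch of those checks (the drive name MKA500: is hypothetical, and on Alpha, DECevent may be needed in place of ANALYZE/ERROR_LOG):
$ SHOW DEVICE MKA500:               ! watch the error count column during the listing
$ ANALYZE/ERROR_LOG/SINCE=TODAY     ! look for tape-related entries in ERRLOG.SYS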
Regards,
Ketan
04-05-2011 04:37 AM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
I should try this when I get a moment. I might even start with Jur's MDdriver first, just to see (and trace!) the kind of IOs happening.
Rather than look at queue depth, I'd sooner see IO/sec. But why not have it all and analyze T4 output during the listing window?
Be sure to have T4 running before any next try!
I would also be interested in a few samples of ANALYZE/SYSTEM PROCIO$SDA output for the list process. Just to check.
While there I would use SDA> SHOW PROC/CHAN and SHOW PROC/RMS=(FAB,RAB,BDBSUM) to see how the IO is done.
>>> The listing file is 28436 blocks and has hundreds of extents so the disks are badly fragmented
That's not good, but it does not explain the problem.
8 hours = 28,800 seconds, and even at 1,000 extents that is one extent every 29 seconds. I think your system can handle that.
Still, might as well try output to NL: or better still to a RAM drive?
$ MCR SYSMAN IO CONNECT MDA1/NOADAPTER/DRIVER=SYS$MDDRIVER
$ INITIALIZE/SIZE=200000 MDA1 RAM        ! 200,000 blocks = ~100 MB
$ MOUNT/SYSTEM MDA1 RAM RAM
$ CREATE/DIRECTORY RAM:[TEMP]/PROTECTION=(WORLD:RWED)
>> I tried setting the default RMS extension to 2000 last night but it didn't make any difference (the listing file still has hundreds of extents).
The larger default extend will reduce the number of times the system is asked to grow the file and increase the chance it grabs a big chunk, but if there are no competing allocations, then ultimately the same free space will satisfy the request.
You should be able to witness this with DIR/SIZE=ALL on the list file. With the large extent the allocated size should 'jump' only a few times and stay put most of the time.
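That is (file name as given earlier in the thread):
$ DIRECTORY/SIZE=ALL $1$DRA1:[KITS.BACKUP_LIST]BACKUP_LIST.LIS   ! shows used and allocated blocks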
hth,
Hein
04-05-2011 06:00 AM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
As for the vintage of this gear, I paid US$1300 for a used AlphaServer DS20e dual several years ago (with a bus full of SCSI controllers), and less than half that for an Itanium box, and I've received DLTs faster than this one - for free.
Just about everything you've listed here is a dozen years old.
The Mylex is slow, RAID-5 is slow (and known to expose itself to catastrophic double spindle failures during its recovery processing), the SCSI bus here is slow, the 9 GB drives are slow, the version of VMS is slow, and, well, you're in a target-rich environment for slow.
In terms of raw performance for archival processing, BACKUP (with proper process quotas, etc) was getting 90% of the theoretical bandwidth of the slowest component between the source and the destination.
Yes, "old and slow" is a theme in this reply.
Here are HP's various process quota recommendations for BACKUP usernames; it's typically the proportions that are key, not the absolute values of any of the quotas:
http://labs.hoffmanlabs.com/node/49
Prior to a wholesale replacement with newer gear, I'd verify the quotas, ensure compression/compaction is enabled, and try enabling fast skip on the tape drive (ddcu: stands in for the device name):
$ set magtape/fast_skip=always ddcu:
And (failing a wholesale server swap) I'd look to get to faster SCSI devices all around.
04-05-2011 07:33 AM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
May or may not be of help here, but is there any reason NOT to combine the BACKUP itself with the generation of the listing?
$ BACKUP
... only make sure
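(The command above was truncated by the forum. A hedged sketch of what a combined save-and-list might look like, with the input volume and save-set name invented for illustration; the listing file name is the one from earlier in the thread:)
$ BACKUP/LIST=$1$dra1:[kits.backup_list]backup_list.lis -
        $1$dra2:[000000...] tape:daily.bck/SAVE_SET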
fwiw
Proost.
Have one on me.
jpe
04-05-2011 08:43 AM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
If the listing file resides on the RAID-5 set, there is a write penalty with RAID-5. I don't know how RAID-5 is implemented on this controller; it might not have enough onboard cache memory to compensate for the write penalty.
Also, BACKUP code uses the default parameters for RMS buffer size and number of buffers. Try with:
$ SET RMS/BLOCK_COUNT=127/BUFFER_COUNT=127
...before the BACKUP/LIST command. This should make much better use of the RAID-5 on the output disk, because the BACKUP code uses write-behind (asynchronous) RMS and would now use 127 buffers instead of the default 2, with a buffer size (the I/O size to disk) of 127 blocks instead of the default 32 (or whatever SHOW RMS_DEFAULT shows).
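The settings can be confirmed before and after with a plain:
$ SHOW RMS_DEFAULT
Note that SET RMS_DEFAULT applies per process unless /SYSTEM is added.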
/Guenther
04-05-2011 05:40 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
I'm sure the ROI would be about a year if you're actually paying maintenance on this stuff.
Cheers,
Art
04-05-2011 06:22 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
Last night's backup wrote the tape but then failed during the listing phase with a parity error so I'm going to recommend that if they've got any money at all the new owners buy some new tapes and clean the drive too. (This was the first time I've seen it log errors during backup.)
I'll try increasing the RMS block and multibuffer counts to see if that helps. I'll also try forcing the tape drive to FAST_SKIP=ALWAYS, but I'm not sure how much that will really help, because the BACKUP/LIST command needs to read each saveset in full before moving to the next anyway.
Jan, I had thought about adding /LIST to the backup commands which write the tape and it might be worth trying. The risk is that it will blow out the time required to write the tape, which is unacceptable. Another job for the weekend!
More news later ...
Jeremy Begg
04-05-2011 07:03 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
Two thoughts:
- I agree with Volker's suggestion to run a test with the listing going to NLA0:. That will remove all fragmentation and file extension processing from the equation.
- In a somewhat related experiment, I would increase the RMS buffering and blocking factors significantly. This may require resource quota expansion. For experimental purposes, I might very well try very large increments.
Expanding the quotas will reduce the impact of XQP operations acting as blocks on tape processing. The tape will likely process at speed, with the output results backing up in buffers. Note that I am not in my office at the moment, so my ability to experiment on my systems is limited. The preceding presumes that BACKUP is using normal RMS to process the listing file (I KNOW that it uses normal RMS to write/read the save set itself).
- Bob Gezelter, http://www.rlgsc.com
04-05-2011 08:07 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
>> Last night's backup wrote the tape but then failed during the listing phase with a parity error
Parity errors are usually caused by tape errors, or by problems with the tape drive or related I/O hardware. Please check the online help on parity errors: $ HELP/MESSAGE PARITY. Sometimes cleaning the tape drive may resolve such issues.
Regards,
Ketan
04-05-2011 08:46 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
A parity error on a SCSI tape drive is typically related to a SCSI bus problem: either a missing terminator, an illegal bus/cable length, or a bad cable/connector.
Problems with media are reported as DRVERR. Only the errorlog entry would show the real SCSI error.
The SCSI tape driver (MKDRIVER) has a long mapping table to squeeze the tons of SCSI errors into a few OpenVMS SS$_... status values.
/Guenther
04-05-2011 09:42 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
I disagree with any and all suggestions about NL:, RMS buffering and fragmentation, and to-RAID-or-not-to-RAID for the list file (including my own RAMdisk suggestion).
Folks, the raw numbers just are not there.
>>> listing file is 28436 blocks and has hundreds of extents
>>> that command takes up to eight hours to run!
So even if backup did an IO, and an extend for each block then it would have a full second to do so each time.
Any disk, any fragmentation can do this 10 times per second, if not 200 times.
I tried it on my PC with FreeAXP and the LM driver as tape. Relevant log attached.
You can see how backup does NOT use RMS to write the tape save set.
You can see how backup uses basic RMS $PUT with the 2 default (32-block = 16KB = 0x4000) buffers and write-behind.
You can see normal IO counts: 1 IO per 32 blocks of the list file... so we are talking less than 1,000 IOs in 8 hours.
Waddayathink... could that be a bottleneck? NO.
Now, running this on an emulated AlphaServer 400, I _was_ using 100% CPU time.
Jeremy... was there significant CPU time, enough to stop the tape from streaming?
The average could be well under 100%, but if each tape block takes more time to process than to read, then backup might not post the next IO fast enough. Maybe it does not double-buffer on list (versus restore)?
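One way to answer the CPU question during the next run (a sketch using standard MONITOR classes):
$ MONITOR MODES                     ! kernel/exec/user/interrupt time split
$ MONITOR PROCESSES/TOPCPU          ! which process is consuming the CPU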
Now that we heard about hardware errors... maybe there was some error correction / retry going on taking 'hours' ?
fwiw,
Hein
04-06-2011 07:01 AM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
No matter what the OpenVMS and BACKUP code are doing, there is a severe performance bottleneck. The question is: where? Is it the tape-drive read side or the disk write side?
A listing to the NL device would at least tell whether or not it is the disk side. And then go from there.
/Guenther
04-06-2011 07:11 AM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
What do the disk characteristics look like?
Dan
04-06-2011 08:48 AM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
And we're on year 8 of the bold 3-5 year plan to replace all our VMS systems. They haven't accomplished much.
Always the same tough questions have to be asked of the business ... how much money is lost per hour/day/week when this 10+ year old production hardware goes south?
Cheers,
Art
p.s. don't forget to test an actual restore. A backup listing should give you some confidence that there's something valid written on that piece of plastic but ... ;-)
04-06-2011 06:05 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
So if I understand right, it is taking 4-5 hours to read data from disk and write it to tape, and double that time to read the tape back and write the listing to disk.
A similar backup style on one of my sites shows the backup time to be about double the listing time, so your times seem out.
Presumably then, it's an issue of either reading from the tape or writing to the disk.
If the former, maybe someone local can listen to the drive and tell whether it is constantly moving or going back and forth. It might be old, knackered tapes (although I'd expect an increase in the error count).
If the latter, maybe the Mylex has caching disabled, or is waiting for writes to complete to the disk before continuing? Any chance one of the R5 disks has failed? That would cause big overheads in reconstructing data from parity, and I'm not sure a failed R5 disk on a Mylex flags anything to VMS. Maybe set the RMS extent to 200k, if that is what it will use?
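One quick check that needs no physical access to the machine (SWXCR being the message prefix the Mylex/DAC960 controller uses in the operator log, as mentioned later in this thread):
$ SEARCH SYS$MANAGER:OPERATOR.LOG "SWXCR"   ! look for failed-drive messages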
[BTW, this customer wouldn't happen to be in Auckland, would it?]
Have fun,
PJ
Peejay
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If it can't be done with a VT220, who needs it?
04-06-2011 06:25 PM
Re: Diagnosing a performance bottleneck in BACKUP/LIST
Not surprised at all :-)
This system is located in Christchurch, it got moved from one building to another after the earthquake there rendered the first building unsafe. As it happens, the people in the "new" building are the ones who actually use the system.
I think your comment about a failed disk in the RAID5 set might be the solution. I've re-checked OPERATOR.LOG for SWXCR messages and yes, there is a failed drive. (I remember bringing this to their attention before the earthquake and thought it had been fixed, but now I've found out it wasn't. Some people don't deserve to have computers.)
Next step: get HP to replace the failed disk(s).
Thanks,
Jeremy