Operating System - OpenVMS

EdgarZamora_1
Respected Contributor

BACKUP over DECnet, file extensions, performance

I inherited a script that basically did a BACKUP over DECnet (i.e. BACKUP dev:[dir]*.* rnode::rdev:[rdir]saveset.bck/save; notice the use of a DECnet proxy). The files being backed up are mostly huge RMS indexed files. There are 6 directories that this script backs up (to different savesets), and it did so one at a time. Total elapsed time: about 3 hours.

I thought I could decrease that elapsed time, so I had this brilliant idea to just fire off six separate procedures simultaneously, since the network guy said we had great bandwidth. Total elapsed time: over 24 hours! After much digging I determined it was file fragmentation (all savesets were being written to the same remote disk, which had lots of free space and a free-space fragmentation of 0). The savesets were constantly being extended and competing with each other. DFU gave me a file fragmentation index of 643395904.000 (poor), whereas it was close to 0 previously.

I'm not ready to give up on running multiple backups, so (other than writing to separate remote disks) does anyone have any better suggestions? I saw a previous note... http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1248535 where Bob Gezelter recommends setting RMS/EXTEND on the remote node. I'm going to test that later on, but I'm not really thrilled about having to add that SET RMS/EXTEND command in the LOGIN.COM (for network jobs), because there could be other network jobs for the account that could be doing other stuff.

Any helpful comments/ideas are most welcome.
Steven Schweda
Honored Contributor
Solution

Re: BACKUP over DECnet, file extensions, performance

I don't see a way to pre-allocate a big save
set file, but perhaps it would make some
sense to create a big LD device on the
destination system for each job, and write
the output save set to that. At least the
jobs wouldn't be fighting over allocations on
the same file system. Also, MOUNT /EXTENSION
(or SET VOLUME /EXTENSION) might let you set
a bigger default extension value on an LD
volume without bothering LOGIN.COM anywhere.

You would need to guess a good LD device
size ahead of time. Too small, and the job
fails. Too large, and you've tied up a bunch
of extra space, and you might need to copy
the save set off the LD to somewhere else to
be able to recover it.
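
For illustration, the destination-side setup might look
something like this (device names and the size guess are
illustrative; see HELP LD for the exact syntax):

$ LD CREATE RDEV:[RDIR]JOB1.DSK /SIZE=2000000   ! guess: a bit bigger than the save set
$ LD CONNECT RDEV:[RDIR]JOB1.DSK LDA1:
$ INITIALIZE LDA1: JOB1
$ MOUNT /SYSTEM LDA1: JOB1 /EXTENSION=65535     ! big default extension, no LOGIN.COM change

The source node then writes its save set to
RNODE::LDA1:[000000]JOB1.BCK, one LD device per stream.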

> Any helpful comments/ideas are most
> welcome.

If you insist on "helpful", it gets harder.
Hein van den Heuvel
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

>> but I'm not really thrilled about having to add that set rms/extend command in the login.com

Don't be afraid to SET FILE/EXTEN to the max (65K), or at least set it to several thousand, which is likely to be 100 times better than it is now.

Most/many operations that create new files will truncate the unused extent when done.

But admittedly, if you later have hundreds of little files coming in by FTP, then you want to select a middle-of-the-road value (like 1000?).

Maybe you can use the time of day as a clue for how to set the extent, or the originating node of the connection?

You probably also want to jack up SET RMS/NETWORK_BLOCK_COUNT.
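
Something like the following (a sketch; I believe 127 is the maximum, but check HELP SET RMS_DEFAULT):

$ SET RMS_DEFAULT /NETWORK_BLOCK_COUNT=127   ! bigger DAP transfer buffer for remote file access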

>> The files being backed up are mostly huge RMS indexed files.

You may want to question the value-add of using BACKUP for a few huge files.
You may be better off just using COPY, or even CONVERTing to a remote sequential file, or PULLING the file instead of pushing, to control the output better (pre-allocate).
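
For instance, a pull-style CONVERT run on the receiving node might look like this (node, device, and FDL file names are illustrative):

$ CONVERT /FDL=BIG.FDL PROD::DKA100:[DATA]HUGE.IDX DKA200:[STAGE]HUGE.SEQ

where BIG.FDL describes a sequential file with a large ALLOCATION and EXTENSION, so the output is pre-allocated in one go.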

Finally, going from 1 job to 2 concurrent ones might just give you ample improvement over what you had, whilst avoiding the worst of the contention.

Hope this helps some
Hein.
Steven Schweda
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

> > but I'm not really thrilled about having
> > to add that set rms/extend command in the
> > login.com

> Don't be afraid [...]

Or set up a new account with its own
LOGIN.COM, and the appropriate proxies to let
you use it. Then go wild.
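
For example, in AUTHORIZE (the account, UIC, and node
names here are hypothetical):

$ RUN SYS$SYSTEM:AUTHORIZE
UAF> COPY EDGAR BCKXFER /UIC=[200,42]
UAF> ADD/PROXY PROD::BCKXFER BCKXFER /DEFAULT
UAF> EXIT

Then give that account its own LOGIN.COM with the SET RMS
tweaks.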
Hoff
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Tweak the receiver DCL, or the FDL.

http://64.223.189.234/node/598
Robert Gezelter
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Edgar,

I recommend three things, using a SET RMS command conditioned on a NETWORK login (using F$MODE() to check):

- EXTEND=65535
- BUFFER=255 (Hein will disagree with me)
- BLOCK=127 (Hein will disagree with me)

These are the maximum numbers; you can experiment with lowering them based on observed performance. I presume that memory is not a problem. Paging and working set quotas may also need to be increased.
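
A minimal LOGIN.COM sketch of that idea (assuming the standard SET RMS_DEFAULT qualifier spellings; verify against HELP SET RMS_DEFAULT):

$ IF F$MODE() .EQS. "NETWORK"
$ THEN
$     SET RMS_DEFAULT /EXTEND_QUANTITY=65535 /BUFFER_COUNT=255 /BLOCK_COUNT=127
$ ENDIF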

As to the issue of other jobs, there are a variety of ways I could see conditionalizing things more precisely (one of which is cloning the account and using a special account for the BACKUP operations). One could also do some other tricks; I would have to check into the details.

And yes, I have seen impressive speedups by varying these settings, even when using DECnet remote file access within a node.

- Bob Gezelter, http://www.rlgsc.com
Jon Pinkley
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Edgar,

What bottleneck are you trying to avoid by having multiple streams?

If all the input directories are on the same device (not clear if they are or not), and there is a single output device that will hold all the save sets, you may be increasing the time due to increased head movement on the source and destination disks.

Large file extensions and network buffer tuning will help whether you have multiple streams or not.

I am not convinced that fragmentation is necessarily the cause; I would guess it is the frequent extensions, the multiple streams reducing the effectiveness of the extent caches, and the contention for the single output disk. By creating the LD devices you can eliminate the fragmentation, but the disk with the container files will still thrash as the heads seek from one container file to another.

Since you are backing up "mostly huge" files, if the backup job is the only thing accessing the drive, I would expect a single stream to be near optimal as far as getting the data off the disk. By multi-streaming, you may be able to get more than your fair share of network bandwidth, but unless you are competing with other network activity, the only advantage multi-streaming can provide is more buffering. And if you can increase the buffers available to the single stream, you may be able to get higher utilization of the network with a single stream.

Summary: multi-streaming is not always better than single streaming, especially when the streams are contending for a common resource.

Jon
it depends
Hein van den Heuvel
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

>> - BUFFER=255 (Hein will disagree with me)

Yes I do, but I should re-run my tests some day soon.

Still, as a mental exercise, how could anything more than a handful of buffers possibly help with sequential access?

IMHO, all that does is, firstly, set false expectations; secondly, eat some memory; and finally, increase risk a tiny bit.

Just imagine an infinitely fast network as well as source. Those buffers at full size will allow the receiver to fill up 16MB of memory. Then what? They still need to be written to the disk, and the connection has to stay up until that is done. So now you potentially launch 255 IOs. Is that going to help anything? Do you want to blow out of DIRIO? Did you want to change from a simple sequential write pattern to effectively random? Do you like spiking your controller cache and hurting other users instead of throttling?

Now nothing is infinitely fast.
In reality you will end up using 2, maybe 3 buffers. If that's the case, then why confuse the world by suggesting that hundreds of buffers will help?
Due to the transient nature of tests, anything will be hard to prove.
Maybe $SET PROC/SSLOG ?

If 255 buffers were used for real, then RMS would have to walk its Buffer Descriptor Blocks (BDBs) to find the right one to use. Those are linked in VBN order, so that would get harder and harder. In my prior tests, walking 255 BDBs actually took measurable time when done for each record added. It became slower than doing the IO!
Fortunately, your typical network IO is done in sequential-only (SQO) mode, and RMS will forget about the buffer right after the IO, as the application promised not to look back.

>> - BLOCK=127 (Hein will disagree with me)

I'm fine with that, although I tend to pick 124 'just in case'.

Some tests suggest that writing in multiples of 16 blocks will reduce XFC overhead a little and thus increase throughput, and it does. BLOCK=112 is the best you can do for that, but it is hard to argue with doing 10% fewer IOs.

Other tests suggest that keeping IOs aligned on 4- (16?-) block LBN boundaries will help the IO controllers (notably EVA).
But to accomplish that, the cluster size and buffer size must both be multiples of 4 (16?) blocks. My choice of 124 helps improve those odds, while not increasing the number of IOs too much.

fwiw...
I recently started experimenting with changing VCC_MAX_IOSIZE to 126.
This allows one to choose the simple SET FILE/RMS/BLO=127 as a tool to bypass the XFC for those places where you do not come back to the data soon.
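
If you want to try that, something like this (a sketch; VCC_MAX_IOSIZE is a SYSGEN parameter, and WRITE ACTIVE only works if it is dynamic, otherwise WRITE CURRENT and reboot):

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET VCC_MAX_IOSIZE 126
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT

And remember MODPARAMS.DAT, so AUTOGEN does not undo it.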

Cheers,
Hein.

Wim Van den Wyngaert
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

You can use T2T (task-to-task).

backup files node::"task=x"/sav/block=32256/group=0

and x.com on the other side

convert/fdl=sys$input disk_file.sav
xxx

where xxx is the FDL contents for your backup save set, with a big allocation and extend size.
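
For illustration, that inline FDL might look something like this (values are illustrative; the record SIZE should match the /BLOCK value):

FILE
        ORGANIZATION            sequential
        ALLOCATION              200000
        EXTENSION               65535
        CONTIGUOUS_BEST_TRY     yes
RECORD
        FORMAT                  fixed
        SIZE                    32256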

(Not tested; I use something like this for remote backup to tape.)
Wim
Robert Gezelter
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Hein,

With all due respect, I will not disagree with you that 255 buffers almost never get used. The key word in the preceding is ALMOST.

While I have not run them in a while, I did run some interesting timing tests in an environment where head movement was a given (e.g., a single-user disk environment where the application was doing one-for-one processing of a sizable sequential file). The number of buffers used, even for large block sizes, was impressive.

The preceding caused that relatively slow CPU, with comparable disks, to range from 10% utilization to saturation, with the only variable being the number of buffers and their size.

Admittedly, that environment did stabilize short of the maximum, but it was very sensitive to the performance of the different elements of the system. In that case, the speed match was between the disk and itself, allowing for fragmentation, window turns, and file extends.

Edgar has shared little with us on the precise configuration of the systems and network involved in this. Even with large extend sizes, BACKUP is impressive at generating a data stream. In my (admittedly limited) experiments using DECnet within a node (effectively an infinite speed network), I have certainly maxed out at more than a handful of buffers, although I do max out the CPU running BACKUP.

As is said with vehicle mileage stickers, "Your mileage may vary". If there is one thing I have learned in all of my years of performance tuning, for systems and applications, one must be prepared to be surprised by the unexpected.

- Bob Gezelter, http://www.rlgsc.com
Wim Van den Wyngaert
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Enclosed is a working example using T2T.

It backs up a hardcoded set. If you start it with P1=remote node and P2=100000, it will be 4 times as fast as the default backup (AS500, 7.3).

If you give P2=100, it will take 6 times as long. My default RMS extend is 32.

Wim
EdgarZamora_1
Respected Contributor

Re: BACKUP over DECnet, file extensions, performance

Good morning... I didn't expect to see so many responses, but thank you!

Steven, thanks for the LD suggestion. That did cross my mind as a possible solution. I will look into that. And yes I'm also considering using a dedicated account for doing this (sigh... another account I'll have to justify to the SOX auditors as to why it has privs).

Hoff, thanks for the link. Very interesting article. Wim, thanks for your example t2t. I will definitely look into using t2t.

Hein and Bob, thanks for your responses. In combination with Steven's suggestion of using a dedicated account, the SET RMS commands in LOGIN.COM are what I'll look into first (since that's really the easiest for me to set up and test right now). Bob, I didn't share much detail on the environment because I figured it was a pretty generic situation, but here's more info on the environment:

The RMS indexed files are production data being backed up to the development system (so the programmers can test against more recent data). When the savesets are done being copied over to the development system, another procedure restores them for use by the developers. This "refresh" exercise is done maybe once a month or so. The system environment is Alpha OpenVMS 8.3 with the latest patches; EMC Symmetrix storage for production, MSA1000 for development; standalone systems (no clustering). DECnet is being used to do this refresh because the TCP/IP networks between prod and dev are separated. The two systems involved are actually sitting in the same rack.

Jon, the whole situation arose because, like I said initially, I had this "brilliant" idea that I could decrease the elapsed time (about 3 hours for single stream) of the whole refresh process. Maybe I should stop trying to improve things? Yes, the input directories are on the same disk, and yes, there is a single output device (currently). So you may be right that the single stream is the optimal way to go (and yes, I was trying to steal more of the network pie). I agree with you that fragmentation is not necessarily the cause of the slowness, but I can tell you definitely that the previous savesets, when run one at a time, were contiguous, and the savesets I created this time had between 100K and 200K fragments each.

Interesting tidbit I saw on the destination system while the slowness was going on: the read ops rate (on the destination disk) was a thousand times (or more) higher than the write IO rate (no other activity going on on that disk except for the FAL processes writing the savesets). I don't have the screen history anymore, so I don't have the exact numbers.

Wim Van den Wyngaert
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

I hope Hein can add some explanation to my next test.

I added deferred write in the FDL of the convert and it was accepted by convert. But performance didn't improve.

Wim
GuentherF
Trusted Contributor

Re: BACKUP over DECnet, file extensions, performance

My guess is that the input device is slowing down the multiple-stream approach. I would do a simple test: run the backups to the NLA0: device, once as serial single streams and then multi-stream. It may show that the reading part is what slows things down in the multi-stream approach.

Having good file fragmentation on the output side is a performance plus. Also, small extend sizes (+/- 1,000) would help to keep the disk head close to the same location for all streams. Imagine all save set files pre-allocated and then filling each bottom-up. Imagine the disk head strokes necessary to do that.

Also, using /BUFFER=255 builds quite a disk I/O queue, most likely a whole set from stream A, then from stream B, etc. That means, e.g., stream B might have to wait for all of stream A's I/Os queued ahead of it to finish first.

Also, BACKUP does not use RMS but uses $QIO to read from the disk. The number of outstanding I/Os is controlled by the process's DIOLM (among others), replaced by /IO_LOAD since V8.3. Small numbers in the range of 5-10 typically yield better performance.
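
For example (a sketch assuming the V8.3 qualifier; check HELP BACKUP for details):

$ BACKUP /IO_LOAD=8 DEV:[DIR]*.* RNODE::RDEV:[RDIR]SAVESET.BCK /SAVE_SET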

Keep in mind: Bigger not always is better!

/Guenther
Hein van den Heuvel
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Side discussion(s)...

Wim> I added deferred write in the FDL of the convert and it was accepted by convert. But performance didn't improve.

Correct, as to be expected.

First, Deferred Write is ONLY applicable for SHARED access, where the RMS default is to write through any record changes. Setting DFW tells RMS not to update the disk for each record added, but only when another accessor jiggles the lock.
For the purpose demonstrated there is no sharing, and RMS will always defer until the buffer is full, at which point WRITE BEHIND (WBH) may or may not get activated.

Even if deferred write were active, RMS would just sit on dirty buffers until one more buffer is requested than is available, and only at that point would it start the write.
So that would only postpone the IO and create a spike at file close, not provide improvements.

The write-behind option is the strongest performance feature RMS has, and the strongest performance boost it can offer comes from going from 1 to 2 buffers. Anything more than 2 buffers suggests that you are overloading the output device, and will only marginally help by keeping stuff in the pipeline.
Last time I looked, the improvements in elapsed time trying 1, 2, 4, 8, 16 buffers were like 10, 6, 5, 4.5, 4.4.

Your mileage can and will vary based on CPU speed, source speed, and the storage deployed.

btw.. Last time I tried 255 buffers for real, the application indeed went from low CPU usage to high CPU usage, but the amount of work done did not increase; it decreased, as it only added CPU time and did nothing to speed up matters. How could it? The basic work of writing data to the disk still needs to be done. Larger buffers help with that, but 'more than enough' buffers do not help more.

Applications using random access may indeed benefit from more buffers, but a simple file copy is not one of those.

Cheers,
Hein.
John Gillings
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Edgar,

Running jobs in parallel is only going to be an improvement if there are "gaps" in processing that parallelism can fill. For single stream I/O bound jobs, running multiple copies can convert a long stream of single I/Os into multiple streams and may give you a wallclock time win.

However, in the case of BACKUP it's doing asynch I/Os anyway, so at best it won't be an improvement, and at worst the streams will get in each other's way (as you have observed). Look at your T4 data for the job running. If you don't see any significant wait states, running multiple streams isn't the way to go (unless you have multiple CPUs and the I/O streams aren't competing at either end).

Timesharing is a must for interactive processing, but for batch jobs, even ignoring the context switching overhead, it's a losing proposition.

Consider: you have 2 jobs which each take 1 hour to process. If they're competing for resources, running them together will take 2 hours, so the mean run time is 2 hours.

Running them in sequence will still take 2 hours, but the first will complete after 1 hour, so the mean run time is 1.5 hours. Again, multiple CPUs may change this.
A crucible of informative mistakes
Cass Witkowski
Trusted Contributor

Re: BACKUP over DECnet, file extensions, performance

If your backup job is the only one writing to the disk, then the single-stream backup job will produce non-fragmented files. When you have multiple backup jobs, each one gets an extent of space, and then the next job gets an extent. The default was something like 1 block, so you can really produce fragmented files. Can you use COPY/ALLOCATION instead of BACKUP, or back up locally and use COPY/ALLOCATION to move the saveset over to the other node?
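
A sketch of that two-step idea (device names and the size are illustrative; /ALLOCATION pre-allocates the output file):

$ BACKUP DEV:[DIR]*.* DEV2:[STAGE]SAVESET.BCK /SAVE_SET      ! local save set first
$ COPY /ALLOCATION=200000 DEV2:[STAGE]SAVESET.BCK RNODE::RDEV:[RDIR]*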

The SET RMS/EXTEND command on the remote side will help.

SET RMS/EXTEND is also very helpful when creating save sets on disk. On an Itanium server we saw a backup of 72 GB 15K disks on an MSA1000 slow down to 1.5 MB/s because the save set on disk was being extended one extent at a time. We were performing over 4,000 FCP calls a second extending the save set. The Ambassador I brought this up to didn't think engineering should be bothered with this problem. I disagree. One would hope BACKUP would be smart enough to ask for a reasonable extent based on how much it thinks it has to back up.
Wim Van den Wyngaert
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

"If your backup job is the only one writing to the disk, then the single stream backup job will produce non-fragemented files"

I tried it with a DCL write script on an empty disk, and monitored it with the DEFRAG/INT=DECW volume fragmentation report graph.

The extents are being allocated all over the disk, thus the files are fragmented (actually on the first half of the disk to start with but leaving big spaces between the fragments).

Wim
Jim_McKinney
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

> I tried it with a DCL write script on an empty disk.

If the disk were empty due to deletion of files, the blocks occupied by those files would be held in the extent cache (if caching is active) and then be first used during creation of the new file. Within the cache no attempt would have been made to consolidate adjacent blocks into one contiguous bundle. This might explain what you've observed.

I'd expect that if your disk were mounted without an extent cache, all blocks would be allocated from the beginning of the BITMAP (low LBN) to the end (high LBN) in a contiguous manner (excepting the placement of INDEXF.SYS, possibly in the middle of the block range). (I also realize that parameters such as /NOCACHE were not previously part of this discussion; if eliminating caching does not produce this behavior, I would be interested in learning that.)
Wim Van den Wyngaert
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Bad luck: I tested on a newly INITed disk.
http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1141655

I did all kinds of backups of 60 MB between an AS500 and an AS1000 without routers in between (yes, I know, very old hardware). 10 Mbit network card and VMS 7.3. All backups have source AS500 and destination AS1000.

1. Backup with extend size 64: 40 min
2. Backup with extend size 50000: 35 min
3. Backup to an NFS disk (thus IP): 10 min
4. My posted backup script, but to the local node, destination an NFS disk (thus IP): 2 min
5. My posted backup script directly to the AS1000: 1 min

I redid 1, 2 and 5 later with about the same performance. Of course network load is not always identical.

Wim
Wim Van den Wyngaert
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

And I continued with simultaneous backups made by the posted script.
1 stream: 55 sec
2 (started 5 seconds apart): 1 min 38 sec
3: 2 min 22 sec

Very small gain, but no loss (note: the script directly allocates the full size, contiguous-best-try).

Wim
Jim_McKinney
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Wim, did your disk have an extent cache enabled?
Robert Gezelter
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Wim,

A note of caution is in order. Until one understands WHY the different scenarios behaved differently, it is dangerous to make presumptions.

In this case, I would be very cautious. I would also be very interested in seeing these tests with various settings for the different RMS parameters on the remote node. I would also be interested in an analysis of where the bottleneck is.

- Bob Gezelter, http://www.rlgsc.com
Wim Van den Wyngaert
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

Jim,

To my knowledge, space is never allocated contiguously if you allocate it in different parts. At the bottom of my old thread the best result was 4 fragments.

Wim
Jim_McKinney
Honored Contributor

Re: BACKUP over DECnet, file extensions, performance

I'm somewhat surprised, but my testing corroborates what Wim observed. Even without extent caching, and with only a single writer to a disk, blocks are not allocated contiguously by default when a large file is created.

$ ld create/siz=100000 test
$ ld connect $1$dga1007:[000000]test.dsk lda7: /AlloClass=12/Share
$ init/system lda7: test
$ mount/noassist/nocache lda7: test
%MOUNT-I-MOUNTED, TEST mounted on _$12$LDA7:
$ dir/size sys$login:named-060_a052.a

Directory DISK$QUORUM:[MCKINNEY]

NAMED-060_A052.A;1 44352

Total of 1 file, 44352 blocks.
$ back/log sys$login:named-060_a052.a $12$lda7:[000000]
%BACKUP-S-CREATED, created $12$LDA7:[000000]NAMED-060_A052.A;1
$ back/log sys$login:named-060_a052.a $12$lda7:[000000];2
%BACKUP-S-CREATED, created $12$LDA7:[000000]NAMED-060_A052.A;2
$ pipe dump/head/block=count=0 $12$lda7:[000000]*.a;* | -
_$ sear sys$pipe $12$,count
Dump of file $12$LDA7:[000000]NAMED-060_A052.A;2 on 1-DEC-2008 11:26:37.74
Global buffer count: 0
File entry linkcount: 0
Count: 5504 LBN: 44496
Count: 976 LBN: 50080
Count: 37920 LBN: 51072
Dump of file $12$LDA7:[000000]NAMED-060_A052.A;1 on 1-DEC-2008 11:26:37.75
Global buffer count: 0
File entry linkcount: 0
Count: 960 LBN: 64
Count: 1024 LBN: 1040
Count: 42416 LBN: 2080