Operating System - OpenVMS

Re: Calculating block & file size of files in multiple savesets

 
Dean McGorrill
Valued Contributor

Re: Calculating block & file size of files in multiple savesets

Kenneth,
if you just used plain BACKUP, then it is not compressed. The saveset is around 10% (plus or minus) larger than the blocks listed by a BACKUP/LIST.
I guess we are still not sure how you compressed the savesets; BACKUP by itself won't do it.
Jan van den Ende
Honored Contributor
Solution

Re: Calculating block & file size of files in multiple savesets

Kenneth,

the answer is much closer than you think!

>>>"$ back/list yoursaveset.bck/save

will give you the total blocks used."

I am looking for the uncompressed block size.
<<<

What that will give you is NOT the number of blocks in the saveset, but the blocks READ in creating the saveset, i.e., the number of blocks you will get upon restore.
(Well, DO allow for cluster-size rounding up. So, add approximately 1/2 * (number-of-files-in-saveset) * target-volume cluster size.)
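As a rough worked example of that estimate (the block count, file count, and device name below are made-up values; substitute the figures from your own BACKUP/LIST output and your actual target disk):

$ listed_blocks = 100000000                       ! blocks reported by BACKUP/LIST
$ file_count = 300000                             ! files in the saveset
$ cluster = f$getdvi( "DKA100:", "CLUSTER" )      ! target volume cluster size
$ estimate = listed_blocks + (file_count * cluster) / 2
$ write sys$output "Estimated blocks needed on restore: ''estimate'"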

hth

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Kenneth Toler
Frequent Advisor

Re: Calculating block & file size of files in multiple savesets


>>>"$ back/list yoursaveset.bck/save

will give you the total blocks used."

<<<

Based on the back/list command above, this could take a while to complete the listing for an entire saveset. This is especially true in my case, where a single saveset can contain as many as 300,000 to 500,000 files.

Is there a quick way to extract the line that contains the number of blocks and files for the entire saveset?


>>>So, add approx 1/2 * (number-of-files-in-saveset) * target-volume cluster size.<<<

Finally, how do I determine the target volume cluster size?
Robert Brooks_1
Honored Contributor

Re: Calculating block & file size of files in multiple savesets

Finally, how do I determine the target volume cluster size?

--

cluster_size = f$getdvi( devnam, "CLUSTER" )

where devnam is the name of the mounted target device
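For instance (DKA100: is just a placeholder here; use whatever device your restore target is actually mounted as):

$ cluster_size = f$getdvi( "DKA100:", "CLUSTER" )
$ show symbol cluster_size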

-- Rob
Robert Gezelter
Honored Contributor

Re: Calculating block & file size of files in multiple savesets

Kenneth,

The problem as posed has several hazards:

- There may be files that were marked NOBACKUP in the saveset (and thus not saved) that WILL occupy space when restored
- If the saveset is stored on a sequential device (tape or simulated tape), then there is no way to determine the length of the saveset without reading through the entire set
- The "breakage" factor relating to the disk cluster size and the BACKUP record size

There are probably a few cases that I missed in the above.

The bottom line is that without parsing the output of a BACKUP/LIST of the saveset, I doubt that it is possible to come up with a truly reliable number.

It is important to note that hardware compression is below the user's visibility in this case. The NOBACKUP files, however, are effectively an optimization within the saveset and do remain an issue for estimating size.

One option is to do the restore to a scratch volume that has far more free space than the normal volumes; the operation can then be staged onto the actual destination.

As usual, the depth of the response is limited by the details of the target environment.

I hope that the above is helpful.

- Bob Gezelter, http://www.rlgsc.com
Jon Pinkley
Honored Contributor

Re: Calculating block & file size of files in multiple savesets

Kenneth,

This is the fourth question about what appears to be the same problem.

PKZIP for VMS vs. backup/log http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1114625
Total Number of Files and Blocks inside savesets http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1121642
Need to speed up expansion of very large savesets http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1122470
Calculating block & file size of files in multiple savesets http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1133933

Can you please provide a few more details about the actual problem you are trying to solve? We are answering the specific questions you ask, but the answers don't seem to be solving your real problem.

This appears to be a data transfer issue, not an archival issue.

Reading between the lines, it seems you have a process that creates many files (300,000 - 500,000 individual files containing a total of more than 50 GB of data) on an ongoing basis. This data is delivered to another party periodically. The apparent problem is that the "customer" is complaining about how long it takes to get the data into a form that they can process.

Once the customer "unloads" the data to their disk, what do they do with it? After they process it, do they delete it to make room for the next set of data? Specifically, do they process it multiple times, or do they only need to process it once in its raw form? For example, if they are reading the data and loading it into another database, then once they have processed it they no longer need the original data.

The reason I ask is that if they are processing the data only once, and it is possible for you to change your data-collection procedure, you will be able to provide the data in a form that is usable within a very short time from the customer's point of view.

If they are only processing the data once, and you don't need a copy of the original data, you can create the data on a removable disk that you deliver to them once the disk is "full". If you had two disks, you could exchange them (double buffering), but if you need to have a drive available for collected data at all times, you will need to get the previous disk back before your primary disk fills; i.e., you may need more than two drives. In a previous thread I suggested the use of an LD container file as the "drive", but you seemed reluctant to use LDDRIVER.

The modified procedure to transfer data would be:

At your site:

1. Prepare collection/transfer disk. (Connect, Initialize, Mount)
2. Store data to disk until disk nearly full.
3. Remove disk, send to data consumer.
4. Go to step 1.

At customer site:

1. Ready input disk (Connect, Mount)
2. Process data
3. Remove disk, send to data provider.
4. Goto step 1.

The customer can be processing one set of data while you are collecting/generating the next.

Note that in this scenario no backups/restores are done; it is just mount and go. If you do need to keep a copy of the data, you will need to back up the drive before it is sent, or use HBVS (host-based volume shadowing) to another drive (which can be an LD device) so the data is copied to two places as it is saved.

The disk can be either a removable SCSI disk or an LD container file; the procedure is essentially the same. The key is that you are providing them with a disk that has the files in a usable state, without the need to do an unload (i.e. a restore of a backup saveset or an unzip of a zip file).
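To make the LD variant concrete, here is a rough sketch of the commands involved (the container file name, size, unit number, and volume label are made-up, LDDRIVER must already be installed, and the exact qualifiers may differ with your LD version):

At your site:

$ ld create dka100:[xfer]exchange.dsk /size=100000000   ! ~50 GB container file
$ ld connect dka100:[xfer]exchange.dsk lda1:            ! present it as device LDA1:
$ initialize lda1: xferdata
$ mount lda1: xferdata
$ ! ... generate/copy the files directly onto LDA1: ...
$ dismount lda1:
$ ld disconnect lda1:
$ ! copy or ship EXCHANGE.DSK to the customer

At the customer site:

$ ld connect dkb200:[incoming]exchange.dsk lda1:
$ mount lda1: xferdata
$ ! ... process the files in place ...
$ dismount lda1:
$ ld disconnect lda1: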

Answers to your specific questions:

The only way to get an accurate estimate of the output size of an arbitrary saveset is what is reported by BACKUP/LIST. However, this data does not change, and you can create a listing file at the time of the initial backup (just include /LIST=file in the BACKUP command that creates the saveset). Once the saveset is created, the time-consuming process of listing the contents does not need to be done again. Deliver the listing file along with the backup saveset (assuming you are not going to use my proposed solution, in which case the listing isn't needed).
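For example, something along these lines (the file specifications are placeholders; if I remember the listing format correctly, it ends with a summary line of the form "Total of n files, m blocks", which SEARCH can then pull out without re-reading the saveset):

$ backup dka100:[data...]*.*;* dkb200:[xfer]data.bck/save_set/group_size=0 -
    /list=dkb200:[xfer]data.lis
$ search dkb200:[xfer]data.lis "Total of"

That also answers your earlier question about quickly extracting the totals line from a very large listing.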

In your case, as long as you do not have files that are marked /NOBACKUP, you are not using data compression, you are creating the backup savesets on disk, and you specify /GROUP_SIZE=0 (no redundancy), the size of the saveset will be a good approximation of the size of the restored data when restoring to a disk with a cluster size of 1. But you would want to have somewhat more space available than the size of the saveset, as you would not want to run out of space during a restore. This problem can be avoided by just exchanging drives.

Jon
it depends