
Never ending ZIP process ...

 
Art Wiens
Respected Contributor

Never ending ZIP process ...

actually it did end when I killed it 24 hours later ;-)

I have a directory with > 250,000 text files (15 months of invoices) and I want to zip up the oldest three months. I used this command:

$ zip_cli disk:[dir.subdir.2006]2006_2.zip *.txt;* /bef=01-oct-2006 /move /keep /vms /nofull_path

The process ran for 24 hours without producing "anything". 48M i/o's and nothing to show for it ;-)

Would it be "faster" to use the /INLIST parameter? Can (should) the format of the INLIST file be the result of a DIR/NOHEAD/NOTRAIL/OUT=, or should it just be a list of filenames only, i.e. no disk/dir info?
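Something like this is what I have in mind (untested; OLDFILES.LIS is just a name I made up):

$ DIRECTORY /NOHEADING /NOTRAILING /COLUMNS=1 /BEFORE=01-OCT-2006 -
  /OUTPUT=OLDFILES.LIS disk:[dir.subdir.2006]*.txt;*
$ zip_cli disk:[dir.subdir.2006]2006_2.zip /inlist=OLDFILES.LIS /move /keep /vms /nofull_path

That DIR output would contain full file specifications (device and directory included), hence the question about whether names alone are expected.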

I am presently renaming the files I want to zip to a temporary directory ... almost finished July 2006!

Cheers,
Art
27 REPLIES
Art Wiens
Respected Contributor

Re: Never ending ZIP process ...

FYI ...

This is Zip 2.32 (June 19th 2006), by Info-ZIP.

Alpha 800, VMS 7.2-2, HSG80 disk

Art
Hoff
Honored Contributor

Re: Never ending ZIP process ...

Big directories are slow, bigger directories are slower still, and big directories are massively slower prior to V7.2.

And is the zip archive data size beyond the limits? Zip tips over rather ungracefully somewhere between 2 GB and 4 GB; this is prior to the version 3 (zip) and version 6 (unzip) betas. Specific details on the triggering limits are included on the Info-ZIP site.

I'd move each of the oldest three months out of the directory (COPY or RENAME) into a set of scratch directories, and package those individually from there; something like the sketch below. This would help determine whether this is a data size issue or a directory size issue.
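Along these lines, repeated per month (directory and archive names here are purely illustrative, and the date selection assumes creation dates are what you care about):

$ CREATE /DIRECTORY disk:[dir.scratch_2006_07]
$ RENAME /CREATED /SINCE=01-JUL-2006 /BEFORE=01-AUG-2006 -
  disk:[dir.subdir.2006]*.txt;* disk:[dir.scratch_2006_07]
$ zip_cli disk:[dir]2006_07.zip disk:[dir.scratch_2006_07]*.txt;* /vms /nofull_path

If the per-month zip runs complete in reasonable time, the problem is the quarter-million-entry directory scan rather than the archive size.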

If you're so inclined, there are always the undocumented and unsupported BACKUP compression capabilities latent in V8.3.

And a quarter-million files in a directory is certainly also an application issue. That scale has never been something OpenVMS has been good at. This area would be fodder for application changes; a quarter-million individual files in a directory just doesn't scale... (And I have to wonder if the files are used with sufficient frequency, or if there's a better model for providing whatever or however these files are used.)

A directory such as this one -- with a quarter-million files -- is unlikely to see significant improvements in the future. (Well, not without significant ancillary upheaval.) (Part of what's going on here can be masked by always adding and removing files at the end of the directory and by avoiding certain operations, but the design of directories on ODS-2 and ODS-5 is quite simply not scaling. Scaling is probably the largest area where ODS-2 and ODS-5 are showing the relative age of the underlying designs -- a quarter-million files and a terabyte of storage just wasn't something considered reasonable way back when.)
Steven Schweda
Honored Contributor

Re: Never ending ZIP process ...

On the bright side, Zip 3.0 should emit a friendly message like:
Scanning files ..[...]
in this situation. (With _lots_ of dots, potentially. I believe that this was added in an attempt to soothe anxious users in just this situation.)

As for what would be faster, I don't know.

I believe that /INLIST (-i) wants the same kind of pattern specs as you'd put on the command line, so a device spec should be stripped off, but a directory spec will be used. (A quick test, like the one sketched below, should reveal all.)

Knowing nothing, I tend to doubt that scanning a name list will be particularly faster (or slower) than doing the wildcard expansion. It'll still do for every name the same stuff it always does.
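For example, a throwaway test with a hand-made list (the names here are invented) would show which entry forms are accepted:

$ TYPE NAMES.LIS
disk:[dir.subdir.2006]INV123456.TXT
[.2006]INV123456.TXT
INV123456.TXT
$ zip_cli TEST.ZIP /inlist=NAMES.LIS /vms

Whichever entries land in TEST.ZIP tell you what the option wants.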

I can't admit to doing any performance analysis on the file scanning code, so I don't know what it's doing that takes all the time. (Gathering exciting file attribute data, perhaps?)

It might work better to create the archive with some subset of the total file list, and then add additional subsets using separate Zip commands. This might increase the total time, but you could get some results (messages and actual archive data) sooner. (I don't actually do this much, but it _is_ supposed to work. That's comforting, right?)
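For instance, month-by-month passes against the same archive (this assumes the CLI has a /SINCE qualifier to pair with /BEFORE; I have not double-checked that here):

$ zip_cli disk:[dir.subdir.2006]2006_2.zip *.txt;* /since=01-jul-2006 /bef=01-aug-2006 /vms /nofull_path
$ zip_cli disk:[dir.subdir.2006]2006_2.zip *.txt;* /since=01-aug-2006 /bef=01-sep-2006 /vms /nofull_path
$ zip_cli disk:[dir.subdir.2006]2006_2.zip *.txt;* /since=01-sep-2006 /bef=01-oct-2006 /vms /nofull_path

The second and third runs add to the existing archive, which is Zip's default behavior when the archive already exists.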
Art Wiens
Respected Contributor

Re: Never ending ZIP process ...

Hoff, I totally agree with your take on the situation ... that many files in a directory is "insane". Trying to do anything in that dir takes much patience! I have spoken to the application people about this and it's on their "todo" list, but it's not their highest priority. Beyond that, I really wonder if there isn't something wrong with the app. Considering the business they're in, 15,000 - 20,000 invoices a month doesn't seem reasonable (to me).

Steven, if the list contained exact filenames, i.e. no wildcards, wouldn't it spend much less time than getting the timestamps from 250,000 files and "doing the math" to match the /before= switch?

Thanks,
Art
Steven Schweda
Honored Contributor

Re: Never ending ZIP process ...

The /BEFORE (-tt) may be slowing it down, but eliminating that will not eliminate the whole file scanning task. Only one way to find out.
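Something crude like this (archive names invented), ideally against a copy or a smaller directory, would give a rough comparison:

$ WRITE SYS$OUTPUT F$TIME()
$ zip_cli T_ALL.ZIP *.txt;* /vms /nofull_path
$ WRITE SYS$OUTPUT F$TIME()
$ zip_cli T_BEF.ZIP *.txt;* /bef=01-oct-2006 /vms /nofull_path
$ WRITE SYS$OUTPUT F$TIME()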
Dean McGorrill
Valued Contributor

Re: Never ending ZIP process ...

>I have a directory with > 250,000 text files (15 months of invoices) and I want to

So many problems have been fixed by not piling that many files into a single directory. As you move stuff out of the directory, note that there was a directory file 'defragger' tool around once. I'm not sure if it's in the freeware stuff, but it sure used to speed things up a lot.

Dean
Hein van den Heuvel
Honored Contributor

Re: Never ending ZIP process ...

If you are renaming tens of thousands of files out of a directory the 'normal way', then you will suffer maximum hurt due to directory block recovery.

You'll tackle 'low' blocks first, and as they empty the system will shuffle down the rest of the directory (a block at a time for old VMS versions, 64 blocks at a time by default for current versions).

You may want to consider a two-tiered approach, and you may want to consider a program, not a DCL script.

Rename in reverse order, out of the original, to a helper, enforcing an increasing order. Next, rename from the helper, again from high to low, into the real target in real order.

I thought I'd give you a quick Perl script to demonstrate. It works, but it got a bit out of hand; in this case Perl tried to help too much.

1) If it sees a wildcard on the command line, then it will expand that into a list of files.

2) The glob function returns fully expanded names, which you then have to take apart to get the directory. A simple directory might be nicer for that.

Full example below...

Hein.


$
$ dir [.tmp*]

Directory [.TMP]

2005MIES.TMP;1 2006AAP.TMP;1 2006MIES.TMP;2 2006MIES.TMP;1
2006NOOT.TMP;1 2007MIES.TMP;1 FOO.DIR;1 NEWSRC.;1
OTS_HIST_DETAIL.DIR;1

Total of 9 files.
$ perl rename.pl """[.tmp]2006*.*"""
wild: "[.tmp]2006*.*"
3 files moved to [.tmp_helper]

Directory [.TMP_HELPER]

0000002006AAP.TMP;1 0000012006MIES.TMP;1
0000022006NOOT.TMP;1

Total of 3 files.
$ dir [.tmp*]

Directory [.TMP]

2005MIES.TMP;1 2006MIES.TMP;1 2007MIES.TMP;1 FOO.DIR;1
NEWSRC.;1 OTS_HIST_DETAIL.DIR;1

Total of 6 files.

Directory [.TMP_RENAMED]

2006AAP.TMP;1 2006MIES.TMP;1 2006NOOT.TMP;1

Total of 3 files.

Grand total of 2 directories, 9 files.

$ type rename.pl
use strict;
#use warnings;

my $HELPER = "[.tmp_helper]";
my $TARGET = "[.tmp_renamed]";

my ($i,$file);
$_ = shift or die "Please provide double quoted wildcard filespec";

print "wild: $_\n";
s/"//g;
my @files = glob;
die "Please provide double quoted wildcard filespec" if @files < 2;

# phase 1

for ($i=0; $i<@files; $i++) {
my $old = $files[$i];
my $name = (split /\]/,$old)[1];
my $new = sprintf("%s%06d%s",$HELPER,$i,$name);
# print "$old --> $new\n";
rename $old, $new;
}

print "$i files moved to $HELPER\n";
system ("DIRECTORY $HELPER");
# phase 2

while ($i-- > 0) {
my $name = (split /\]/,$files[$i])[1];
my $helper = sprintf("%s%06d%s",$HELPER,$i,$name); # name assigned in phase 1
rename $helper, $TARGET.$name; # move from helper into the real target
}
Jon Pinkley
Honored Contributor

Re: Never ending ZIP process ...

The BACKUP command has a /FAST switch that tells it to scan the INDEXF.SYS file linearly instead of looking in every directory and randomly accessing file headers in INDEXF.SYS.

For the type of operation you are doing, that method would be faster if you were using BACKUP. Perhaps you can talk Steven into adding such an option to zip 3.x :-)

Art, when you said it didn't produce anything, are you saying it hadn't created any temporary file? How did you determine that it hadn't? (i.e. did you use ANALYZE/SYSTEM; set proc/in=xxxx; sho proces/chan, or something else?)

Given your present situation, the following may work, although I don't know if there are any size limits on what can be presented to zip_cli as batch input.

You could use DFU search /cre=(since=xxx,before=yyy) /file=zzz /sort /out=xxx.lis (I would suggest using a batch job)

Then edit the output to remove the extra junk, and use that as a batch input file to zip_cli?
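The "remove the extra junk" step could be a small DCL loop instead of an edit. A rough, untested sketch (file names invented) that keeps only the name, type and version from the first token of each line of the DFU output:

$ OPEN /READ IN XXX.LIS
$ OPEN /WRITE OUT ZIPLIST.LIS
$ LOOP:
$   READ /END_OF_FILE=DONE IN LINE
$   FSPEC = F$ELEMENT(0, " ", F$EDIT(LINE, "TRIM,COMPRESS"))
$   IF FSPEC .EQS. "" THEN GOTO LOOP
$   WRITE OUT F$PARSE(FSPEC,,,"NAME","SYNTAX_ONLY") + F$PARSE(FSPEC,,,"TYPE","SYNTAX_ONLY") + F$PARSE(FSPEC,,,"VERSION","SYNTAX_ONLY")
$   GOTO LOOP
$ DONE:
$ CLOSE IN
$ CLOSE OUT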

Jon
it depends
Hoff
Honored Contributor

Re: Never ending ZIP process ...

>>> As you move stuff out of the directory, there was a directory file 'defragger' tool around once. <<<

Directories are contiguous, and are managed and maintained by the file system in sorted order.

There are fast(er) delete tools (e.g., the so-called reverse delete that was mentioned), and the changes in caching in V7.2 improve performance, but there are inherent limits in how the processing occurs in directory files.

There are tweaks around that increase the amount of deleted storage that can be present in a directory before the compression starts.

Hein, when did the directory I/O fix hit? V7.2? Prior to this fix, directory I/O was one-block I/O, and the directory deletion block shuffle performance was really really really bad. In releases containing the fix, bigger I/O is used. And 64 blocks? ISTR the directory-shuffle I/O was 127 blocks.

Regardless, delete is still N^2 in the directory; the earlier/lower the file sort order, the worse the performance. And worse gets bad fast.

Extending this directory as you add files is also going to be slow -- even if the insertion order is chosen carefully. If the insertion order was not chosen with care, well...

The other potential option here is a performance workaround: move to silly-fast storage. DECram, for instance.