Operating System - OpenVMS

Never ending ZIP process ...

Art Wiens
Respected Contributor

Never ending ZIP process ...

actually it did end when I killed it 24 hours later ;-)

I have a directory with > 250,000 text files (15 months of invoices) and I want to zip up the oldest three months. I used this command:

$ zip_cli disk:[dir.subdir.2006]2006_2.zip *.txt;* /bef=01-oct-2006 /move /keep /vms /nofull_path

The process ran for 24 hours without producing "anything": 48M I/Os and nothing to show for it ;-)

Would it be "faster" to use the /INLIST parameter? Can (should) the INLIST file be the output of a DIR/NOHEAD/NOTRAIL/OUT=, or should it just be a list of filenames only, i.e. no disk/dir info?

I am presently renaming the files I want to zip to a temporary directory ... almost finished July 2006!

Cheers,
Art
27 REPLIES
Art Wiens
Respected Contributor

Re: Never ending ZIP process ...

FYI ...

This is Zip 2.32 (June 19th 2006), by Info-ZIP.

Alpha 800, VMS 7.2-2, HSG80 disk

Art
Hoff
Honored Contributor

Re: Never ending ZIP process ...

Big directories are slow, bigger directories are slower still, and prior to V7.2 big directories are massively slower.

And is the zip archive data size beyond the limits? zip tips over rather ungracefully somewhere between 2 GB and 4 GB, at least prior to the zip 3 and unzip 6 betas. Specific details on the triggering limits are on the Info-ZIP site.

I'd move each of the oldest three months out of the directory (COPY or RENAME) into a set of scratch directories, and package those individually from there. This would help determine whether this is a data size or a directory size issue.
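
Something along these lines, say (device, directory, and date values here are illustrative, not your actual names):

$ create/directory disk:[dir.scratch_200607]
$ rename disk:[dir.subdir.2006]*.txt;* /since=01-jul-2006 /before=01-aug-2006 disk:[dir.scratch_200607]
$ zip_cli disk:[dir]2006_07.zip disk:[dir.scratch_200607]*.txt;* /vms /nofull_path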

If you're so inclined, there are always the undocumented and unsupported BACKUP compression capabilities latent in V8.3.

And a quarter-million files in a directory is certainly also an application issue. That scale has never been something OpenVMS has been good at. This area would be fodder for application changes; a quarter-million individual files in a directory just doesn't scale... (And I have to wonder if the files are used with sufficient frequency, or if there's a better model for providing whatever or however these files are used.)

A directory such as this one -- with a quarter-million files -- is unlikely to see significant improvements in the future. (Well, not without significant ancillary upheaval.) (Part of what's going on here can be masked by always adding and removing files at the end of the directory and by avoiding certain operations, but the design of directories on ODS-2 and ODS-5 is quite simply not scaling. Scaling is probably the largest area where ODS-2 and ODS-5 are showing the relative age of the underlying designs -- a quarter-million files and a terabyte of storage just wasn't something considered reasonable way back when.)
Steven Schweda
Honored Contributor

Re: Never ending ZIP process ...

On the bright side, Zip 3.0 should emit a
friendly message like:
Scanning files ..[...]
in this situation. (With _lots_ of dots,
potentially. I believe that this was added
in an attempt to soothe anxious users in just
this situation.)

As for what would be faster, I don't know.

I believe that /INLIST (-i) wants the same
kind of pattern specs as you'd put on the
command line, so a device spec should be
stripped off, but a directory spec will be
used. (A quick test should reveal all.)
Knowing nothing, I tend to doubt that
scanning a name list will be particularly
faster (or slower) than doing the wildcard
expansion. It'll still do for every name the
same stuff it always does.

I can't admit to doing any performance
analysis on the file scanning code, so I
don't know what it's doing that takes all the
time. (Gathering exciting file attribute
data, perhaps?)

It might work better to create the archive
with some subset of the total file list, and
then add additional subsets using separate
Zip commands. This might increase the total
time, but you could get some results
(messages and actual archive data) sooner.
(I don't actually do this much, but it _is_
supposed to work. That's comforting, right?)
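
Something like this, say (dates and
names illustrative, untested by me):

$ zip_cli big.zip *.txt;* /since=01-jul-2006 /before=01-aug-2006 /vms /nofull_path
$ zip_cli big.zip *.txt;* /since=01-aug-2006 /before=01-sep-2006 /vms /nofull_path

The second command adds to the existing
archive; Zip updates an existing archive
by default.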
Art Wiens
Respected Contributor

Re: Never ending ZIP process ...

Hoff, I totally agree with your take on the situation ... that many files in a directory is "insane". Trying to do anything in that dir takes much patience! I have spoken to the application people about this and it's on their "todo" list, but it's not their highest priority. Beyond that, I really wonder if there isn't something wrong with the app. Considering the business they're in, 15,000 - 20,000 invoices a month doesn't seem reasonable (to me).

Steven, if the list contains exact filenames, i.e. no wildcards, would it not have to spend much less time than getting the timestamps of 250,000 files and "doing the math" to match the /before= switch?

Thanks,
Art
Steven Schweda
Honored Contributor

Re: Never ending ZIP process ...

The /BEFORE (-tt) may be slowing it down, but
eliminating that will not eliminate the whole
file scanning task. Only one way to find out.
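
One way to compare, on a small sample
(names illustrative, untested by me):

$ show time
$ zip_cli t1.zip sample*.txt;* /vms /nofull_path
$ show time
$ zip_cli t2.zip sample*.txt;* /before=01-oct-2006 /vms /nofull_path
$ show time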
Dean McGorrill
Valued Contributor

Re: Never ending ZIP process ...

>I have a directory with > 250,000 text files (15 months of invoices) and I want to

So many problems have been fixed by not piling that many files into a single directory.
As you move stuff out of the directory: there was a directory file 'defragger' tool around once. I'm not sure if it's in the freeware stuff, but it sure used to speed things up a lot. Dean
Hein van den Heuvel
Honored Contributor

Re: Never ending ZIP process ...

If you are renaming tens of thousands of files out of a directory the 'normal way', then you will suffer maximum hurting due to directory block recovery.

You'll tackle 'low' blocks first, and as they empty the system will shuffle down the rest of the directory (a block at a time for old VMS versions, 64 blocks at a time by default for current versions).

You may want to consider a two-tiered approach, and you may want to consider a program, not a DCL script.

Rename in reverse order out of the original directory into a helper directory, enforcing an increasing name order. Next, rename from the helper, again from high to low, into the real target in the real order.

I thought I'd give you a quick perl script to demonstrate. It works, but it got somewhat out of hand; in this case perl tried to help too much:

1) If it sees a wildcard on the command line, then it will expand that into a list of files.

2) The glob function returns fully expanded names, which you then have to take apart to get the directory. A simple directory might be nicer for that.

Full example below...

Hein.


$
$ dir [.tmp*]

Directory [.TMP]

2005MIES.TMP;1 2006AAP.TMP;1 2006MIES.TMP;2 2006MIES.TMP;1
2006NOOT.TMP;1 2007MIES.TMP;1 FOO.DIR;1 NEWSRC.;1
OTS_HIST_DETAIL.DIR;1

Total of 9 files.
$ perl rename.pl """[.tmp]2006*.*"""
wild: "[.tmp]2006*.*"
3 files moved to [.tmp_helper]

Directory [.TMP_HELPER]

0000002006AAP.TMP;1 0000012006MIES.TMP;1
0000022006NOOT.TMP;1

Total of 3 files.
$ dir [.tmp*]

Directory [.TMP]

2005MIES.TMP;1 2006MIES.TMP;1 2007MIES.TMP;1 FOO.DIR;1
NEWSRC.;1 OTS_HIST_DETAIL.DIR;1

Total of 6 files.

Directory [.TMP_RENAMED]

2006AAP.TMP;1 2006MIES.TMP;1 2006NOOT.TMP;1

Total of 3 files.

Grand total of 2 directories, 9 files.

$ type rename.pl
use strict;
#use warnings;

my $HELPER = "[.tmp_helper]";
my $TARGET = "[.tmp_renamed]";

my ($i,$file);
$_ = shift or die "Please provide double quoted wildcard filespec";

print "wild: $_\n";
s/"//g;
my @files = glob;
die "Please provide double quoted wildcard filespec" if @files < 2;

# phase 1

for ($i=0; $i<@files; $i++) {
my $old = $files[$i];
my $name = (split /\]/,$old)[1];
my $new = sprintf("%s%06d%s",$HELPER,$i,$name);
# print "$old --> $new\n";
rename $old, $new;
}

print "$i files moved to $HELPER\n";
system ("DIRECTORY $HELPER");
# phase 2

while ($i-- > 0) {
my $old = $files[$i];
my $name = (split /\]/,$old)[1];
my $new = sprintf("%s%06d%s",$HELPER,$i,$name);
rename $new, $TARGET.$name;
}
Jon Pinkley
Honored Contributor

Re: Never ending ZIP process ...

The BACKUP command has a /fast switch that tells it to scan the INDEXF.SYS file linearly instead of looking in every directory and randomly accessing file headers in INDEXF.SYS.

For the type of operation you are doing, that method would be faster if you were using backup. Perhaps you can talk Steven into adding such an option to zip 3.x :-)
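
For illustration, something along these lines (names illustrative; I have not tried this combination):

$ backup/fast disk:[dir.subdir.2006]*.txt;*/before=01-oct-2006 disk2:[archive]2006_2.bck/save_set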

Art, when you said it didn't produce anything, are you saying it hadn't created any temporary file? How did you determine that it hadn't? (i.e. did you use ANALYZE/SYSTEM; set proc/in=xxxx; sho process/chan, or something else?)

Given your present situation, the following may work, although I don't know if there are any size limits on what can be presented to zip_cli as batch input.

You could use DFU search /cre=(since=xxx,before=yyy) /file=zzz /sort /out=xxx.lis (I would suggest using a batch job)

Then edit the output to remove the extra junk, and use that as a batch input file to zip_cli?
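
Something like this, perhaps (all names illustrative, untested):

$ dfu search disk: /cre=(before=01-oct-2006) /file=*.txt /sort /out=tozip.lis
$ ! trim DFU's header and trailer lines from tozip.lis, then:
$ zip_cli 2006_2.zip /batch=tozip.lis /move /keep /vms /nofull_path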

Jon
it depends
Hoff
Honored Contributor

Re: Never ending ZIP process ...

>>> As you move stuff out of the directory, there was a directory file 'defragger' tool around once. <<<

Directories are contiguous, and are managed and maintained by the file system in sorted order.

There are fast(er) delete tools (eg: the so-called reverse delete that was mentioned), and the changes in caching in V7.2 improve performance, but there are inherent limits in how the processing occurs in directory files.

There are tweaks around that increase the amount of deleted storage that can be present in a directory before the compression starts.

Hein, when did the directory I/O fix hit? V7.2? Prior to this fix, directory I/O was one-block I/O, and the directory deletion block shuffle performance was really really really bad. In releases containing the fix, bigger I/O is used. And 64 blocks? ISTR the directory-shuffle I/O was 127 blocks.

Regardless, delete is still N^2 in the directory; the earlier/lower the file sort order, the worse the performance. And worse gets bad fast.

Extending this directory as you add files is also going to be slow -- even if the insertion order is chosen carefully. If the insertion order was not chosen with care, well...

The other potential option here is a performance workaround: move to silly-fast storage. DECram, for instance.

Hein van den Heuvel
Honored Contributor

Re: Never ending ZIP process ...

For directories with lots of little files you may also want to consider an LD device on a file.

The LD device could then have a cluster size which works well with the typical small file size.

When all files are collected, dismount the LD device and zip up the file containing the device !?

Applications creating lots of files in a single directory can often, but not always, be helped transparently by using a search list. Define the target directory as [current],[rest].
That could also be [200710],[200709],[2007],[2006]
Once a month, or once a year, or whenever appropriate, you push a new, empty, but pre-allocated directory in front of the definition and perhaps clean up the tail.
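
For example (logical name, directory names, and allocation illustrative):

$ create/directory/allocation=500 disk:[200711]
$ define/system invoice_dir disk:[200711],disk:[200710],disk:[2007],disk:[2006]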


Btw, on my last comment about perl helping too much: that really applies to a simple DIRECTORY also, unless your default is actually in the directory itself.

Below is an example of that, still using perl.
By using the DIRECTORY command, all the selection switches (/BEFORE...) are available.
Perl has those also, but we all know DIRECTORY better.


Hein,


$ perl [-]rename_dir.pl """2006*.*"""
wild: "2006*.*"
2006AAP.TMP;1 --> [-.tmp_helper]0000002006AAP.TMP;1
2006MIES.TMP;2 --> [-.tmp_helper]0000012006MIES.TMP;2
2006MIES.TMP;1 --> [-.tmp_helper]0000022006MIES.TMP;1
2006NOOT.TMP;1 --> [-.tmp_helper]0000032006NOOT.TMP;1
4 files moved to [-.tmp_helper]

Directory [HEIN.TMP_HELPER]

0000002006AAP.TMP;1 0000012006MIES.TMP;2
0000022006MIES.TMP;1 0000032006NOOT.TMP;1

Total of 4 files.
$ dir [-.tmp*]

Directory [HEIN.TMP]

2005MIES.TMP;1 2007MIES.TMP;1 FOO.DIR;1 NEWSRC.;1
OTS_HIST_DETAIL.DIR;1

Total of 5 files.

Directory [HEIN.TMP_RENAMED]

2006AAP.TMP;1 2006MIES.TMP;2 2006MIES.TMP;1 2006NOOT.TMP;1

Total of 4 files.

Grand total of 2 directories, 9 files.
$ type [-]rename_dir.pl
use strict;
#use warnings;

my $HELPER = "[-.tmp_helper]";
my $TARGET = "[-.tmp_renamed]";
my $i = 0;
my @files;
$_ = shift or die "Please provide double quoted wildcard filespec";

print "wild: $_\n";
s/"//g;
my $wild = $_;
foreach (`DIRECTORY/COLU=1 $wild`) {
chomp;
$files[$i++] = $_ if /;/;
}
die "Please provide double quoted wildcard filespec" if @files < 2;

# phase 1

for ($i=0; $i<@files; $i++) {
my $name = $files[$i];
my $new = sprintf("%s%06d%s",$HELPER,$i,$name);
print "$name --> $new\n";
rename $name, $new;
}

print "$i files moved to $HELPER\n";
system ("DIRECTORY $HELPER");
# phase 2

while ($i-- > 0) {
my $name = $files[$i];
rename sprintf("%s%06d%s",$HELPER,$i,$name), $TARGET.$name;
}



Hein van den Heuvel
Honored Contributor

Re: Never ending ZIP process ...

Hoff,

I think Andy did the directory shuffle improvements in 7.2 but I'm travelling and do not have my (VMS)notes here to verify that.

The actual IO size used is defined by the SYSGEN parameter ACP_MAXREAD:

http://h71000.www7.hp.com/DOC/82FINAL/6048/6048pro_090.html
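
You can check the current value with:

$ mcr sysgen
SYSGEN> show acp_maxread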

All,

Please find attached a special XFC trace our friend Mark H made, showing the XQP directory management behaviour. The example is for a delete. For a rename, the bitmap writes will go away, and the INDEXF.SYS writes might go away (directory backlink update!?), but the directory IO is exactly the same on the old side, joined with directory IO for the new name.

The trace shows most recent on top, so you need to start at the bottom.
The first few deletes are 'easy'.
Then, about 20 lines from the bottom, a delete empties a directory block and triggers a down shuffle.

Enjoy!
Hein van den Heuvel
HvdH Performance Consulting.


Art Wiens
Respected Contributor

Re: Never ending ZIP process ...

What would the syntax be to use /inlist= ?

$ zip_cli test.zip /inlist=test_tozip.lis /move/keep/vms/nofull_path

zip error: Invalid command arguments (nothing to select from)

Where the inlist file has:

$ ty test_tozip.lis
TEST.COM;13
TEST.COM;12
TEST.COM;11
TEST.COM;10
TEST.COM;9
TEST.COM;8
TEST.COM;7
TEST.COM;6
TEST.COM;5
TEST.COM;4
TEST.COM;3
TEST.COM;2
TEST.COM;1

Art
Art Wiens
Respected Contributor

Re: Never ending ZIP process ...

Or should I actually be using the /batch= qualifier rather than /inlist= ?

Art
Art Wiens
Respected Contributor

Re: Never ending ZIP process ...

Yes Art, use the /batch qualifier:

$ zip_cli test.zip /batch=test_tozip.lis /move/keep/vms/nofull_path
adding: TEST.COM;13 (deflated 45%)
adding: TEST.COM;12 (deflated 12%)
adding: TEST.COM;11 (deflated 15%)
adding: TEST.COM;10 (deflated 15%)
adding: TEST.COM;9 (deflated 10%)
adding: TEST.COM;8 (deflated 4%)
adding: TEST.COM;7 (deflated 10%)
adding: TEST.COM;6 (deflated 10%)
adding: TEST.COM;5 (deflated 1%)
adding: TEST.COM;4 (deflated 1%)
adding: TEST.COM;3 (deflated 63%)
adding: TEST.COM;2 (deflated 63%)
adding: TEST.COM;1 (deflated 63%)

Cheers,
Art
Steven Schweda
Honored Contributor

Re: Never ending ZIP process ...

> Yes Art, use the /batch qualifier:

Oops. Yes. Sorry. /EXCLUDE (-x), /EXLIST
(-x@), /INCLUDE (-i), and /INLIST (-i@) are
for file name filtering. /BATCH (-@) is for
the actual file name specification task.

Boy, you just can't trust me for anything.
Dean McGorrill
Valued Contributor

Re: Never ending ZIP process ...

>>> As you move stuff out of the directory, there was a directory file 'defragger' tool around once. <<<

Hoff, this was some tool one of the system managers had to shrink a directory; it supposedly made things faster. I didn't trust it. The reason I remember it is that it broke on him once doing a user's mail and corrupted the directory file. He wound up picking all the files out of SYSLOST; he got 'em back with ANALYZE/DISK/REPAIR. (Unhappy user.)
Hein van den Heuvel
Honored Contributor

Re: Never ending ZIP process ...

Re: Directory compression

Well, there used to be a tool called 'comdir' we (Colin Blake) developed in the 80's in Valbonne.

DFU picked this up in its DIRECTORY/COMPRESS command, with an optional /FILL_FACTOR:

DFU DIRECTORY /FILL_FACTOR=percentage

This qualifier is only valid in combination with /COMPRESS. Default behaviour for DFU is to compress a directory as tight as possible; this is equivalent to /FILL_FACTOR=100. By choosing a lower fill_factor DFU will leave some free space in each directory block. /FILL_FACTOR may be between 50 and 100 %. Caution : choosing a fill_factor lower than 100% may fail if the directory file is not large enough. In that case DFU will signal an error and advise using a higher fill_factor.
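
For example (directory spec and DFU image location illustrative; be careful, and have a backup):

$ dfu :== $disk:[tools]dfu.exe   ! foreign command; DFU's actual location is an assumption
$ dfu directory/compress/fill_factor=90 disk:[dir.subdir.2006]2006.dir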


Hein.

Wim Van den Wyngaert
Honored Contributor

Re: Never ending ZIP process ...

If the application uses a logical to access the files, you can use a search list to force creation in smaller directories (without modifying the application itself).

define logic dev:[2006.12],dev:[2006.11]

create logic:a.txt

The file will be created in the 12 directory.

In January you simply change it to

define logic dev:[2007.1],dev:[2006.12],dev:[2006.11]

and files are created in the 1 directory.

Thus you can zip a much smaller directory.

This will of course not solve your problem now, because you are already stuck with the big directory.

Wim
labadie_1
Honored Contributor

Re: Never ending ZIP process ...

A copy of an old post


I suppose that your application puts files in a directory named disk$appli,
and that your application is started every morning and shut down every evening.

Maybe you should do the following: define disk$appli as a search list.

The following will automagically roll

def disk$appli disk:<.monday>-
disk:<.tuesday>-
disk:<.wednesday>-
disk:<.thursday>-
disk:<.friday>-
disk:<.saturday>-
disk:<.sunday>

Of course you have to create your directories <.monday> and so on.

You can of course use a little DCL to have a search list with many more
elements, and make it roll.

On Monday you can quietly move the files in <.tuesday> and the other
directories elsewhere, without disturbing the application.
Jan van den Ende
Honored Contributor

Re: Never ending ZIP process ...

Art,

I met a situation not unlike yours, albeit a little bit smaller.

_DO_ consider Hein's search list solution!

And Wim wrote
>>>
This will of course not solve your problem now, because you are already stuck with the big directory.
<<<

Well, that is true, but not completely.

You _CAN_ migrate to an easier structure, but getting there in a more or less system-friendly way DOES require some extra human effort.

- decide what is a "reasonable" quantity of files per directory. (one day, one week, one month?)
- decide which of those units is already completed, and has its files furthest down in the directory file.
- create a suitably named subdirectory for those, and add it in the logical name search list (of course, the active directory goes first).
- rename the applicable files to the subdir. It will not be fast, but A LOT faster than renaming the top ones!
- repeat until all files moved.
(you may well prefer scripting the above; a rough sketch follows below!)
- create a temporary directory for the active (new) files
- rename everything still in the old dir to the new one.
- delete the (now empty) _BIG_ directory file.
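
A hedged DCL sketch of one round of that migration (all names, the logical, and dates are illustrative):

$ create/directory disk:[invoices.2006_07]
$ define invoice_dir disk:[invoices.active],disk:[invoices.2006_07]
$ rename disk:[invoices.big]*.txt;* /since=01-jul-2006 /before=01-aug-2006 disk:[invoices.2006_07]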

It is a pain, but worth the effort.

And now, educate the app developers (if they can still be traced) about their silliness.

success.

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Art Wiens
Respected Contributor

Re: Never ending ZIP process ...

I've restarted the zip process using the /batch qualifier. It's executing in a directory with ~ 60,000 files (the oldest three months I renamed out of the way). The list I gave it covers ~ one third of the files. The names in the list are exact, i.e. no wildcards.

Jon had a question earlier about what files the job had open ... as it grinds away right now, it has only the list file open, specified by the /batch switch (see attached).

Previous experience with these large ZIP jobs is that they don't open the temporary zip file (which gets renamed to the final zip file) until it has "processed" the selection of files. Whatever "processed" entails.

Art
Jon Pinkley
Honored Contributor

Re: Never ending ZIP process ...

Does $1$DGA79 have the files that are being zipped? That device was busy at the time you took the snapshot.

Channel Window Status Device/file accessed
------- ------ ------ --------------------
00D0 00000000 Busy $1$DGA79:
SDA>

That may be activity verifying that the files exist before starting the actual zipping. I don't know enough about the internal format of the zip container layout; perhaps it is similar to a CD in that it may need to create, or at least preallocate, space for the directory before it adds anything to the zip file.

Steven S. will know.

If it really needs to preallocate table of contents space, then I wonder how it handles a file being deleted between the initial scan and the archival of the file.

Seems odd it would need to scan the whole list if it is explicitly provided.

Jon
it depends
Steven Schweda
Honored Contributor
Solution

Re: Never ending ZIP process ...

> Previous experience with these large ZIP
> jobs is they don't open the temporary zip
> file (which gets renamed to the final zip
> file) until it's "processed" the selection
> of files. Whatever "processed" entails.

Sounds right to me.

> Steven S. will know.

Now _there's_ an optimist.

I believe (glancing at code I normally
ignore) that Zip runs through the whole file
list before it does any serious work, looking
for names to be filtered out by the include
and exclude or date-time options, duplicate
names (easy with /NOFULLPATH, -j), the name
of the archive itself, and so on, creating a
big to-do (linked) list as it goes, including
the actual file specs and the names to be
used in the archive (which may be different),
and some other data which will be stored in
the archive for each file.

> [..] I wonder how it handles a file being
> deleted between the initial scan and the
> archival of the file.

I've never tried it. I assume that you get a
complaint.
Art Wiens
Respected Contributor

Re: Never ending ZIP process ...

Things to remember for "next time".

- Don't let 250,000 files accumulate in a directory ;-)

- Use ZIP_CLI for more "familiar" DCL-like qualifiers

- If you do let them accumulate, move the files you want to ZIP to a temporary location first, i.e. don't try to ZIP in place in the directory containing 250,000 files

- Create a list of files you want to ZIP. Use it by supplying the /BATCH=listfile.txt qualifier.

- Use the /MOVE qualifier to automatically delete the files from disk after they have been added to the archive

- Have patience! It takes a lot of time!
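
Putting it together, roughly (all names illustrative; untested as written):

$ create/directory disk:[dir.tozip]
$ rename disk:[dir.big]*.txt;* /before=01-oct-2006 disk:[dir.tozip]
$ directory/columns=1/nohead/notrail/output=tozip.lis disk:[dir.tozip]*.txt;*
$ zip_cli disk:[dir.tozip]old_invoices.zip /batch=tozip.lis /move /keep /vms /nofull_path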

Thanks all, I will take the help provided and try to set things up a little smarter for the future.

Cheers,
Art