Operating System - OpenVMS

ZIP performance

 
Art Wiens
Respected Contributor

Re: ZIP performance

The 800 5/500 is the "most powerful" box in this cluster.

I might try to restore this mess onto an ES47 / EVA8000, and ZIP it there, but it will still be a several hour restore.

Art
Steven Schweda
Honored Contributor

Re: ZIP performance

> but it will still be a several hour restore.

And speaking of restoring, having Zip use a
reverse-sorted file list might help it delete
the files faster, but that will leave you
with a reverse-sorted archive, and if UnZip
were used to restore those files, it would
do it in the archive-member (reverse) order,
which would be the maximum-shuffle method at
that time. It's a conspiracy.

> The 800 5/500 is the "most powerful" box in
> this cluster.

What does it use for memory? I haven't
looked lately, but I once found a whole 2GB
kit for an XP1000 for about $20 on Ebay.
(It's ECC. If it doesn't actually catch
fire, how much damage could it do?)

I can eat up 512MB with only a couple of Web
browsers.
Hein van den Heuvel
Honored Contributor

Re: ZIP performance

Hein> You have to rename in reverse order to make the rename not take too long.

Jim> Would you not then pay nearly the same price for insertion into the new directory that you save during the removal from the old directory?

The directory shuffle problem scales with the directory's size, and the more files you have, the more often you have to shuffle, making it an n-squared problem.

Adding a new low block to a 10,000-entry directory is 6 times faster than removing the bottom block from a 60,000-entry directory.
Add to that some better caching and there is a good net gain.
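To see where that 6x comes from, here is a toy model of the block shuffle (my own sketch, not the real XQP algorithm; the 19-entries-per-block figure is an assumption for short names like these):

```python
# Toy model of the Files-11 directory shuffle: entries live in 512-byte
# blocks, and when the lowest block empties (or a new lowest block must be
# opened), every block after that point gets shuffled, so the cost is
# proportional to the number of blocks that follow.

ENTRIES_PER_BLOCK = 19  # assumed figure for short file names

def blocks(n_files):
    """Directory size in blocks for n_files entries (ceiling division)."""
    return (n_files + ENTRIES_PER_BLOCK - 1) // ENTRIES_PER_BLOCK

def insert_front_cost(n_files):
    """Blocks shuffled to open up a new lowest block."""
    return blocks(n_files)

def remove_front_cost(n_files):
    """Blocks shuffled when the lowest block of the directory empties."""
    return blocks(n_files) - 1

small = insert_front_cost(10_000)  # add a low block to a 10,000-entry dir
large = remove_front_cost(60_000)  # remove bottom block of a 60,000-entry dir
print(large / small)               # close to the 6x ratio quoted above
```

Repeating a front-of-directory operation once per emptied block over the whole directory is what makes the total n-squared.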

For a worse problem I once scripted a solution with temporary file names, renaming them twice, from high to low, into the final directory.

10,000 may already be too much for the system in question; maybe a 128-block directory should be the cutoff...

But that 128 was only really important in older versions. Around 1999 RMS dropped the 127-block limit for its directory scans: it now allocates the whole directory in a single big buffer... for searches.

Maybe it is NOT using an RMS search, and is just letting the XQP do the name-to-FID mapping.
The XQP still has this somewhat brain-dead 'lookup' table to map the first few file name characters to a VBN. But that does not help much (or at all) as the directory gets big and the files mostly share fixed leading characters (like MAIL$, or 20090903 :-).
To make room for more pointers, it reduces the number of characters indexed from 14 for directories of less than 240 blocks down to 2 for directories over 1200 blocks. Not very selective... when you need it most.

Hmmm... We have not been given any filename examples. Would they all happen to start with a more or less fixed sequence over hundreds of files? Average size? Total directory size?
Can you try making the names shorter, notably more unique in the leading characters?
Turn "MAIL$" into "" (nothing) and "2009" into "9" as an exercise?

And uh, what is SYSGEN param ACP_MAXREAD set to?


Art>> MONI MODE shows ~60% Kernel Mode, 10% Interrupt State and 30% Idle Time

So clearly ZIP is not ZIPPING, but preparing. Real zip work would show lots of USER mode.

fwiw,
Hein.

Jim_McKinney
Honored Contributor

Re: ZIP performance

> Adding a new low block to a 10,000 entry directory is 6 times faster then removing the bottom block from a 60,000 entry directory.


Thank you very much for this - it's not intuitive to me. I imagined that they were roughly equivalent.
Andy Bustamante
Honored Contributor

Re: ZIP performance

Assuming the performance issue is with the delete operations:

Rename the directory and start a new application directory. Create your Zip archives against the new directory, 5,000 or so files at a time with the delete option. When you have successfully created your zip files, use DFU to wipe the directory.
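For what it's worth, the batch-and-delete idea looks like this as a generic sketch using Python's zipfile module (not VMS Zip; the batch size and archive naming are placeholders):

```python
import os
import zipfile

def archive_in_batches(files, batch_size=5000, prefix="invoices"):
    """Archive files in fixed-size batches; delete each batch only after
    its archive has been written and closed cleanly, so an interrupted
    run never loses files that weren't archived."""
    for i in range(0, len(files), batch_size):
        batch = files[i:i + batch_size]
        name = f"{prefix}_{i // batch_size:04d}.zip"
        with zipfile.ZipFile(name, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in batch:
                zf.write(f)
        for f in batch:  # safe: the archive is complete at this point
            os.remove(f)
        yield name
```

The per-batch delete is the point: each small archive-then-delete pass keeps the source directory shrinking from the high end instead of one giant pass.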

Andy
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Art Wiens
Respected Contributor

Re: ZIP performance

"but that will leave you
with a reverse-sorted archive, and if UnZip
were used to restore those files, "

Doesn't really matter ... we might have to pull 1 to 5 invoices out of an archive once in a while. The whole thing would never need to be restored.

About spending any money on this old stuff ... it's a strange "political" environment here to say the least. An extra nickel will not be spent on this hardware. We spent a million bucks three years ago on ES47's but there's no money for support resources to "port the applications". I have recompiled and linked the apps ... it all works, but there is no time to "bless it". No power users available to do user acceptance testing. Very frustrating!

"We have not been given any filename examples."

eg:

0020000.TXT;3 11/137 23-MAY-2007 14:09:24.69
0020000.TXT;2 11/137 22-MAY-2007 15:17:50.30
0020000.TXT;1 11/137 18-MAY-2007 13:19:23.16
0020001.TXT;3 11/137 23-MAY-2007 14:09:24.73
0020001.TXT;2 11/137 22-MAY-2007 15:17:50.34
0020001.TXT;1 11/137 18-MAY-2007 13:19:23.20
0020002.TXT;3 11/137 23-MAY-2007 14:09:24.77
0020002.TXT;2 11/137 22-MAY-2007 15:17:50.48
0020002.TXT;1 11/137 18-MAY-2007 13:19:23.23
0020003.TXT;3 11/137 23-MAY-2007 14:09:24.81
0020003.TXT;2 11/137 22-MAY-2007 15:17:50.53
0020003.TXT;1 11/137 18-MAY-2007 13:19:23.27
0020004.TXT;3 11/137 23-MAY-2007 14:09:24.85
0020004.TXT;2 11/137 22-MAY-2007 15:17:50.58
0020004.TXT;1 11/137 18-MAY-2007 13:19:23.31

The file names are a relatively sequential invoice number. There are some gaps ... not sure why.

And yes, some challenged individual ran billing three times in 5 days!! But just in May 07.

The reverse list may have made some improvement! 5.5 hours in and it has the temp ZIP file open already and has added ~13,000 files to it! "Flying" now!

Cheers,
Art
Hein van den Heuvel
Honored Contributor

Re: ZIP performance

Jim wrote>> Thank you very much for this - it's not intuitive to me. I imagined that they were roughly equivalent.

Yeah, the algorithm is a simple shuffle up or down as needed when a block is filled or emptied.
The shuffle is done ACP_MAXREAD blocks at a time, but only moves the contents 1 block up or down in total.

Art,
depending on the exact distribution of file names in the first 2 characters, for your order files the system has a threshold at about 28,000 files.
The actual threshold is a directory size greater than 1440 blocks.
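The arithmetic behind those two numbers, spelled out (the ~19 entries per directory block is an assumed figure for short names like NNNNNNN.TXT):

```python
# Sanity check of the 28,000-file / 1440-block threshold.

ENTRIES_PER_BLOCK = 19            # assumed for short file names
THRESHOLD_BLOCKS = 6 * 240        # = 1440 blocks, where behavior degrades

print(THRESHOLD_BLOCKS * ENTRIES_PER_BLOCK)  # 27360, i.e. roughly 28,000 files
```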

Got the proof.

I created a tiny command file to populate a directory with SET FILE/ENTER commands.

And I created a tiny program (attached) to open 100 files in that directory, incrementing an initial number by a specified step.
Run that against a 5,000-file directory as a baseline:

GEIN $ mcr sys$login:tmp 50
ELAPSED: 0 00:00:02.10 CPU: 0:00:00.12 BUFIO: 302 DIRIO: 31
ELAPSED: 0 00:00:02.03 CPU: 0:00:00.12 BUFIO: 302 DIRIO: 14

Next, 24,000:

GEIN $ mcr sys$login:tmp 240
ELAPSED: 0 00:00:02.17 CPU: 0:00:00.07 BUFIO: 302 DIRIO: 67
GEIN $ mcr sys$login:tmp 230
ELAPSED: 0 00:00:02.29 CPU: 0:00:00.25 BUFIO: 302 DIRIO: 64

no big changes

Now for 28,000 (6 * 240 blocks * 19 files per dir block):
GEIN $ mcr sys$login:tmp 280
ELAPSED: 0 00:00:31.24 CPU: 0:00:00.99 BUFIO: 302 DIRIO: 1854
ELAPSED: 0 00:00:39.81 CPU: 0:00:00.98 BUFIO: 302 DIRIO: 2017

Ah ... more than 10 times slower!


Mind you, this problem is fixed in 8.3

System: GEIN, AlphaServer DS10L 466 MHz

For the curious few, programs and details, such as a directory index block from 8.3, are in the attachment.

Hope this helps,
Hein.
Hein van den Heuvel
Honored Contributor

Re: ZIP performance

Wrong attachment. Trying again. Hein.
Steve Reece_3
Trusted Contributor

Re: ZIP performance

Without knowing much about the application it's difficult to see how this can go any better. You're creating lots of itsy bitsy files in one directory. With hardware like the AlphaServer 800 5/500 it's going to be highly painful. Your hardware in terms of bus speeds and IO controllers is likely to be limiting the speed of the zip, even without impact from the rotating media and the software/cache.

You don't indicate how long you need to keep files for or on what frequency/volume new files are created. What's the growth per day/week/month in storage requirements? Do you have loads of space available? I hope the files are on a disk other than the system disk too but you don't mention.

There was also a bug that I'm aware of where, when the directory file gets within a few hundred blocks of 32767 blocks, the directory file structure apparently becomes corrupt or otherwise inaccessible. That's painful to get around too, as you need to do ANA/DISK/REPAIR passes to move files into [SYSLOST], then quit part way through so that you don't corrupt [SYSLOST] as well...!

If you have the space and the application is well enough behaved, I'd be tempted to backup the directory/ies that your invoice files are in and anything else on that disk, reinitialize the volume with more reasonable characteristics, recreate everything else on the volume (directory structures, ownership, other files) and, if necessary, the last invoice file so that the application knows where to go, then just leave the files to grow - possibly using a search list and several directories so that you can swap new directories in occasionally and giving a method for binning off old invoices if you can.
You can "back fill" the invoice files from tape if you really want them back on disk again if you have the last invoice file already restored or if the application knows the name of the file it needs to write next.
Steve
Hein van den Heuvel
Honored Contributor

Re: ZIP performance

>> Without knowing much about the application it's difficult to see how this can go any better.

You might not have read my reply before you wrote this.

Of course this can be done better.
This was a poor implementation by OpenVMS, and it was fixed. Or rather: mitigated. A real fix required b-tree directories or some such.

60,000 files at 11 used blocks each is just 322 MB, and they fit in a 1.5 MB directory.
Give each file 1 full disk rotation and 1 max seek on an old disk, round that up to a generous 50 milliseconds, and you still get 20 files/second = 3000 seconds. In reality it will be closer to 15 ms, for 60 files/second, or about 20 minutes to read the whole lot. Double that if you insist on accounting for the required file-header IO. Note: the files are contiguous, as each fits in a single cluster.
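That back-of-envelope estimate, spelled out (the per-file service times are assumptions, not measurements):

```python
# Worst-case vs. realistic time to read all the files, given an assumed
# per-file service time (seek + rotation) on an old disk.

n_files = 60_000
pessimistic = n_files * 0.050   # 50 ms per file -> 3000 seconds
realistic = n_files / 60.0      # 60 files/second -> 1000 s, about 17 min

print(pessimistic, realistic / 60.0)
```

Either way the raw I/O is minutes, not hours, which is the point: the hours Zip spends must be going somewhere other than reading data.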

Also, Art mentions 'hours' before data started to be written, and 11 hours of CPU time, later refined to 60% kernel, 10% interrupt. Even if ZIP goes to the file headers for some attributes, that is an unreasonable amount, suggesting a performance aberration in the lookups.

>> You're creating lots of itsy bitsy files in one directory. With hardware like the AlphaServer 800 5/500 it's going to be highly painful.

Beg to differ. It's the specific use of the software that's to blame. If those files were spread over even just 10 directories you would never have heard about it.


>> There was also a bug that I'm aware of that when the directory file gets within a few hundred blocks of 32767 blocks the directory file structure apparently becomes corrupt or otherwise in accessible.

Correct. It causes a BADIRECTORY error on a good day, a crash on bad days. I actually found and fixed that on 2-Sep-2004... 5 years to the day. Ken Blaylock released the fix for 7.3-1 and onwards. The SHFDIR code did a simple 16-bit operation on what in reality was a complex 32-bit field (EBK and HBK are word-swapped for historical (hysterical?) reasons).

At that time I wrote in a note: "In testing I was also somewhat surprised to see the (7.1) system degrade significantly when adding entries in order to the end of a directory. I had kinda expected that adding to the end of a pre-allocated directory wouldn't hurt too much. It did."

Now we know why. My simple test to load a directory up to 65K blocks into a (DFU) pre-allocated directory never finished: Here is the START of that output.

$ @DIR_TEST
1000000 size: 1 time: 0
1001000 size: 200 time: 1399
1002000 size: 400 time: 1412
:
1007000 size: 1400 time: 1570
:
1027000 size: 5400 time: 20753
1028000 size: 5600 time: 21334
1029000 size: 5800 time: 22016

[To test this expediently I used a DCL loop to read an (oversized, many-version) directory record and append it over and over, tweaking the name to be new and in order, to fill the bulk of the file. Proper SET FILE/ENTER for the tail.]

>> - possibly using a search list and several directories so that you can swap new directories in occasionally and giving a method for binning off old invoices if you can.

Yes, that would work. But I get the impression that would have to happen during the SINGLE? run that created the files?


Btw... the "Directory Index Cache" is documented in Kirby McCoy's little black bible "VMS File System Internals". It even explains: "A size of 15 bytes was picked because MAIL$800... files that are about a day apart in creation time vary in the fourteenth or fifteenth character."


Art>> Steven, I hope I have provided enough details to help your psychic powers along.

I missed that on early reading. Nice! The man knows his audience. :-)

It would be somewhat interesting to know what 'zip' does in this 'scanning' phase, but not interesting enough to pick up and open up the sources (because I know I would spend way too much time in there).

Also, a wildcard operation instead of a file list might help here, as that triggers RMS to use its own directory cache/scan.


Cheers all!
Hein.