Operating System - OpenVMS

Re: Overheads of large .DIR files?

 
H.Becker
Honored Contributor

Re: Overheads of large .DIR files?

Forget my previous posting. At the moment I can't tell you why I ended up with partially filled blocks in the directory file. At least I can't reproduce it at will.

Sorry for the confusion.
Hoff
Honored Contributor

Re: Overheads of large .DIR files?

But more on topic here, have you looked at what's going on with these 18,000 log files a day, and what you can do to reduce or compress or otherwise post-process this stuff?

Whether it's a scanner looking for errors or using a zip to create a single daily file or a DECset DEC Test Manager DIFFERENCES-based comparison with previous (template) logs or using a customized print symbiont to stuff these logs into a zip archive as they arrive or stuffing all this stuff into a key-value database or otherwise, 18,000 separate log files a day (much less a quarter-million logs) just isn't a manageable quantity.

It'd be interesting to know how often folks actually ever look at any of these logs and why and how, and working backwards from there.
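
To make that concrete, a nightly batch job along these lines (all of the names below - the Info-ZIP image location and the DISK$LOGS directories - are placeholders, not anything from your site) would sweep each day's logs into a single archive:

$! sweep_logs.com -- sketch only; adjust the names to suit
$ zip := $disk$tools:[utility]zip.exe           ! foreign command for the Info-ZIP image
$ day = f$cvtime(,"ABSOLUTE","DATE")            ! e.g. 14-APR-2010
$!
$! "-V" preserves the VMS file attributes, "-m" removes the originals
$! once they are safely inside the archive.
$ zip "-V" "-m" DISK$LOGS:[ARCHIVE]LOGS_'day'.ZIP DISK$LOGS:[LOGS]*.LOG;*
$!
$! Requeue for the next midnight (use the full file specification in practice).
$ submit/after=tomorrow sweep_logs.com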

And Hartmut, it's more than the reformatting. The whole of the ITRC interface itself is not what it could or should be. The StackOverflow implementation is vastly superior, for instance.

John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Hoff,

The "business" side of this business wants the log files. They would in fact like up 30 days worth to be available in case someone needs to check back through them for something. The agreed position is 7 days worth available individually and immediately, and 4 weeks worth in daily zips than can be unzipped if required.

You seem to forget that we VMS users are sometimes constrained by business requirements and what's already established practice on site. And getting established practices changed is rarely easy, as others posting to this VMS forum have indicated.

We work within whatever constraints are imposed on us. When you fail to understand this, your comments are often of limited immediate value.
Hoff
Honored Contributor

Re: Overheads of large .DIR files?

Um, forget? No. Many folks and many users are good at incremental changes, and unfortunately less often expect, visualize or propose larger changes. It's one of the pitfalls of asking for user feedback.

Nor have I forgotten how it's possible for an organization or an individual to talk itself out of fixing bugs, and talk itself into accepting applications that are inadequate or ill-performing or ill-suited or, well, crufty and cranky.

And irrespective of the organization, your designs and your solutions and your advances are only as constrained as _you_ allow yourself to be.

Come up with a better way to solve this. Something which doesn't involve storing and managing a directory containing 18,000 files a day. Whether this is solved through incremental changes or more significant work, this design is not scaling, and it's only going to get more problematic for you.
Robert Gezelter
Honored Contributor

Re: Overheads of large .DIR files?

John,

I am somewhat surprised that only four weeks are being retained. I have seen quite justified legal requirements for far longer retention periods on archived logs, particularly in my work with e-Discovery issues.

I would actually suggest using the Julian date as the subdirectory name, with a possible logical name pointing to the actual weekday as a shorthand.

The details are not all obvious, but from my experience with similar problems I would expect that much of what I described could be implemented without the need to touch actual code. Care would be required, however, to prevent naming issues from causing confusion at a later date.
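
As a very rough sketch of the mechanics (DISK$LOGS, the [LOGS] root and the logical names below are placeholders only, I have used a yyyymmdd form rather than a true Julian day purely for illustration, and the /SYSTEM logicals need the appropriate privilege), a small procedure run once per day might look like:

$! new_log_day.com -- sketch only
$ day = f$cvtime(,"COMPARISON","DATE") - "-" - "-"       ! e.g. 20100414, sorts correctly
$ weekday = f$edit(f$cvtime(,"ABSOLUTE","WEEKDAY"),"TRIM,UPCASE")
$!
$! One subdirectory per day.
$ create/directory DISK$LOGS:[LOGS.'day']
$!
$! Shorthand logicals so jobs and people never need to know today's date.
$ define/system LOG_TODAY      DISK$LOGS:[LOGS.'day']
$ define/system LOG_'weekday'  DISK$LOGS:[LOGS.'day']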

- Bob Gezelter, http://www.rlgsc.com
John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Hoff,

You are wasting my time.

You say "And irrespective of the organization, your designs and your solutions and your advances are only as constrained as _you_ allow yourself to be."

You don't seem to appreciate the difficulty of getting approval for major changes when multiple stakeholders are involved at different levels of management authority, where some may have initiated the original situation and a quick fix is preferred because other things will soon have priority.

Also "Come up with a better way to solve this. Something which doesn't involve storing and managing a directory containing 18,000 files a day. Whether this is solved through incremental changes or more significant work, this design is not scaling, and it's only going to get more problematic for you."

Don't tell me how the systems here should be run. The problem has appeared on test systems rather than production systems, which don't generate this volume of log files and are not expected to.

I'm also sure that other people face similar problems. Would your answer to all of them be "redesign your application"?
John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Robert,

I take your point but don't worry. We have 4 weeks online and a heck of a lot more offline. The ZIP files have names that incorporate the timestamp for the day in question, besides which ZIP manages to preserve the VMS file attributes including the dates.

I've tried to condense the points made in the comments here into the attached document. (I hope the formatting is okay, it's a cut-and-paste from a Word doc because I figured not everyone could handle Word.)

I'd appreciate if you or anyone else can point out any serious errors or omissions. TIA.
P Muralidhar Kini
Honored Contributor

Re: Overheads of large .DIR files?

Hi John,

>> the idea from others that adding to the end of a directory is most efficient
Yes, that's right. It is good to have the application create filenames such that
they get added at the end of the directory. A directory keeps its entries sorted
in ascending order, so if files are created with names that go at the end of the
directory, the XQP has to spend less time shuffling blocks in order to insert the
new file.

The best case is for the file to go exactly at the end of the directory,
in which case no shuffling is required and just an EOF update of the directory file is enough to add the new entry.
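
For example (LOG_DIR, MYJOB.COM and the MYAPP name below are only placeholders), giving each log a reverse-date/time prefix keeps every new entry at the tail of the directory:

$! A yyyymmddhhmmss prefix is already in ascending ASCII order, so each
$! new log file lands at (or very near) the end of the directory.
$ stamp = f$extract(0,19,f$cvtime(,"COMPARISON"))        ! 2010-04-14 13:05:22
$ stamp = stamp - "-" - "-" - " " - ":" - ":"            ! 20100414130522
$ submit/log_file=LOG_DIR:[LOGS]'stamp'_MYAPP.LOG myjob.com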

Regards
Murali
Let There Be Rock - AC/DC
P Muralidhar Kini
Honored Contributor

Re: Overheads of large .DIR files?

Hi John,

Nice summary you have posted from the various suggestions made here.
I had a look at the attachment and have the following comments -

>> 8. Directory lookups are managed by the XFC internal utility, which holds
>> a simple index based on file names and pointers to the block of the .DIR file
>> in which they can be found.

It should be XQP and not XFC.

XQP (eXtended QIO Processor) is the metadata cache in VMS, whose job is to
cache file headers & directory files.

XFC (eXtended File Cache) is the data cache in VMS whose job is to cache
file contents.

The directory index is created and managed by XQP itself.


>> 9. The XFC utility has a limited amount of space for these file names so
>> as the number of filename entries increases in the directory file it reduces
>> the number of characters that it holds for each filename (e.g. AAA:1, ABC:3).
>> The use of identical characters at the start of each filename will reduce
>> the effectiveness of this lookup - if all filenames are identical up to the
>> number of characters XFC is using then there's no option for XFC but to start
>> looking for a file from the beginning of the .DIR file.

Again, it's the XQP that's involved here and not XFC.

The directory index is of size 1 block.

The search will not always start from the beginning of the .DIR file.

When there are a large number of files whose names are similar in the first few
characters, each entry in the XQP directory index points to a large number of
files in the directory. You still use the index to go directly to a particular
point in the directory, but because each index entry covers so many files, the
XQP then has to do a linear search, scanning through a lot of entries in order
to find the matching filename. This linear search is what takes a lot of time.

In general, when we talk about file headers and directories, XFC is nowhere
involved; it's the XQP which does all the work.
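
If you want to watch the two caches separately, these standard commands show them (assuming XFC is enabled on the system):

$ show memory/cache              ! XFC - data cache statistics
$ monitor file_system_cache      ! XQP caches - directory index/data, file headers, etc.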

Regards
Murali
Let There Be Rock - AC/DC
Jon Pinkley
Honored Contributor

Re: Overheads of large .DIR files?

John,

As P Muralidhar Kini said, that was a nice summary.

Another comment:

>>5. [Expansion type 2] If the insertion of a new file
>>(or file version) would force the entire .DIR file to
>>exceed its allocation then the new size is
>>(the current size + disk cluster size) and VMS must
>>find a free and contiguous space of this new size,
>>then copy the current directory into it before taking
>>the normal action to insert the new entry

VMS will not always expand by just one cluster. It appears to be related to the current size of the file. So it isn't quite as bad as it could be.
See the attachment for the command file that produced the following:

$ @cause_directory_expansion ! tested on 8.3
CLUSTER = 8 Hex = 00000008 Octal = 00000000010
INITIAL_ALLOCATION = 8 Hex = 00000008 Octal = 00000000010
CNT = 35, Prev 8, New 16, delta 8
CNT = 68, Prev 16, New 24, delta 8
CNT = 103, Prev 24, New 40, delta 16
CNT = 171, Prev 40, New 64, delta 24
CNT = 269, Prev 64, New 96, delta 32
CNT = 399, Prev 96, New 144, delta 48
$
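
In outline the command file does something like the following (a sketch only, not the attachment itself; the [.ITRCTEST] directory and the 400-file limit are just for illustration):

$! cause_directory_expansion.com -- outline only
$ create/directory [.itrctest]
$ cluster = f$getdvi("SYS$DISK","CLUSTER")
$ prev = f$file_attributes("ITRCTEST.DIR","ALQ")
$ write sys$output "CLUSTER = ''cluster'   INITIAL_ALLOCATION = ''prev'"
$ cnt = 0
$loop:
$ cnt = cnt + 1
$! Create an empty file with a long name; only the directory entry matters here.
$ open/write tmp [.itrctest]THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS'cnt'
$ close tmp
$ alq = f$file_attributes("ITRCTEST.DIR","ALQ")
$ if alq .eq. prev then goto next
$ delta = alq - prev
$ write sys$output "CNT = ''cnt', Prev ''prev', New ''alq', delta ''delta'"
$ prev = alq
$next:
$ if cnt .lt. 400 then goto loop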

Jon
it depends
Jon Pinkley
Honored Contributor

Re: Overheads of large .DIR files?

The output isn't as self-documenting as it should be. CNT was the number of new long file names created, Prev was the previous directory allocation, New the expanded allocation, and delta the difference (how big the expansion was).

Jon
it depends
John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Thanks Muralidhar and Jon.

I've updated the notes but I'll wait a while in case of further revisions before I post it here. (So long as I remember to do this before closing this thread!)

Jon, were all your filenames the same length? I ask because it looks like a 16-block directory file held 35 filenames and a 64-block directory file held 171 (which is more than 4 * 35). Maybe the inconsistent increase is due to inconsistent filename lengths??

Muralidhar, are you in VMS engineering? If so, are there any plans to restructure and improve directory files? (I appreciate that you might be limited in what you can say.) I can see potential improvements for some types of lookups but nothing that can do everything (except perhaps using an indexed file and letting RMS do the work).
Hein van den Heuvel
Honored Contributor

Re: Overheads of large .DIR files?

Fine summary John.

- The 'add to end' is very literal.
Any disturbance will cause a 50/50 split.
One dangling 'zzz' can cause up to 50% waste!

(This is much like inserting records by primary key sequence in RMS indexed files)

- The placeholder thought also has potential for 'rotating names', for example t4-node-dd-mmm-yy or hhmm_xxxx or nnn_xxxx where nnn_ is a batch number that rotates back to 0 all the time. The placeholder name should be as short as possible, so as not to eat into the limited 512 bytes/block too much. (Not .keep, but .k :-)

John 1) The directory growth formula is: min(1/2 the current size, 100), rounded up to a cluster boundary of course (e.g. a 96-block directory grows by min(48, 100) = 48 blocks, which matches Jon's last line). There is no 'default extend quantity' honored for directories.

John 2) When doing tests like you did, I like using just names, not files. No wasted work, and much easier cleanup (just blow away directory if things escape)

For example, to test the effect of a 'zzzz' record I used:

$! --- tmp.com : load directory entries ---
$ if p1.eqs."" then p1 = 1
$ if p2.eqs."" then p2 = p1
$ i = p1 + 0
$loop:
$ set file/enter=[.tmp]'f$fao("!3ZL_!30*x.!30*x",i)' sys$login:login.com
$ i = i + 1
$ if i.le.'p2' then goto loop
$ mcr sys$login:CHECK_DIRECTORY.EXE tmp.dir


check_directory is the program published above, and attached here with a boundary condition fixed. The result is (names trimmed...) :

$ @tmp 1 20
Block ## First-name
000001 06 001_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
000002 06 007_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
000003 06 013_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
000004 01 019_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...

20 records in 4 blocks, minimum length 3.
$ set file/rem [.tmp]*.*;*
$ @tmp 999
000001 00 999_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...

1 records in 1 blocks, minimum length 1.
$ @tmp 1 20
000001 04 001_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
000002 04 005_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
000003 04 009_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
000004 04 013_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
000005 04 017_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...


Cheers,
Hein


John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Hein,

I'll get back to you regarding this latest posting of yours, but in the meantime ...

I ran your directory checker across a directory with filenames in the current structure (a consistent 11-character header, then 4 digits - 0002 to 1999 - then the PID) and was shocked to find that it wasn't until character 22 that the filenames were unique. Unfortunately there were 10152 files in the directory and the .DIR file was 262 blocks.

I've drafted a doc that describes a more efficient naming scheme and when I've thought about your latest posting here I might modify that draft.
H.Becker
Honored Contributor

Re: Overheads of large .DIR files?

>>>
- The 'add to end' is very literal.
Any disturbance will cause a 50/50 split.
One dangling 'zzz' can cause up to 50% waste!
<<<

Yeah, adding a new version to the last file in the directory is not 'adding to the end'.
Jon Pinkley
Honored Contributor

Re: Overheads of large .DIR files?

>>Jon, were all your filenames the same size? I ask because it
>>looks like a 16 block directory file held 35 filenames and a
>>64 block directory file held 171 (which is more than 4 * 35).
>>Maybe inconsistent increase is due to inconsistent file size??

No, they had non-zero-filled sequence numbers, and this also caused out-of-order entries:

Directory ROOT$USERS:[JON.ITRCTEST2]

THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS0;1
0/0 14-APR-2010 05:12:15.23
THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS1;1
0/0 14-APR-2010 05:12:15.24
THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS10;1
0/0 14-APR-2010 05:12:15.32
THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS100;1
0/0 14-APR-2010 05:12:16.19
THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS101;1

Here's the output of a version that creates fixed length file names

$ @CAUSE_DIRECTORY_EXPANSION_FIXED_LEN.COM
CLUSTER = 8 Hex = 00000008 Octal = 00000000010
INITIAL_ALLOCATION = 8 Hex = 00000008 Octal = 00000000010
FIRST_FILE = "ROOT$USERS:[JON.ITRCTEST3]THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS00000;1"
Files = 56, Prev dir ALQ 8, New dir ALQ 16, delta 8
Files = 112, Prev dir ALQ 16, New dir ALQ 24, delta 8
Files = 168, Prev dir ALQ 24, New dir ALQ 40, delta 16
Files = 280, Prev dir ALQ 40, New dir ALQ 64, delta 24
Files = 448, Prev dir ALQ 64, New dir ALQ 96, delta 32
Files = 672, Prev dir ALQ 96, New dir ALQ 144, delta 48
$

Directory ROOT$USERS:[JON.ITRCTEST3]

THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS00000;1
0/0 14-APR-2010 05:04:52.39
THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS00001;1
0/0 14-APR-2010 05:04:52.39
THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS00002;1
0/0 14-APR-2010 05:04:52.39
THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS00003;1
0/0 14-APR-2010 05:04:52.39
THIS_IS_A_LONG_FILE_NAME_THAT_WILL.HAVE_MANY_VERSIONS00004;1

Note it has a better fill factor (got 671 files before it expanded from 96 blocks to 144 blocks)

Jon
it depends
John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Hein,

You've twice mentioned a 50:50 split of blocks but (a) when is it done and (b) which blocks are split?

TIA.

Hein van den Heuvel
Honored Contributor

Re: Overheads of large .DIR files?

re: 50:50 split.

Take your target file name.
Imagine the XQP needed to look it up.
Using a binary search, aided by the lookup table, it finds the block (VBN #) where it should be, if it is there.

Now for the purpose of an insert that is still the target block.
If there is space, that's where the insert will happen.

But if there is not enough space, then that VBN will be split, and all entries beyond it will be shuffled up a block, ACP_MAX_READ (32) blocks at a time:
Read @EOF, write @EOF-32+1, read @EOF-64, write @EOF-64+1 ... until the insert point.
With the block split, the new entry will be written to the low or high part of the chunk, as dictated by its sorting order.

Except when the insert point is the last byte in the last (used) block. In that case the new entry goes to a fresh, new, last block.

As Hartmut reinforced, the rule is very strict. Even a new version for the last file is not 'the end' and will trigger a split, not a fresh block.

Cheers from Times Sq NY, OpenVMS TUD 2010.
Hein
John McL
Trusted Contributor

Re: Overheads of large .DIR files?

I found my own answer regarding the 50:50 split ... the last half of the block being split is copied to the next block (after all of the other subsequent blocks have been copied forward). Only after the 50:50 split is the new entry inserted, which will be according to sequencing (and into either of the two blocks that were previously one block).

Hein, you've confused me. I always thought it didn't matter whether it was a new filename that wanted to expand the block to over 512 bytes or just the insertion of a new version of a file. Is that what you are saying in different words?
P Muralidhar Kini
Honored Contributor

Re: Overheads of large .DIR files?

Hi John,

>> Muralidhar, are you in VMS engineering?
Yes. You guessed that right.

>> If so, are there any plans to restructure and improve directory
>> files?

Improving overall file system performance is something that is already on the
VMS wish list. But in this case we are specifically talking about performance
related to directory files, i.e.
- the time taken to look up an entry in a directory file
- the time taken to update the directory file when files are
  created/deleted/renamed in that directory

I agree that there is scope for improvement in this area and have made
a note of it. I will add this information as a subtopic under the main
topic of improving file system performance in the wish list.

Again, it's the VMS product management team that will decide which items
are selected from the wish list for implementation in the next VMS release.

You can also drop in a suggestion note to the VMS product management
team about this performance improvement idea so that this topic gets some
more visibility.

Also,
It would be interesting to know the % improvement that you would get
once you implement some of the recommendations that you have summarized
in your notes.

Regards,
Murali
Let There Be Rock - AC/DC
P Muralidhar Kini
Honored Contributor

Re: Overheads of large .DIR files?

Hi John,

>> I found my own answer regards the 50:50 split
Yes, correct.

When the XQP is trying to add an entry to a block and finds that the block has no
space for the new entry, the XQP tries to expand the directory in place.
For example, say the directory has blocks 1-10 and EOF is at 10, and we want to
add a file "c.txt" that should go in block 3. If there is no space in block 3 to
accommodate "c.txt", the XQP expands the directory in place: blocks 3 through 10
are shifted to blocks 4 through 11, and EOF is now at 11.

Block 3 would be split as follows -

Block 3
+------------------+--------+------------------+
|                  | record |                  |
+------------------+--------+------------------+
                   ^        ^
                   P2       P1

P1 is now at the record boundary just past the halfway mark, and P2 is at the
previous record boundary.
- If the new entry precedes P2, the XQP splits the block at P2 and the new entry
goes into the former block.
- If the new entry is at P1 or later, the XQP splits the block at P1 and the entry
goes into the latter block.
- If the new entry is a new version of the record at P2, the XQP splits the block
at P1 (i.e. if there is at least one more entry after P2), unless the record is
already the last in its block, in which case the XQP splits the block at P2
(i.e. there is no further entry after P2 in the block).
- If the XQP discovers that P1 is at the end of the block and P2 is at the beginning
(i.e. the block contains one large record), the XQP splits the record instead.


Splitting a block to accommodate a new entry occurs in both cases, i.e.
- if the new entry being added is a new file, and
- if the new entry being added is a new version of an existing file.

Regards,
Murali
Let There Be Rock - AC/DC
John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Muralidhar,

I used Jon Pinkley's technique to create file entries in a directory, together with Hein's use of SET FILE/ENTER=xxx. I started by creating 999 entries (a 122-block directory file) with unique 3-digit numbers at the start of each filename. Then I inserted file 0001...etc prior to file 001...etc and I found that
(a) the directory file expanded to 123 blocks as expected,
(b) the first block of the directory in its old form was split 50:50 across blocks 1 and 2 of the modified file, and
(c) the new file was correctly added prior to file 001... etc.

In other words the block split, but it did so in the middle rather than according to where the new file entry would be located. (Had the split occurred just after where the new file would be added, all of block 1 would have been copied to the newly released block 2 and only the new entry would have ended up in block 1.)

I'm not sure that this is consistent with what you've said above.

(I can see that sometimes splitting after a new file entry might have advantages but sometimes it wouldn't. I guess the problem is finding the best method for "most" situations.)
P Muralidhar Kini
Honored Contributor

Re: Overheads of large .DIR files?

Hi John,

>> In other words the block split, but did so in the middle rather than according
>> to where the new file entry would be located. (Had the split just after where
>> the new file would be added all of block 1 would have been copied to the
>> newly-released block 2 and only the new entry ended up in block 1.)
>> I'm not sure that this is consistent with what you've said above.

What I had mentioned was that, in the scenario you had given, the splitting
would always happen at the middle of the block and NOT at the place where
the new entry has to be inserted.

For your scenario, the following XQP logic would apply

For the block, P1 is now at the record boundary just past the halfway mark, and
P2 is at the previous record boundary.
- If the new entry precedes P2, the XQP splits the block at P2 and the new entry
goes into the former block.

In this case you are adding a file that would go at the start of the block,
i.e. it precedes P2, and hence the block split occurs at P2 (i.e. at the halfway
mark) and the new entry is added at its appropriate place in that block.

Regards,
Murali
Let There Be Rock - AC/DC
P Muralidhar Kini
Honored Contributor

Re: Overheads of large .DIR files?

Hi John,

I think the formatting of the block picture that I had given before is leading
to the confusion.

In the block picture,
P1 -> Record boundary just past the halfway mark
P2 -> Previous record boundary

So a split at P2 means a split somewhere near the halfway mark of the block,
depending on how big the record at the halfway mark is.

Regards,
Murali
Let There Be Rock - AC/DC
John McL
Trusted Contributor

Re: Overheads of large .DIR files?

I've been testing various alternatives to see the elapsed times, and while most changes have the effect that I expect - slower or faster - there's one odd finding.

I'm inserting 5000 file entries into a directory that ends up using 625 blocks.

When I start from the CREATE/DIRECTORY [.BIG_DIR] command, the directory file is 1 block used out of 64 allocated, and there are 6 increases of allocated space (64, 64, 128, 128, 128, 128). Inserting the file entries with these increases takes 3 minutes 50 seconds.

When I start afresh, having deleted all the file entries and BIG_DIR.DIR, and use CREATE/DIR/ALLOCATE=800 [.BIG_DIR], the loading of the file entries takes 7 seconds longer.

There's no new allocation of contiguous space and no copy of the current file, so can anyone explain why this should take more time?