
Re: Overheads of large .DIR files?

 
John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Robert (8:24 GMT)

- separate directories each day?
Yes, that's the plan. Access is via a logical name that a timer AST flips (scheduled at 23:59.99; it waits 1 second, then redefines the logical according to the current day). A rough DCL sketch follows after this list.

- search list across the directories?
Yes, definitely

- further subdirectories?
It would be a tough job to convince people of the necessity at this stage. We might see how things go with just the separate directories and keep this in reserve. Splitting them by node (i.e. different logicals on each node) should mean that if the PID was the first varying component of the file names then new files would be added at the end.
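The sketch I mentioned above; the disk, directory and logical names are purely illustrative:

$! Per-day logical, redefined by the timer AST just after midnight
$ DEFINE/SYSTEM/EXEC CUR_HTTP_LOGDIR HTTP_LOGDIR_TUESDAY
$!
$! Search list spanning the per-day directories, for lookups across days
$ DEFINE/SYSTEM/EXEC HTTP_LOGS -
        DKA100:[HTTP_LOGS.MONDAY],DKA100:[HTTP_LOGS.TUESDAY],DKA100:[HTTP_LOGS.WEDNESDAY]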

FYI, about a month ago the situation was that we put 4 different types of log files in one directory and had over 250,000 entries. A new slave was started because all other slaves were busy, but it was often the case that, before the new one was ready to do some work, one of the others would become free and would take the task. Some of the overheads at slave startup are our own doing, but I figured that .DIR management was probably also an issue.

Various changes have drastically reduced the number of slave processes and the reduction in log files has been significant but to my thinking the size of the directory files is still an issue (creates, deletes, lookups on INDEXF.SYS information). And that's why I've asked these questions.

Redesigning the whole architecture is not an option so my solutions have to be fairly tight and preferably involve minimal changes.
John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Hein,

This XFC caching of the first few characters is interesting.

I seem to have two options:

(a) the idea from others that adding to the end of a directory is most efficient, but when the current space (presumably allocated rather than used) is full, a larger number of contiguous blocks must be found and the original .DIR copied into them.

(b) Your idea of preallocating a large directory then populating it with dummy entries that will help XFC performance if I also vary the first few characters of the file names.

Your approach looks interesting but I'm not sure that I could get it implemented here. I also have some questions re your approach:

- Are the dummy entries files that really exist and wouldn't you need special conditions to avoid deleting them?

- What happens to your system if the load suddenly surges and the number of log files jumps? Won't this potentially mean directory expansion and possible splits of the structure that you carefully put together?
John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Andy,

I suspect that disk size and cluster size are probably not an issue. We have big disks (220 million blocks) with plenty of space on most, and cluster sizes of either 32 or 64 blocks.

Logicals spread over disks? My current plan has a logical pointing to the current directory; it does so by pointing to a per-day logical name, and it's this second level that points to a specific disk and directory (i.e. CUR_HTTP_LOGDIR -> HTTP_LOGDIR_TUESDAY -> disk & directory), roughly as sketched below. The aim was to make it flexible and allow system managers to use whatever disks and directories they want.
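For example (device and directory names here are just illustrative):

$ DEFINE/SYSTEM/EXEC HTTP_LOGDIR_TUESDAY DKA100:[HTTP_LOGS.TUESDAY]
$ DEFINE/SYSTEM/EXEC CUR_HTTP_LOGDIR HTTP_LOGDIR_TUESDAY
$! writers open CUR_HTTP_LOGDIR:xxx.LOG and RMS translates the chain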

Hein van den Heuvel
Honored Contributor
Solution

Re: Overheads of large .DIR files?

>> (a) the idea from others that adding to the end of a directory is most efficient, but when the current space (presumably allocated rather than used) is full, a larger number of contiguous blocks must be found and the original .DIR copied into them.

a1) it is easy enough to pre-allocate.
a2) an occasional re-allocate (once per extend = minimum once per cluster) is much cheaper than a full shuffle every time some middle block is filled. But there is a price to pay at cleanup time.

(b) Your idea of preallocating a large directory then populating it with dummy entries
- Are the dummy entries files that really exist and wouldn't you need special conditions to avoid deleting them?

b1) I would certainly give them a special extension (.X ? .KEEP ?) and/or version number.
b2) If they are deleted, well, then there is no functional harm done. You could just rerun the seed tool before any further deletes.
b3) You could make them just dummy File-ID entries. SYS$ENTER will accept any number you like: 1,1,0, 123,123,0, whatever. See the sample tool below. But it is probably better to use $ SET FILE/ENTER=.X PLACE_HOLDER.DAT.
That file PLACE_HOLDER.X could be protected against delete and have contents that explain its purpose.
b4) I would choose seed names as short as possible, just enough to force the right distribution, so as not to eat too much space in the 512-byte directory block. Rounding up to an even size is free. (The name field is always rounded up to a word.) A rough DCL sketch of the whole thing follows below.
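Something along these lines (directory name, allocation size and seed names are just examples):

$! a1: pre-allocate the directory file itself (size in blocks)
$ CREATE/DIRECTORY/ALLOCATION=2000 DKA100:[HTTP_LOGS.TUESDAY]
$ SET DEFAULT DKA100:[HTTP_LOGS.TUESDAY]
$! b3: one real place-holder file, protected against delete
$ CREATE PLACE_HOLDER.DAT
This file backs the *.X seed entries - please do not delete it.
$ SET FILE/PROTECTION=(S:RWE,O:RWE,G:RE,W:RE) PLACE_HOLDER.DAT
$! b4: scatter short seed names across the name space
$ SET FILE/ENTER=AA.X PLACE_HOLDER.DAT
$ SET FILE/ENTER=AM.X PLACE_HOLDER.DAT
$ SET FILE/ENTER=BA.X PLACE_HOLDER.DAT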

>> What happens to your system if the load suddenly surges and the nunber of log files jumps? Won't this potentially mean directory expansion and possible splits of the structure that you carefully put together?

Yes, but nothing will be worse than today. Just not as optimal as it perhaps could be.
Delete ($set file/remove) all the *.X; file name entries and re-seed ( re-seat ? :-).
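Roughly this (the wildcard spec is illustrative):

$ SET FILE/REMOVE DKA100:[HTTP_LOGS.TUESDAY]*.X;*

which drops only the seed directory entries and leaves the place-holder file itself alone; then rerun the seed procedure.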

fwiw,
Hein


$ type enter.c
/*
** enter.C create directory entry for a file ID.
**
** Have fun, Hein van den Heuvel, HP 6/4/2002
*/
#include <ssdef.h>
#include <rms.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int i, status, sys$parse(), sys$enter();
    char expanded_name[256], resultant_name[256];
    struct FAB fab;
    struct NAM nam;

    if (argc < 5) {
        printf ("Usage $ %s <fid-num> <fid-seq> <fid-rvn> <filename>\n", argv[0]);
        return 268435456;
    } else {

        /* FAB/NAM pair describing the name to be entered */
        fab = cc$rms_fab;
        fab.fab$l_fop = FAB$M_NAM;
        fab.fab$l_dna = ".DAT";
        fab.fab$b_dns = strlen(fab.fab$l_dna);
        fab.fab$l_fna = argv[4];
        fab.fab$b_fns = strlen(argv[4]);
        fab.fab$l_nam = &nam;

        nam = cc$rms_nam;
        nam.nam$b_nop = NAM$M_NOCONCEAL;
        nam.nam$l_rsa = resultant_name;
        nam.nam$b_rss = 255;
        nam.nam$l_esa = expanded_name;
        nam.nam$b_ess = 255;

        /* Parse the name, poke the requested file ID into the NAM block,
        ** then ask RMS to create just the directory entry for it.
        */
        status = sys$parse( &fab );
        if (status & 1) {
            i = atoi (argv[1]);
            nam.nam$w_fid_num = (short) i;
            nam.nam$b_fid_nmx = (unsigned char) (i >> 16);
            nam.nam$w_fid_seq = (short) atoi ( argv[2] );
            nam.nam$w_fid_rvn = (short) atoi ( argv[3] );
            status = sys$enter ( &fab );
        }
        return status;
    }
}
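To build and use it, roughly (the foreign command definition and the dummy file ID are just examples):

$ CC ENTER
$ LINK ENTER
$ ENTER :== $SYS$DISK:[]ENTER.EXE
$ ENTER 1 1 0 AA.X     ! enter AA.X pointing at dummy file ID (1,1,0)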
H.Becker
Honored Contributor

Re: Overheads of large .DIR files?

>>>
(a) the idea from others that adding to the end of a directory is most efficient, but when the current space (presumably allocated rather than used) is full, a larger number of contiguous blocks must be found and the original .DIR copied into them.
<<<
I wouldn't say "most", you should be aware of ...

Sorry, ITRC's "Retain format (spacing)" option is not what I think it should be; you have to look at the attached text file.
H.Becker
Honored Contributor

Re: Overheads of large .DIR files?

Forget my previous posting. At the moment I can't tell you why I ended up with partially filled blocks in the directory file. At least I can't reproduce it at will.

Sorry for the confusion.
Hoff
Honored Contributor

Re: Overheads of large .DIR files?

But more on topic here, have you looked at what's going on with these 18,000 log files a day, and what you can do to reduce or compress or otherwise post-process this stuff?

Whether it's a scanner looking for errors, or using zip to create a single daily file, or a DECset DEC Test Manager DIFFERENCES-based comparison with previous (template) logs, or a customized print symbiont that stuffs these logs into a zip archive as they arrive, or stuffing all this into a key-value database, or something else entirely, 18,000 separate log files a day (much less a quarter-million logs) just isn't a manageable quantity.

It'd be interesting to know how often folks actually ever look at any of these logs and why and how, and working backwards from there.

And Hartmut, it's more than the reformatting. The whole of the ITRC interface itself is not what it could or should be. The StackOverflow implementation is vastly superior, for instance.

John McL
Trusted Contributor

Re: Overheads of large .DIR files?

Hoff,

The "business" side of this business wants the log files. They would in fact like up 30 days worth to be available in case someone needs to check back through them for something. The agreed position is 7 days worth available individually and immediately, and 4 weeks worth in daily zips than can be unzipped if required.

You seem to forget that we VMS users are sometimes constrained by business requirements and what's already established practice on site. And getting established practices changed is rarely easy, as others posting to this VMS forum have indicated.

We work within whatever constraints are imposed on us. When you fail to understand this, your comments are often of limited immediate value.
Hoff
Honored Contributor

Re: Overheads of large .DIR files?

Um, forget? No. Many folks and many users are good at incremental changes, and unfortunately less often expect, visualize or propose larger changes. It's one of the pitfalls of asking for user feedback.

Nor have I forgotten how it's possible for an organization or an individual to talk itself out of fixing bugs, and talk itself into accepting applications that are inadequate or ill-performing or ill-suited or, well, crufty and cranky.

And irrespective of the organization, your designs and your solutions and your advances are only as constrained as _you_ allow yourself to be.

Come up with a better way to solve this. Something which doesn't involve storing and managing a directory containing 18,000 files a day. Whether this is solved through incremental changes or more significant work, this design is not scaling, and it's only going to get more problematic for you.
Robert Gezelter
Honored Contributor

Re: Overheads of large .DIR files?

John,

I am somewhat surprised that only four weeks are being retained. I have seen quite justified legal requirements for far longer retention periods on archived logs, particularly in my work with e-Discovery issues.

I would actually suggest using the Julian date as the subdirectory name, with a possible logical name for the actual weekday as a shorthand, roughly as sketched below.
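For instance (a sketch only; the device, directory and date are illustrative):

$ CREATE/DIRECTORY DKA100:[HTTP_LOGS.D2011159]
$ DEFINE/SYSTEM/EXEC HTTP_LOGDIR_TUESDAY DKA100:[HTTP_LOGS.D2011159]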

The details are not all obvious, but from my experience with similar problems I would expect that much of what I described could be implemented without the need to touch actual code. Care would be required, however, to prevent naming issues from causing confusion at a later date.

- Bob Gezelter, http://www.rlgsc.com