Operating System - OpenVMS

What is the impact of accessing an extra file

 
SOLVED
Jan van den Ende
Honored Contributor

What is the impact of accessing an extra file

Does anybody have an idea about the impact of accessing an extra file vs. GOSUB-ing?

I am to assist in converting a free-text database (Basis+).
It has a query language that functions more or less like SQL (but it IS quite another beast!)

The problem: the trial runs for the pilot extrapolate to MONTHS of exporting! (I'm told that DUMPing and converting the output cannot generate data in a usable format because that loses important connections; for lack of info I just have to accept that.)
I have been looking into the export code (.PRC files, comparable to .SQL files), and PER processed "record" there are anywhere from 10 to over 100 activations of other .PRC files (all in all 20 different ones).

If the number of records goes over 100,000 for the pilot, I guess the activations might become relevant (??).
Does anybody have an educated guess whether I should tell them to consolidate the .PRC code into a single chunk, converting the separate-file activations to GOSUBs, or will there be very little gain because caching takes care of it?
Formulated another way: is the overhead of switching to code read from a separate (in-cache) file and returning comparable to GOSUB & RETURN, or is it significantly higher?

tia,

Jan
Don't rust yours pelled jacker to fine doll missed aches.
17 REPLIES
Hein van den Heuvel
Honored Contributor

Re: What is the impact of accessing an extra file


I think you are right. 10 - 100 file opens per 'record', even from cache, would take excessive time. Probably it is just CPU time, but maybe there is also a deaccess file-header update on close? I don't know anything about Basis+, but like you said, maybe they can 'gosub' within a script?
Maybe you can write a (Perl) script to transform the multiple nested scripts into a single inline one? But really what you need is a script that loops over the records, staying active. Does Basis have a concept of 'stored procedures', perhaps?

Here is a silly thought... how 'deep down' are those included .PRC files? RMS will have a directory cache with names and (DID) numbers to drill down quickly, and the XQP will in all likelihood have a filename-to-FID cache for the final lookup step (I never really studied how that works), but it would seem to me that the deeper the files sit (nested directories, search lists, size of the leaf directory), the longer the lookup takes.
For the purpose of a long-running convert, I might be tempted to experiment with moving those 10 - 100 frequently opened and closed files into an explicit $x$nnnn:[000000] (on a RAM disk?) location with no logical names or search list involved. At least it would be an interesting experiment! Let us know!
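
For example, something along these lines (the device, directory, and source names below are just made-up placeholders, untested):

$! Copy the frequently activated .PRC files to a flat, top-level directory
$ CREATE/DIRECTORY DKA100:[PRCHOT]
$ COPY BASIS_PROCS:*.PRC DKA100:[PRCHOT]
$! ...then have the export reference DKA100:[PRCHOT]name.PRC directly,
$! with no logical name or search list in the path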

Greetings,
Hein.
Hein van den Heuvel
Honored Contributor

Re: What is the impact of accessing an extra file


I just did a quick (& incomplete) test opening, closing, and reading a file down in a 6th-level subdirectory versus the same file in the root directory, and found no significant difference. With DCL it is less than 10%; with Perl, no measurable difference.
With DCL I see kernel/exec/super/user = 57/23/19/0.
With Perl that is 62/20/0/18 for the master file directory and 64/20/0/16 for the deep one, suggesting the master location is a little easier on the system: more user-mode work gets done.

Silly scripts below.
Hein.

--- perl ---

$f = "[.tmp.tmp.tmp.tmp.tmp.tmp]tmp-sub.com";
$f = "_dra0:[000000]a.com";
($sec,$min,$hour) = localtime(time);
$t -= $sec + 60 * ( $minute + 60 * $hour );
while ($i++ < 10000) {
open (X,"<$f") || die "($i) Failed to open $f";
while (){};
close (X);
}
($sec,$min,$hour) = localtime(time);
$t += $sec + 60 * ( $minute + 60 * $hour );
print "Elapsed time = $t seconds.\n"


---- dcl ----
$
$! Save the starting counters and time
$!
$ old_cputim = F$GETJPI ("","CPUTIM")
$ old_dirio = F$GETJPI ("","DIRIO")
$ old_bufio = F$GETJPI ("","BUFIO")
$ old_time = F$TIME ()
$!
$! Run the test
$!
$i=10000
$loop:
$! Pick one of the two targets; the other stays commented out
$!@dra0:[000000]a.com
$@[.tmp.tmp.tmp.tmp.tmp.tmp]tmp-sub.com
$ i=i-1
$ if i .gt. 0 then goto loop
$!
$! Save the end time
$!
$ end_cputim = F$GETJPI ("","CPUTIM")
$ end_dirio = F$GETJPI ("","DIRIO")
$ end_bufio = F$GETJPI ("","BUFIO")
$ END_TIME = F$TIME()
$
$ old_elapsed = ((F$EXT(12,2,old_time) * 60 + F$EXT(15,2,old_time)) * 60 -
+ F$EXT(18,2,old_time)) * 100 + F$EXT(21,2,old_time)
$ end_elapsed = ((F$EXT(12,2,end_time) * 60 + F$EXT(15,2,end_time)) * 60 -
+ F$EXT(18,2,end_time)) * 100 + F$EXT(21,2,end_time)
$ elapsed = end_elapsed - old_elapsed
$ if elapsed .lt. 0 then elapsed = elapsed + 24*60*60*100
$ WRITE sys$output -
f$fao ( "Dirio=!6UL Bufio=!6UL Cpu=!6UL Elapsed=!6UL ", -
end_dirio - old_dirio, end_bufio - old_bufio, -
end_cputim - old_cputim, elapsed )
$exit
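
If you also want the GOSUB side of Jan's question, the timed loop could be swapped for something like this (an untested sketch, with the same start/end counters around it as above):

$! GOSUB variant: the 'subroutine' stays inside this one procedure
$ i = 10000
$gosub_loop:
$ gosub do_nothing
$ i = i - 1
$ if i .gt. 0 then goto gosub_loop
$ goto done
$do_nothing:
$ return
$done: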


Uwe Zessin
Honored Contributor

Re: What is the impact of accessing an extra file

Hein,
you did run with XFC enabled, didn't you?
Did you flush any caches between tests?
.
Hein van den Heuvel
Honored Contributor

Re: What is the impact of accessing an extra file

Uwe, like I indicated, this was not a serious test. Serious benchmarks are work!

It was actually an old AlphaServer 1000 4/266 running 7.1, with VIOC set to the default 3 MB. No cache flushes, all files small and surely in memory. Uncontrolled, but minimal, other activity on the box.

You gave me an opportunity to add a line I forgot... the test itself did nothing (the DCL sub just did an EXIT, Perl just read that exit line) and it showed only minimal (CPU time) differences. If the core code did actual work, the file-activity change would look even smaller, relatively speaking.

I was just curious to know whether it would show a difference at all! If Jan is truly interested in pursuing this path, it gives him an initial test with which to set a baseline/expectation. But it sounds like he does not need a little tweak/speedup; he needs a different approach, and is struggling with limited Basis+ skills.

Cheers,
Hein.
Jan van den Ende
Honored Contributor

Re: What is the impact of accessing an extra file

Hein,

yes, my Basis+ skills are not really in daily practice. :-(
Your measurements do suggest, though, that the overhead of the many activations is not that big, and would not really justify the extra work of rewriting the procedure complex.

At our VMS SIG meeting this evening I did have some discussion about it with Willem Grooters, and we arrived (without the measurements, but with more background info) at about the same conclusion.

Some more background:
the database is about 4 GB in about 20 files (1 per table, except the biggest table, which is split over 2 files because Basis+ hard-codes a 2 GB file-size limit), and about 4 GB of index files, also 1 file per table.
System: Alpha 4100; 512 MB; 2 SCSI Ultrawide buses with 1500 rpm disks;
data files on one disk, index files on another, program files on a third, and output files on a fourth.
The machine is temporarily allocated to this project, the only system load being this job plus some sysmgt people looking at it.
CPU load < 10%.
I/O load 50-60 (and we have seen this machine at 500 sustained, on completely unrelated work).
Average disk I/O queue length on the busiest disk < 0.1.
Willem and I concluded that the system is mainly waiting because of I/O latency.
If we do the full export (not scheduled for the pilot, but it will have to be done eventually) there are 5 (or 6? I can't check now from home) separate jobs. They could probably run in parallel with hardly any interference.
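
Kicking them off would be no more than a handful of SUBMITs, roughly like this (queue and procedure names invented):

$! One batch job per independent export stream
$ SUBMIT /QUEUE=SYS$BATCH /LOG_FILE=EXPORT_1.LOG EXPORT_STREAM_1.COM
$ SUBMIT /QUEUE=SYS$BATCH /LOG_FILE=EXPORT_2.LOG EXPORT_STREAM_2.COM
$ SUBMIT /QUEUE=SYS$BATCH /LOG_FILE=EXPORT_3.LOG EXPORT_STREAM_3.COM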

We even (conceptually) have a 'solution':
get a system with enough memory to hold the database + index (maybe as a RAM disk?) + the DBMS working sets in memory.
I do however have an idea about the budget holder's reaction...
On the other hand, on a loan basis, for a prestige project...?

I'll keep you posted, but results may take some time.

Jan
Don't rust yours pelled jacker to fine doll missed aches.
Hein van den Heuvel
Honored Contributor

Re: What is the impact of accessing an extra file


The CPU/IO usage observations are valuable inputs for making progress. Like you said, you may be able to brute-force the issue by simply having multiple parallel streams. Always a good idea! Your first order of business should be to check whether the DB (Basis) has any form of async IO settings/slaves.

50 - 60 IOs/sec suggests synchronous IO, non-write-back-cached writes, or mostly random reads. You need to figure this out. What is the read/write mix?
Those disks, are they direct connect or behind a semi-intelligent controller?

If the problem is reads, then you need to figure out whether you can make a single read do more work (larger IO buffers), whether you can bring more order into the reads (pre-sort?), or whether you can have some sort of read-ahead (some controllers will do this).
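
If the reads go through plain RMS sequential access (an assumption; Basis+ may well do its own QIOs), larger multiblock/multibuffer defaults are a cheap thing to try before the run:

$! Larger multiblock and multibuffer counts for sequential access, this process only
$ SET RMS_DEFAULT /SEQUENTIAL /BLOCK_COUNT=127 /BUFFER_COUNT=4
$ SHOW RMS_DEFAULT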

If the problem is writes, then a good, relatively cheap solution might be to get an 'SWXCR/KZPSC' or some such, put the disks behind it, and enable write-back caching!

Heck, for the purpose of a (restartable) conversion it may even be acceptable to enable the SCSI disks' internal write-back caching, albeit unprotected! Just flip the bits on the 'control pages' with a tool like RZDISK or SCU!

If it is random reads that are killing you, and you can recognize smaller datasets, then indeed with a large (4 GB or more) machine you may be able to suck them into a memory disk (in minutes), do the convert, stop the DB, move the next chunk into memory, restart the DB, and convert the next chunk. Like you say, get someone to make an 8 GB ES40 or ES45 available for the good cause for a while, huh?!

Good luck!
Hein.
Martin P.J. Zinser
Honored Contributor

Re: What is the impact of accessing an extra file

Hello Jan,

since I/O seems to be the bottleneck, one other avenue you might think about is throwing more disks at the problem. Writing does not seem to be the problem, so I would have a look at the data files. Also, since these seem to be RMS indexed files, have they been maintained before you started this project? As Hein will be able to tell you, this can make a huuuuuge difference ;-)
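
If they are, the usual checkup-and-rebuild sequence would be something like this (file names invented):

$! See how far the file has degraded (bucket splits, fragmentation)
$ ANALYZE/RMS_FILE/FDL/OUTPUT=BIGTABLE.FDL BIGTABLE.DAT
$! Optimize the FDL from the analysis, then rebuild the file with it
$ EDIT/FDL/ANALYSIS=BIGTABLE.FDL/NOINTERACTIVE BIGTABLE.FDL
$ CONVERT/FDL=BIGTABLE.FDL/STATISTICS BIGTABLE.DAT BIGTABLE_NEW.DAT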

Greetings, Martin
Willem Grooters
Honored Contributor
Solution

Re: What is the impact of accessing an extra file

To explain the whole thing a bit from the programmer's perspective ;-)

What I understood from Jan is that the program reads a record, analyzes it, and then decides what to read next. That record, in turn, is processed the same way. This way the whole 'tree' is traversed - branch by branch, leaf by leaf. Pretty much recursive, formally speaking. And if this is done sequentially, count the seconds...

I have already suggested splitting up the extraction process into several parallel streams (as suggested in this thread already), or using a completely different way of processing (distributed); both would require programming. (OT: Jan, you know where to find me.)

Something just came to mind because of this: check your (virtual) memory. I don't think this is very problematic, but given the way the program works, there could well be a problem in that area as well.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Jan van den Ende
Honored Contributor

Re: What is the impact of accessing an extra file

Hein, Uwe:

Willem's description of the control flow is essentially correct:
read a record, evaluate field(s) for entries in other table(s), read those. Like Willem wrote: travel every branch and every leaf, and then move to the next 'tree'.
Travelling one tree takes 2 - 3 minutes; about 100 000 trees in the pilot 'forest', > 4 000 000 in the whole 'country'.

Martin:
NO RMS-indexed files.
Basis+ essentially targets various Unix flavors, as well as VMS.
All files are sequential, fixed-length 2048.
ANY structure is defined & handled by the db management software.

btw, an omission in my previous stats list:
SHO MEM/CACH:
63 % read hit
1 % write hits
read count > 200 * write count

database journalling disabled

Hein:
disks directly connected.
the way the database was designed (way back when), all database files are located in one directory (one logical name), same for the index files.
-- come to think of it, maybe with a search list they can be spread over multiple disks?
Anyway, by the nature of the algorithm I don't think there is much gain to be expected.
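
(The search list itself would be trivial to set up, for example like this with invented disk and directory names; the files would still have to be distributed over those directories by hand:)

$! One logical name resolving to several directories, searched in order
$ DEFINE BASIS_DATA DISK$DATA1:[BASIS.DATA], DISK$DATA2:[BASIS.DATA]
$ SHOW LOGICAL BASIS_DATA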

btw, I did not yet explain the (really sad) reason for all this:

IF things are successful, this will end with another major VMS application being replaced by a *NIX db / Billyware RSI-generating frontend application.
And, judging by the expense (money and effort) already invested, the 'political prestige' involved, and previous experiences, whatever becomes available will probably be DEFINED to be successful. In the past it has already occurred that (to abuse an Asterix formulation): the opinions were divided. Management itself called it a success; the rest of the community did not think so.
Anyway, we still try to be as helpful as possible, or in the end WE might get blamed for not being able to deliver the necessary starting info, and so causing delay/failure/inconsistent data/any other blame you can think of.

Do I sound sad?
Maybe all this is giving me a sad mood, but hey, it's not the end of the world!

I'll keep you informed.
Don't rust yours pelled jacker to fine doll missed aches.