Operating System - OpenVMS

Obtaining the key of a record about half-way thru a file

 
HDS
Frequent Advisor

Obtaining the key of a record about half-way thru a file

Hello.

This may seem a bit unusual, but I would guess that there have been many such items posted. I have an indexed file, two keys, each segmented, with anywhere from [say] 10K records to 30M records. RMS data record compression is enabled.

Is there a way to identify the key of the record that sits at or about the halfway point of the file?

I can guesstimate the approx record length with the compression and the overhead for the keys and can, by the EOF block, come within 10% of the number of records. I am looking for the key of the record that sits at 50% of that record number. For example, if I could open an indexed file as a direct-access sequential file, I could access by record number. I just can't seem to be able to do that (using Fortran).

Any ideas?

Much obliged, in advance.

-Howard-
14 REPLIES
Hein van den Heuvel
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

Unless your file has recently been converted, or was always loaded in ever-increasing primary key order while fully pre-allocated, you can NOT use the block numbers.

IF (big IF) the records are spread over the primary key range, and somewhat predictably so, then your best bet is a binary search or an outright GET KGT "max-min/2".

Do you need this mid-way point once or repeatedly?
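For the repeated case, the "probe the midpoint of the key range" idea can be sketched like this (Python for illustration only; `get_ge` is a hypothetical stand-in for a keyed GET with key-greater-or-equal, such as a Fortran READ with KEYGE, and the sorted list stands in for the indexed file):

```python
import bisect

def split_key(min_key, max_key, get_ge):
    """Return the key of the record nearest the middle of the key range.

    Only meaningful if the records are spread somewhat evenly over the
    primary key range, as noted above.
    """
    probe = (min_key + max_key) // 2
    return get_ge(probe)          # keyed GET, key greater-or-equal

# Toy stand-in for the indexed file: a sorted list of integer keys.
keys = [3, 8, 15, 42, 99, 120, 200]

def get_ge(k):
    # First record whose key is >= k
    return keys[bisect.bisect_left(keys, k)]

mid = split_key(keys[0], keys[-1], get_ge)   # probes key 101, lands on 120
```

If the keys cluster, the probe lands far from the true median; that is exactly the "big IF" above.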

For just once I would use ANAL/RMS/INT
DOWN
DOWN
DOWN KEY
DOWN INDEX
DOWN (rootbucket header)
By carefully reading the bucket header data you can find how many entries there are in the root bucket (bucket size minus VBN Free Space Offset, divided by Bucket Pointer Size).
DOWN (first lower-level index pointer)
NEXT (entries/2)

Now... with index compression it'll be tricky to decode the key value.
So just do a
DOWN (middle index bucket header)
DOWN (first key value... not compressed!)


Instead of a program, you could use DCL:

Create a file KEY_0.FDL like
FILE; ORG SEQ
RECORD; FORMAT FIXED; SIZE xxx

That xxx would be the TOTAL KEY SIZE for the primary key.

Now
$CONV/TRUN/PAD/STAT/FDL=KEY_0 KEY_0.SEQ
$
$OPEN/READ keys KEY_0.SEQ
$middle="1234"
$middle[0,32] =
$READ/KEY=&middle keys middle_key
$SHOW SYMB middle_key
$CLOSE keys

The exact needs and feeds would of course define the optimal solution.

It may involve walking the primary key index at level-1, through a special program bypassing RMS.

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting


Hein van den Heuvel
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

Ouch, bad spelling in my first reply.
The biggest problem with the "extract the keys and divide that file in half" approach is the suggestion to use the TKS, which assumes the primary key starts at byte 0. It does for 9 out of 10 indexed files, but it does not have to!

It was indicated the key is segmented.
So the size you really need is the last byte of the segment with the highest start position.

This may be a good chunk of the record.
In that case, just use the whole record?!


Please tell us which problem you are trying to solve, and why?
How often?
What are the data volumes involved?
Partitioning an overly large file?
Purging of old data?

Would it not be better to have a predictable, repeatable algorithm?
Maybe something like all records before 1-1-2007, or customer 1,000,000 thru 2,000,000,
or East and West.


Hein.


Dean McGorrill
Valued Contributor

Re: Obtaining the key of a record about half-way thru a file

hi Howard,
I suppose you could convert the indexed file to what you want and let your Fortran at it. Hein's is an interesting approach. Curious as to what you are trying to do overall? -Dean
HDS
Frequent Advisor

Re: Obtaining the key of a record about half-way thru a file

Hello.

In brief, I have an indexed ledger with the primary key consisting of the first 20 bytes of the 940 byte record, segmented into a 14 byte account and 6 byte fund.

The quantity of data ranges from 10,000 records up to 30 million or more. We wish to perform some processing driven by this file. In doing so, the larger versions of this file are taking a substantial amount of time to process. (It is compute-time processing that is the issue here, not the IO or RMS performance... I am sure.) One way to divide the processing amongst multiple 'threads' (not using DECthreads... long story) is to have one process handle the first half of the file, and another process handle the second half... we have 4 processors on the box.

This is not the only way to do this, but it seemed simple enough as long as we were able to open an indexed file sequentially with a direct access (by record number). However, as we found that this cannot be done (or at least we couldn't figure it out), we figured that we'd try alternatives.

Hope that this sheds some light on the task.

Many thanks,
-H-
Dean McGorrill
Valued Contributor

Re: Obtaining the key of a record about half-way thru a file

ok I see,
well if the key is alphabetical, you could do something like this (e.g. in DCL):

$ open/read xx copysysuaf.dat
$ read/key="M" xx x
$ sho sym x
X = "....MCGORRILL
$ read xx x
$ sho sym x
X = "....MCPHERSON

that would put you in the middle of the alphabet; keep reading until EOF. That would presume an even distribution across the alphabet. An idea - Dean
Hein van den Heuvel
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

Hmm, so the key is not really segmented as far as RMS is concerned, only from an application perspective.

Anyway... I would strongly suggest you consider letting history teach you the right breakdown.
You personally probably already know 'the bad boys' from the easy ones.
Now make sure the application knows!
It is not likely to totally change overnight, is it? (And even if it did, no harm done.)

What you can do TODAY, without changing the application's main algorithm, is create a lookaside list containing some main volume indicators while processing.
By company, by customer, by fund; I would not know without knowing more about the file itself.
Let's say you can create a list of customer numbers and records processed. Now just sort those by number of records. For the processing, let the streams pick work elements from the sorted list from big to small. So the first one kicked off will be the biggest one. The next stream picks the next. As a stream is done, it picks more, and smaller, items until all done.
For the sake of locality of reference, you may want to tweak this by grouping into large, medium, and small units, and keep customers sorted within those ranges.
Any run will produce the sort order for the next run. SMOP! Easy!
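The "sort big to small, idle stream picks the next item" scheme described above is greedy longest-first scheduling. A minimal sketch (Python for illustration; the unit names and counts are made up, standing in for the lookaside list a previous run would have produced):

```python
import heapq

def assign(work, n_streams):
    """work: dict of unit -> record count. Returns the unit list per stream.

    Units are handed out biggest-first; each unit goes to the stream with
    the least total work so far, which simulates "the idle stream picks
    the next element from the sorted list".
    """
    streams = [(0, i, []) for i in range(n_streams)]   # (total, id, units)
    heapq.heapify(streams)
    for unit, count in sorted(work.items(), key=lambda kv: -kv[1]):
        total, i, units = heapq.heappop(streams)       # least-loaded stream
        units.append(unit)
        heapq.heappush(streams, (total + count, i, units))
    return [units for _, _, units in sorted(streams, key=lambda s: s[1])]
```

With counts {'A': 100, 'B': 60, 'C': 50, 'D': 10} and two streams, both streams end up with 110 records' worth of work.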

Detailed help, or suggestions on how to truly break up the file into similar-sized ranges, would seem to go beyond the scope of a quick hint in a public forum. Email me if need be.

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting
Hoff
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

This looks to be Amdahl's law applied. Just get stuff going in hunks in parallel, and you'll win...

Run a coordinating process that creates "n" mailboxes and "n" processes, and each of these created processes connects back into its mailbox and says "hi!" to the server. The server then tosses a starting record and a range of records, or a "done!" message that tells the created process to clean up and exit.

For the first process that queues its "hi", the coordinating process gives it record 1 and the first, say, 1000 records. The second gets record 1001 and the next 1000, etc. Each process does a keyed get on the starting record (key equal or greater, assuming the key is not issued sequentially), and stops when it gets to a record above its specified upper limit.

Use a termination mailbox to catch run-time errors, should a created process tip over and exit unexpectedly.

Why get fancier than you need to be here splitting up the work, when you can brute-force the parallelism and do nearly as well? And when the code itself can adapt to the file and its contents. Tailor your 1000-record hunk to your run-time for the clients, so that the coordination overhead (which will be minimal with the mailboxes) doesn't win out over the run-time.
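The coordinator/worker scheme above can be sketched as follows (Python for illustration; a thread-safe queue stands in for the mailboxes, and appending to a list stands in for the per-hunk keyed-GET loop):

```python
import queue
import threading

def make_hunks(total_records, hunk=1000):
    """Break the record range into (start, count) hunks of fixed size."""
    return [(start, min(hunk, total_records - start))
            for start in range(0, total_records, hunk)]

def run(total_records, n_workers=4, hunk=1000):
    work = queue.Queue()
    for h in make_hunks(total_records, hunk):
        work.put(h)
    for _ in range(n_workers):
        work.put(None)                    # "done!" marker, one per worker

    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:
                return                    # clean up and exit
            with lock:
                processed.append(item)    # stand-in for the keyed-GET loop

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(processed)
```

For 2500 records in 1000-record hunks, two workers between them process (0, 1000), (1000, 1000), and (2000, 500), in whatever order they grab them.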

Stephen Hoffman
HoffmanLabs LLC
Hein van den Heuvel
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

>> This looks to be Amdahl's law applied. Just get stuff going in hunks in parallel, and you'll win...

Most likely. But a single process/thread does not need locking. For some applications which do little per-record processing themselves, the cost of locking may be prohibitive. For others the locking overhead is minor. The suggestion earlier was that the per-record processing is significant, so parallelizing will likely help.

> Run a coordinating process that creates "n" mailboxes and "n" processes, and each of these created processes connects back into its mailbox and says "hi!" to the server.

It could be even simpler. Literally $TYPE or CONVERT a task-list file into a mailbox. Each idle server grabs a message, processes the chunk of business data associated with it, and looks for the next task. Exit on EOF.

>> For the first process that queues its "hi", the coordinating process gives it record 1 and the first, say, 1000 records.

Ya but, the suggestion there is that it is hard to recognize the size of ranges from the primary key.
Let's say the primary key is STATE + social security number within state. The first stream could start with AK, the next AL, AR, AZ... Fortunately CA comes early in the alphabet, but Texas probably hits hard downstream, having about as many folks (23M) as all the states that follow it together. So you probably want to split Texas, but what is the right place for that?!

To divvy it up nicely, you would have to count 'em, which may be only 1% of the processing cost, but could be 50% of the total cost... on a poorly organized file.
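The "count 'em, then cut" step can be made concrete: given record counts per key prefix (the state counts below are made-up illustrative numbers, not census data), balanced cut keys fall out of a running total:

```python
def split_points(counts, n_streams):
    """counts: ordered (key_prefix, record_count) pairs.

    Returns the key prefixes at which to cut so each stream gets roughly
    total/n_streams records. Each returned key is the LAST prefix in its
    stream's range.
    """
    total = sum(c for _, c in counts)
    target = total / n_streams
    cuts, running, next_cut = [], 0, target
    for key, count in counts:
        running += count
        if running >= next_cut and len(cuts) < n_streams - 1:
            cuts.append(key)
            next_cut += target
    return cuts
```

Note this cannot split inside one prefix: if one state (Texas...) holds half the records, you must go a level deeper into the key to balance the streams, which is exactly the problem raised above.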

>> Why get fancier than you need to be here splitting up the work, when you can brute-force the parallelism and do nearly as well.

Because of contention and caching-effectiveness reasons? You want the data being concurrently processed close, but not too close.

>> And when the code itself can adapt to the file and its contents.

And that was the core of my suggestion.
Establishing a pattern while processing is probably zero overhead. Use that as a good guess for the next run.

Cheers, (almost Miller time here)
Hein.

http://factfinder.census.gov/servlet/GCTTable?_bm=y&-geo_id=01000US&-_box_head_nbr=GCT-T1&-ds_name=PEP_2006_EST&-_lang=en&-format=US-9&-_sse=on
Hoff
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

>>>Ya but, the suggestion there is that it hard to recognize te size of ranges from the primary key.<<<

Yes. Though isn't the goal here to keep the processors in the box busy? You might not end up with all the client processes exiting at nearly the same time, but again, is that so critical as long as the resulting run-time is less than the monolithic run-time?

Who cares if you split (following your example) Texas into one hunk, two hunks, or a hundred smaller hunks?

And over time (days and weeks), you can have the server coordinating the operation develop a better idea of the patterns (either heuristically or based on explicit input from the user), building up run profiles if you have particularly disparate processing runs. During one aggregate run (over minutes or hours), you can tell when and how fast the servers are finishing and can infer how dense the fill might be. From that, you can guess at the spans for successive sections as the client processes run.

HDS
Frequent Advisor

Re: Obtaining the key of a record about half-way thru a file

Good morning.

Hoping that all had a pleasant weekend.
I wish to thank all of those who replied. I am very appreciative of the wonderful suggestions and rather detailed responses. I am currently reviewing all and, after gathering my thoughts, will reply accordingly.

Much obliged.
-Howard-
Dean McGorrill
Valued Contributor

Re: Obtaining the key of a record about half-way thru a file

hi Howard,
curious as to what you come up with; maybe give a few points to these fine gentlemen. Interesting problem anyway

>Cheers, (almost Miller time here)
maybe it should be Hein-ekin time Hein :)
HDS
Frequent Advisor

Re: Obtaining the key of a record about half-way thru a file

Good morning.

I wish to graciously thank all of those who responded. I received some rather informative responses. I ended up using parts of some... and will likely use parts of others for other situations. This was a learning experience, to say the least.

In any case, here is what was decided for this specific case.

- Using either a $CONVERT or a $SORT/SPEC, create a file consisting of 20-byte records, where each record is the key of a record in the original large file. (I say "either" because I find that there are times a $SORT can perform twice as fast as a $CONVERT... I am not sure why, and I am not sure if this is one of those cases.) For the most part, I should be able to get through 14M records in about 10 minutes creating that flat 20-byte-record file.
- Using either the /STAT from the $SORT/$CONVERT, or rough math (taking the EOF block count of the resulting file, multiplying by 512, and dividing by 20), I can get a record count.
- Opening the flat file of keys as sequential with direct access, I grab the 20-byte record which identifies the key of the record halfway through the original file.
- Using that as the cut-off in the original file, one processing thread will read from the start of file to the record that has the identified key. The other processing thread will use that 'halfway key' to do a greater-than keyed read and then read the original file to EOF.
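The "seek to the middle of the flat key file" step above can be sketched like this (Python for illustration; the helper name is made up, and only the 20-byte key size and the direct-access-by-record-number idea come from the post):

```python
import os

KEY_SIZE = 20   # total primary key size from the post

def midway_key(path):
    """Return the key that sits halfway through a flat file of
    fixed-length KEY_SIZE-byte key records."""
    n_records = os.path.getsize(path) // KEY_SIZE
    with open(path, 'rb') as f:
        f.seek((n_records // 2) * KEY_SIZE)   # direct access by record number
        return f.read(KEY_SIZE)
```

The same arithmetic generalizes to any number of cut points (seek to n_records * i // n_streams for each stream i), should two threads ever become four.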

So far so good with this approach. The additional 10 minutes to get the midway point is easily made up by the concurrent processing threads, so we are benefiting. I might end up tweaking this as I go along... time will tell.

Again, I wish to thank all those who responded. Much obliged.

-H-
Hein van den Heuvel
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file


Hmmm, a straight CONVERT to a fixed-length sequential file with /TRUNCATE should be faster than a sort.

Anyway...

Over the weekend I managed to combine some half programs I had, to do what I described in an earlier reply: Use the RMS Index tree as a way to approximate chunks of a file.

I slightly over-engineered it. :-) :-)

You can tell it at which level to do the cut.
Level 1 being the most precise and slowest, but still a magnitude faster than reading the whole file.
And you can tell it how many cuts to take. And whether to return the key values in DCL symbols:

Usage example:

$ rms_key_samples -l=2 -s -n=4 x.x
0/4 vbn:202 key:%x00000104
1/4 vbn:27511 key:%x000264E2
2/4 vbn:54997 key:%x00028BCD
3/4 vbn:79687 key:%x0002AED6
$ show symb rms*
RMS_KEY_SAMPLE_0 = "%x00000104"
RMS_KEY_SAMPLE_1 = "%x000264E2"
RMS_KEY_SAMPLE_2 = "%x00028BCD"
RMS_KEY_SAMPLE_3 = "%x0002AED6"

Verbose, level-1, with debug print out:

$ rms_key_samples -d -n=4 x.x
* 25-JUN-2007 00:21:03.95 ALQ=102945 BKS=3 LVL=3 x.x
Level 3, First VBN Pointer = 784
Level 2, First VBN Pointer = 202
re-pack 2. record_count=2000 vbn=6076
re-pack 4. record_count=3999 vbn=12133
re-pack 8. record_count=7997 vbn=24235
re-pack 16. record_count=15993 vbn=48460
re-pack 32. record_count=31985 vbn=96937
Level 1 buckets = 286, Records = 33962
0/4 vbn:4 key:%x00000000
1/4 vbn:25711 key:%x000262E6
2/4 vbn:51484 key:%x0002876C
3/4 vbn:77179 key:%x0002ABD9

Tested with binary keys (above), and with string keys, compressed and uncompressed.
Not all combos tested though...

Still thinking about whether to drop the '0' key output, as it might be confusing, notably if the level is not 1.
A straight sequential read gets you to the first chunk.

Give it a whirl!

Send me an Email if you like it.

Hope this helps someone someday...
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting
HDS
Frequent Advisor

Re: Obtaining the key of a record about half-way thru a file

Good morning.

Hein...Thank you so very much :)

I will give this a try. I might not be able to get to it until later in the week (I got hit with some priorities this weekend).

I will most certainly get back to you as soon as possible.

Again...many thanks.

-Howard-