HDS
Frequent Advisor

Obtaining the key of a record about half-way thru a file

Hello.

This may seem a bit unusual, but I would guess that there have been many such items posted. I have an indexed file, two keys, each segmented, with anywhere from [say] 10K records to 30M records. RMS data record compression is enabled.

Is there a way to identify the key of the record that sits at or about the halfway point of the file?

I can guesstimate the approx record length with the compression and the overhead for the keys and can, by the EOF block, come within 10% of the number of records. I am looking for the key of the record that sits at 50% of that record number. For example, if I could open an indexed file as a direct-access sequential file, I could access by record number. I just can't seem to be able to do that (using Fortran).

Any ideas?

Much obliged, in advance.

-Howard-
Hein van den Heuvel
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

Unless your file has recently been converted, or was always loaded in ever-increasing primary key order while fully pre-allocated, you can NOT use the block numbers.

IF (big IF) the records are spread over the primary key range, and somewhat predictably so, then your best bet is a binary search, or an outright GET with a key-greater-than (KGT) match at "(max - min) / 2".
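For a one-shot probe from a program, that kind of keyed GET might look like the minimal Fortran sketch below; the file name LEDGER.IDX, the 20-byte key, and the probe value 'M' are illustrative assumptions, not taken from your file:

      PROGRAM PROBEMID
C     Minimal sketch: probe an indexed file with a KEYGE read at an
C     assumed midpoint key value. LEDGER.IDX, the 20-byte primary
C     key, and the probe value 'M' are placeholders.
      CHARACTER*20  MIDKEY
      CHARACTER*940 RECBUF
      INTEGER       IOS
      OPEN (UNIT=1, FILE='LEDGER.IDX', STATUS='OLD', READONLY,
     1      ORGANIZATION='INDEXED', ACCESS='KEYED',
     2      FORM='UNFORMATTED')
      MIDKEY = 'M'
C     Position at the first record whose primary key is >= MIDKEY
      READ (1, KEYGE=MIDKEY, KEYID=0, IOSTAT=IOS) RECBUF
      IF (IOS .EQ. 0) WRITE (*,*) 'Key found: ', RECBUF(1:20)
      CLOSE (1)
      END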

Do you need this mid-way point once or repeatedly?

For just once I would use ANAL/RMS/INT
DOWN
DOWN
DOWN KEY
DOWN INDEX
DOWN (root bucket header)
By carefully reading the bucket header data you can find how many entries there are in the root bucket (the bucket size minus the VBN free space offset, divided by the bucket pointer size).
DOWN (first lower-level index pointer)
NEXT (entries/2)

Now... with index compression it will be tricky to decode the key value.
So just do a
DOWN (middle index bucket header)
DOWN (first key value... not compressed!)


In a program (sketched here in DCL) you could use the following.

Create a file KEY_0.FDL like:
FILE; ORG SEQ
RECORD; FORMAT FIXED; SIZE xxx

That xxx would be the TOTAL KEY SIZE for the primary key.

Now:

$ CONV/TRUN/PAD/STAT/FDL=KEY_0 YOUR_INDEXED_FILE KEY_0.SEQ
$
$ OPEN/READ keys KEY_0.SEQ
$ middle = "1234"                  ! any 4-byte placeholder string
$ middle[0,32] = record_count / 2  ! overlay the binary relative record number
$ READ/KEY=&middle keys middle_key
$ SHOW SYMBOL middle_key
$ CLOSE keys

(Here record_count is the number of keys in KEY_0.SEQ, which the /STAT output of the CONVERT will tell you. The keyed READ works because RMS allows direct access by relative record number on a sequential file with fixed-length records.)

The exact needs and feeds would of course define the optimal solution.

It may involve walking the primary key index at level-1, through a special program bypassing RMS.

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting


Hein van den Heuvel
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

Ouch, bad spelling in my first reply.
The biggest problem with the "extract the keys and divide that file in half" approach is the suggestion of using the TKS (total key size). That assumes the primary key starts at byte 0, which it does for 9 out of 10 indexed files, but it does not have to!

It was indicated that the key is segmented.
So the size needed is really through the last byte of the segment with the highest start position. (For example, with segments at bytes 0-13 and 14-19 the first 20 bytes suffice, but if a 6-byte segment started at byte 100 you would need the first 106 bytes.)

That may be a good chunk of the record.
In that case, just use the whole record?!


Please tell us which problem you are trying to solve, and why?
How often?
What are the data volumes involved?
Partitioning an overly large file?
Purging of old data?

Would it not be better to have a predictable, repeatable algorithm?
Maybe something like all records before 1-1-2007, or customers 1,000,000 thru 2,000,000,
or East and West.


Hein.


Dean McGorrill
Valued Contributor

Re: Obtaining the key of a record about half-way thru a file

hi Howard,
I suppose you could convert the indexed file to what you want and let your Fortran at it. Hein's is an interesting approach. Curious as to what you are trying to do overall? -Dean
HDS
Frequent Advisor

Re: Obtaining the key of a record about half-way thru a file

Hello.

In brief, I have an indexed ledger whose primary key consists of the first 20 bytes of the 940-byte record, segmented into a 14-byte account and a 6-byte fund.

The quantity of data ranges from 10,000 records up to 30 million or more. We wish to perform some processing driven by this file, and the larger versions of it are taking a substantial amount of time to process. (It is the compute time of the processing that is the issue here, not the IO or RMS performance... I am sure.) One way to divide the processing amongst multiple 'threads' (not using DECthreads... long story) is to have one process handle the first half of the file and another process handle the second half... we have 4 processors on the box.

This is not the only way to do this, but it seemed simple enough, as long as we were able to open an indexed file sequentially with direct access (by record number). However, as we found that this cannot be done (or at least we couldn't figure it out), we figured we'd try alternatives.

Hope that this sheds some light on the task.

Many thanks,
-H-
Dean McGorrill
Valued Contributor

Re: Obtaining the key of a record about half-way thru a file

ok I see,
well if the key is alphabetical, you could do something like this (e.g. in DCL):

$ open/read xx copysysuaf.dat
$ read/key="M" xx x
$ sho sym x
  X = "....MCGORRILL"
$ read xx x
$ sho sym x
  X = "....MCPHERSON"

that would put you in the middle of the alphabet; keep reading until EOF. that presumes an even distribution across the alphabet. an idea - Dean
Hein van den Heuvel
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

Hmm, so the key is not really segmented as far as RMS is concerned, only from an application perspective.

Anyway... I would strongly suggest you consider letting history teach you the right breakdown.
You personally probably already know 'the bad boys' from the easy ones.
Now make sure the application knows!
It is not likely to totally change overnight, is it? (And even if it did, no harm done.)

What you can do TODAY, without changing the application's main algorithm, is create a lookaside list containing some main volume indicators while processing.
By company, by customer, by fund; I would not know without knowing more about the file itself.
Let's say you can create a list of customer numbers and records processed. Now just sort those by number of records. For the processing, let the streams pick work elements from the sorted list, from big to small. So the first one kicked off will be the biggest one. The next stream picks the next. As a stream is done, it picks more, and smaller, items until all are done (see the sketch below).
For the sake of locality of reference you may want to tweak this by grouping into large, medium, and small units, keeping customers sorted within those ranges.
Any run will produce the sort order for the next run. SMOP! Easy!
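To make the pick-from-big-to-small idea concrete, here is a small, self-contained Fortran sketch of that greedy scheduling; the item sizes are made-up illustration data, and in real use they would come from the lookaside list of the previous run:

      PROGRAM LPTSKETCH
C     Greedy big-to-small scheduling sketch: work items arrive
C     sorted largest first, and each goes to the currently
C     least-loaded stream, mimicking idle streams picking the next
C     work element. The item sizes are made-up illustration data.
      INTEGER NITEMS, NSTREAMS
      PARAMETER (NITEMS = 6, NSTREAMS = 3)
      INTEGER WORK(NITEMS), LOAD(NSTREAMS), I, J, S
      DATA WORK / 900, 500, 400, 300, 200, 100 /
      DATA LOAD / 3 * 0 /
      DO 20 I = 1, NITEMS
          S = 1
          DO 10 J = 2, NSTREAMS
              IF (LOAD(J) .LT. LOAD(S)) S = J
   10     CONTINUE
          LOAD(S) = LOAD(S) + WORK(I)
          WRITE (*,*) 'Item', I, 'of size', WORK(I), 'to stream', S
   20 CONTINUE
      WRITE (*,*) 'Final stream loads:', LOAD
      END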

Detailed help, or suggestions on how to truly break the file up into similar-sized ranges, would seem to go beyond the scope of a quick hint in a public forum. Email me if need be.

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting
Hoff
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

This looks to be Amdahl's law applied. Just get stuff going in hunks in parallel, and you'll win...

Run a coordinating process that creates "n" mailboxes and "n" processes, and each of these created processes connects back into its mailbox and says "hi!" to the server. The server then tosses a starting record and a range of records, or a "done!" message that tells the created process to clean up and exit.

For the first process that queues its "hi", the coordinating process gives it record 1 and the first, say, 1000 records. The second gets record 1001 and the next 1000, etc. Each process does a keyed get on the starting record (key equal or greater, assuming the key is not issued sequentially), and stops when it gets to a record above its specified upper limit.
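A sketch of one such created process's inner loop, assuming the ranges are handed out as key bounds rather than record numbers (LEDGER.IDX, the 20-byte key position, and the subroutine name are hypothetical):

      SUBROUTINE DORANGE (LOKEY, HIKEY)
C     Sketch of one worker's share: position at the first record
C     with key >= LOKEY, then read sequentially until the key passes
C     HIKEY. LEDGER.IDX and the 20-byte key are assumptions.
      CHARACTER*(*) LOKEY, HIKEY
      CHARACTER*940 RECBUF
      INTEGER IOS
      OPEN (UNIT=2, FILE='LEDGER.IDX', STATUS='OLD', READONLY,
     1      ORGANIZATION='INDEXED', ACCESS='KEYED',
     2      FORM='UNFORMATTED')
      READ (2, KEYGE=LOKEY, KEYID=0, IOSTAT=IOS) RECBUF
      DO WHILE (IOS .EQ. 0 .AND. RECBUF(1:20) .LE. HIKEY)
C         ... per-record processing goes here ...
          READ (2, IOSTAT=IOS) RECBUF
      END DO
      CLOSE (2)
      END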

Use a termination mailbox to catch run-time errors, should a created process tip over and exit unexpectedly.

Why get fancier than you need to be here splitting up the work, when you can brute-force the parallelism and do nearly as well? And when the code itself can adapt to the file and its contents? Tailor your 1000-record hunk to the run-time of the clients, so that the coordination overhead (which will be minimal with the mailboxes) doesn't win out over the run-time.

Stephen Hoffman
HoffmanLabs LLC
Hein van den Heuvel
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

>> This looks to be Amdahl's law applied. Just get stuff going in hunks in parallel, and you'll win...

Most likely. But a single process / thread does not need locking. For some applications, which do little per-record processing themselves, the cost of locking may be prohibitive. For others the locking overhead is minor. The suggestion earlier was that the per-record processing is significant, so parallelizing will likely help.

> Run a coordinating process that creates "n" mailboxes and "n" processes, and each of these created processes connects back into its mailbox and says "hi!" to the server.

Could be even simpler. Literally $TYPE or CONVERT a task-list file into a mailbox. Each idle server grabs a message, processes the chunk of business data associated with it, and looks for the next task. Exit on EOF.
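For what it is worth, such a server can stay entirely in Fortran, since a mailbox is a record-oriented device; a minimal sketch, assuming a pre-created mailbox with the logical name TASK_MBX into which some other process $TYPEs the task list:

      PROGRAM TASKSRV
C     Idle-server loop sketch: read task messages from a mailbox
C     until EOF, processing the chunk of business data each message
C     names. The mailbox logical name TASK_MBX is an assumption.
      CHARACTER*80 TASK
      INTEGER IOS
      OPEN (UNIT=3, FILE='TASK_MBX', STATUS='OLD')
      READ (3, '(A)', IOSTAT=IOS) TASK
      DO WHILE (IOS .EQ. 0)
C         ... process the chunk of business data named by TASK ...
          WRITE (*,*) 'Processing: ', TASK
          READ (3, '(A)', IOSTAT=IOS) TASK
      END DO
      CLOSE (3)
      END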

>> For the first process that queues its "hi", the coordinating process gives it record 1 and the first, say, 1000 records.

Ya but, the suggestion there is that it is hard to recognize the size of ranges from the primary key.
Let's say the primary key is STATE + social security number within state. The first stream could start with AK, the next AL, AR, AZ,... Fortunately CA comes early in the alphabet, but Texas probably hits hard downstream, having about as many folks (23M) as all the states that follow it together. So you probably want to split Texas, but what is the right place for that?!

To divvy it up nicely, you would have to count them, which may be only 1% of the processing cost, but could be 50% of the total cost... on a poorly organized file.

>> Why get fancier than you need to be here splitting up the work, when you can brute-force the parallelism and do nearly as well.

Because of contention and caching-effectiveness reasons? You want the data being processed concurrently to be close, but not too close.

>> And when the code itself can adapt to the file and its contents.

And that was the core of my suggestion.
Establishing a pattern while processing is probably zero overhead. Use that for a good guess on the next run.

Cheers, (almost Miller time here)
Hein.

http://factfinder.census.gov/servlet/GCTTable?_bm=y&-geo_id=01000US&-_box_head_nbr=GCT-T1&-ds_name=PEP_2006_EST&-_lang=en&-format=US-9&-_sse=on
Hoff
Honored Contributor

Re: Obtaining the key of a record about half-way thru a file

>>> Ya but, the suggestion there is that it is hard to recognize the size of ranges from the primary key. <<<

Yes. Though isn't the goal here to keep the processors in the box busy? You might not end up with all the client processes exiting at nearly the same time, but again, is that so critical, as long as the resulting run-time is less than the monolithic run-time?

Who cares if you split (following your example) Texas into one hunk, two hunks, or a hundred smaller hunks?

And over time (days and weeks), you can have the server coordinating the operation develop a better idea of the patterns (either heuristically or based on explicit input from the user) and build up run profiles, if you have particularly disparate processing runs. During one aggregate run (over minutes or hours), you can tell when and how fast the servers are finishing, and can infer how dense the fill might be. From that, you can guess at the spans for successive sections as the client processes run.