
File sizing question

 
Hein van den Heuvel
Honored Contributor

Re: File sizing question

>>> I found one reason: bucket size: the original file had a bucket size of 35 on each area and the new file a bucket size of 3....

Ah! That'll get you. Could that have been caused by an edit/fdl/noint on input data from a small test file? If so, then you just want to change the record count in the input and re-run.

35 is a somewhat suspect number also. It could be just right, of course. It could also be an 'accident' due to a cluster size problem with edit/fdl/nointer. It could have been based on the old VCC IO size cut-off. Worth reviewing!
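
If the FDL did come from a small test file, one way to regenerate it against the real data is roughly this (file names are placeholders; a sketch, not a recipe):

$ ! Capture the real file's statistics, then let EDIT/FDL re-optimize
$ ! the design non-interactively from that analysis.
$ ANALYZE/RMS_FILE/FDL/OUTPUT=BIGFILE_STATS.FDL BIGFILE.IDX
$ EDIT/FDL/ANALYSIS=BIGFILE_STATS.FDL/NOINTERACTIVE BIGFILE_DESIGN.FDL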

Thanks for the follow-up though. The requirement is much clearer now. It brought out that 'history first' suggestion.
I like that a lot, and you may want to put a safe and sound procedure in place around it.

The reason I like it a lot is that it will:
- create a much cleaner file
- give you much more confidence that the final file will be right, and that it will even fit!
- potentially reduce the downtime further
- reduce the total resources significantly.

My cut at Bob's approach:

- Make the conversion program read the original file sequentially (with RRL+NLK, or better still the new NQL option). No need for extra buffers, as sequential $GETs on an indexed file just read a bucket at a time.
There is no need to move the records from the indexed file to a sequential file first.
- Make the conversion program output to a sequential file with a large (120 or 96 or so) block size, WBH, and 2 buffers.
No, scratch that.
Make it write, unshared or with DFW (deferred write), to an indexed file with bucket size 63, all the compression options, and just a primary key, on a temp disk. Just a few (10 - 20) buffers will do.
- Convert the transformed file to the target location (a sketch of the interim FDL and the final convert follows below).
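
A hedged sketch of that interim file and the final convert. The names, record size, allocation, and key layout are all assumptions; adjust them to the real file. TEMP.FDL could look like:

FILE
        ORGANIZATION            indexed

RECORD
        FORMAT                  variable
        SIZE                    200

AREA 0
        ALLOCATION              1000000
        BUCKET_SIZE             63
        EXTENSION               65535

KEY 0
        DUPLICATES              no
        DATA_KEY_COMPRESSION    yes
        DATA_RECORD_COMPRESSION yes
        INDEX_COMPRESSION       yes
        SEG0_LENGTH             8
        SEG0_POSITION           0
        TYPE                    string

Then, since the interim file is already in primary key order, the sort can be skipped:

$ CONVERT/FDL=TARGET.FDL/NOSORT TEMP.IDX TARGET.IDX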

- create a modified transform program that can compare the original with the existing file (a sketch follows below):
o read old, transform
o compare transformed with corresponding record in new file
oo if equal (99%): NEXT record
oo if non-existent: $PUT
oo if different: $UPDATE
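
A minimal DCL sketch of that compare-and-repair loop. Everything here is illustrative: the file names, the 8-byte key at offset 0, and the F$EDIT stand-in for the real transform. A production version would more likely be a 3GL program issuing $GET/$PUT/$UPDATE directly.

$ OPEN/READ OLDF OLD_DATA.IDX
$ OPEN/READ/WRITE NEWF NEW_DATA.IDX
$LOOP:
$ READ/END_OF_FILE=DONE OLDF OLDREC
$ XREC = F$EDIT(OLDREC,"UPCASE")          ! stand-in for the real transform
$ KEY = F$EXTRACT(0,8,XREC)               ! assumed: 8-byte primary key at offset 0
$ READ/KEY="''KEY'"/ERROR=MISSING NEWF NEWREC
$ IF NEWREC .EQS. XREC THEN GOTO LOOP     ! equal (the 99% case): next record
$ WRITE/UPDATE NEWF XREC                  ! different: $UPDATE in place
$ GOTO LOOP
$MISSING:
$ WRITE NEWF XREC                         ! non-existent: $PUT a new record
$ GOTO LOOP
$DONE:
$ CLOSE OLDF
$ CLOSE NEWF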

- The comparison will be fast, as both files are in the same order.
- Updating is generally easier than creating a fresh record.
- Inserting will be easier, as the index structures will be 99% in place.
- Inserting is LIKELY at the 'far right', which is easy.

In the 'new data first' plan, with back-filling of the old data, those inserts will be out of order, causing bucket splits all the time in areas of the file where a dense load is preferred.


Finally... as a general observation, not for the purpose of this exercise...
Would it not be great if some applications could partition their data into an old and a new range? Either with a fixed cutoff (date, item-number) or a dynamic one (look in new first; if not found, look in old). A little bit like a VMS search list works: create new files in the first directory, but look for older files in all directories on the list.
Doing so, you could 'freeze' the bulk of your data (90%? 99%?). That chunk would no longer need to be backed up or converted,
or only at a much lower rate. (AI journalling could further reduce the need for re-backup.)
The new data, which needs more frequent backups and converts, would be much smaller and easier to deal with.
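
For illustration, a search list doing exactly that; the logical and directory names are made up:

$ ! New files are created in the first directory; lookups scan both.
$ DEFINE APP_DATA DISK$FAST:[APP.CURRENT], DISK$ARCHIVE:[APP.HISTORY]
$ ! An open of APP_DATA:ITEM.DAT finds it in CURRENT first, else in HISTORY.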

Cheers,
Hein.
Robert Gezelter
Honored Contributor

Re: File sizing question

Hein,

A footnote on your recent comment. I have seen cases, particularly ones such as this that are extremely IO bound, where increasing BUFFERCOUNT and BUFFERSIZE substantially improves throughput, hence my recommendation.

Even using files with large extents, I have seen this effect.
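
One process-wide way to experiment with those values from DCL, as a starting point only (the numbers are illustrative and need tuning per workload):

$ SET RMS_DEFAULT/INDEXED/BUFFER_COUNT=50
$ SET RMS_DEFAULT/SEQUENTIAL/BLOCK_COUNT=127/BUFFER_COUNT=4
$ SHOW RMS_DEFAULT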

- Bob Gezelter, http://www.rlgsc.com
Willem Grooters
Honored Contributor

Re: File sizing question

Suggestions taken:
* Create an FDL of the file and take a good look at the area sizing.
* Update the sizing parameters of the new file according to these numbers.
* CONVERT/FDL the file.

This did the trick. The convert could be measured in minutes instead of hours.
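
For reference, that sequence in DCL form (file names assumed):

$ ANALYZE/RMS_FILE/FDL/OUTPUT=CURRENT.FDL BIGFILE.IDX   ! capture the real sizing
$ EDIT/FDL CURRENT.FDL                                  ! review/adjust the areas
$ CONVERT/FDL=CURRENT.FDL/STATISTICS BIGFILE.IDX NEWFILE.IDX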

The recommendation has been added to the manuals.

Willem Grooters
OpenVMS Developer & System Manager