Operating System - OpenVMS

Re: File/Record compression for direct access files

 
debu_17
New Member

File/Record compression for direct access files

I generate a large number of direct access files with large records of 25,000+ bytes each.
I am looking for a suitable data compression module that can compress binary data, and that can be called from my existing C/Fortran code.

Thanks in advance for any help.
John Gillings
Honored Contributor

Re: File/Record compression for direct access files

debu,

If by "direct access" you mean RMS indexed files, why not use RMS record compression?

It's built into RMS and transparent to your code.
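
For example, compression can be switched on in the FDL used to create the file. A minimal fragment (the file and key details are placeholders to adapt; the attribute names are standard FDL):

```
FILE
        ORGANIZATION            indexed

KEY 0
        DATA_KEY_COMPRESSION    yes
        DATA_RECORD_COMPRESSION yes
        INDEX_COMPRESSION       yes
```

Then create the file from it with something like $ CREATE/FDL=ARCHIVE.FDL ARCHIVE.DAT (file names hypothetical). The compression then happens inside RMS on every put/get, with no change to the program.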
A crucible of informative mistakes
Joseph Huber_1
Honored Contributor

Re: File/Record compression for direct access files

If the files are indexed-sequential, then follow John's advice.

Otherwise, in Fortran a file opened in direct access mode must have fixed-length records.
So compressing the records (to save space?) is a contradiction to direct access.

To compress records You could use the ZLIB routines, and write the output to sequential, variable-length records.
But of course You lose the direct access feature, i.e. the program logic handling the files has to be changed.
http://www.mpp.mpg.de/~huber
Joseph Huber_1
Honored Contributor

Re: File/Record compression for direct access files

Adding to my reply above:

If You want to convert from the existing method of writing fixed-length records with direct access,
then (in Fortran) the approach would be:
recordtype='VARIABLE'
organization = 'INDEXED'
access = 'KEYED'

Then use an integer at position 0 as the key = record number.
The rest of the record (starting after the key, at position 4) would be the real content.
So the programs just have to change each direct-access read/write into a keyed read/write, using the (integer) record number as the key.

As far as I remember, the default in Fortran for indexed files is "data record compression", so there is nothing else to do to compress.
If the RMS compression turns out to be too weak, You can change the FDL to no data compression (via USEROPEN), and compress/decompress the data portion (not the leading key!) using zlib.
In both cases it is probably advisable to store a record length in front of the data portion.
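
One possible layout for such records, sketched in C: a 4-byte record number (the RMS key) at position 0, a 4-byte payload length at position 4, then the payload. The function names are illustrative, not part of any VMS API.

```c
#include <stdint.h>
#include <string.h>

/* Byte offsets within one keyed record. */
enum { KEY_OFF = 0, LEN_OFF = 4, DATA_OFF = 8 };

/* Build a record in buf; returns the total record length. */
size_t pack_record(unsigned char *buf, uint32_t recnum,
                   const unsigned char *data, uint32_t len)
{
    memcpy(buf + KEY_OFF, &recnum, sizeof recnum);  /* the RMS key   */
    memcpy(buf + LEN_OFF, &len, sizeof len);        /* payload length */
    memcpy(buf + DATA_OFF, data, len);              /* payload        */
    return DATA_OFF + len;
}

uint32_t unpack_key(const unsigned char *buf)
{
    uint32_t k;
    memcpy(&k, buf + KEY_OFF, sizeof k);
    return k;
}

uint32_t unpack_len(const unsigned char *buf)
{
    uint32_t n;
    memcpy(&n, buf + LEN_OFF, sizeof n);
    return n;
}
```

The payload bytes after DATA_OFF can then be the zlib-compressed data, with the stored length telling the reader how much to hand to uncompress().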
http://www.mpp.mpg.de/~huber
Hoff
Honored Contributor

Re: File/Record compression for direct access files

Suitability is something best determined with some additional details; without those, we're just going to toss options at you: zlib, DCX, etc. Google will find various other options here.

For a better-targeted response to your particular question, please consider posting some background on the application and problem statement, the range of the data sizes, the applicable performance requirements, and the particular hardware involved. We might be able to find a faster way to do these 50(?)-block writes for you.

And welcome to ITRC.

Hein van den Heuvel
Honored Contributor

Re: File/Record compression for direct access files

I'll largely echo what has already been indicated, notably by Joseph, but can provide some details and corrections.

1) Fortran direct access is mapped onto RMS relative files. While those support variable-length records (needed for private compression algorithms), the records are actually stored in fixed-length 'cells', nixing the typical space savings.

2) RMS indexed-file data compression MAY be a fine and simple-to-implement alternative, but its compression method is very limited. For data, all it does is replace repeating character sequences: it can store a run of 5 - 255 repeating bytes in 4 bytes. That may or may not help for the targeted data. Also note that indexed files add 11 - 13 bytes of record overhead plus 15 bytes of bucket overhead, versus 1 byte of record overhead for relative files. Not too much given 25,000-byte records, but still...

3) RMS indexed files do allow true variable-length records, which may be all that is needed. But the BUCKETS holding those records will still need to be big enough to hold the largest possible record, or 50 blocks in your case. Compression does not help there unless you specify maximum-record-size = 0 and a bucket size that reflects the largest compressed record.

4A) What problem are you really trying to solve? Storage space needs? Memory usage? Access speed?
4B) How will the data be used? Sequentially, or true direct access? Sharing?
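
To get a feel for how much point 2 buys you, the run-replacement rule can be modelled offline. A sketch of my own model of the "5 - 255 repeating bytes stored in 4 bytes" rule quoted above; this is not the actual RMS algorithm, just an estimator to run over sample records:

```c
#include <stddef.h>

/* Estimate the stored size of a record under a run-compression scheme
   that replaces any run of 5..255 identical bytes with a 4-byte token
   and stores everything else literally. */
size_t rms_estimate(const unsigned char *p, size_t n)
{
    size_t out = 0, i = 0;
    while (i < n) {
        size_t run = 1;
        while (i + run < n && p[i + run] == p[i] && run < 255)
            run++;
        out += (run >= 5) ? 4 : run;   /* token vs. literal bytes */
        i += run;
    }
    return out;
}
```

Running this over representative chunks would show quickly whether the data has enough byte runs for RMS compression to be worthwhile.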

hth,
Hein.
debu_17
New Member

Re: File/Record compression for direct access files

I am very much encouraged by the response and the warm welcome to the forum.
I am very thankful to
John Gillings, Joseph Huber, Hoff and Hein for the very detailed answers/solutions.

First, I will introduce myself briefly.

I have worked with OpenVMS in a real-time environment for the last several years.
The application is real-time lossless reception and real-time processing of continuous satellite baseband and telemetry data.
The hardware environment is a dual-CPU DS25, 4 GB RAM, SCSI-2 disk pack.
The data acquisition interface is a "DRQ-3B" DMA card; the ingest rate is 1 Mbyte/sec continuous.

The question I asked was for the application
(which is tightly coupled with the real-time acquisition processes) that writes the "Data Archive".
The data is continuous in nature, arriving in what are called "format chunks".
The time available for this application to complete a chunk is about 200 msec; a chunk is about 500 kbytes.
I split the chunk into 20 records of 25,000 bytes each (fixed), as this is the nearest I get to a disk-block multiple of 512 bytes.

The write operation is done using Fortran I/O; using some logic the file is always "preopened" via the OPEN statement, to avoid opening the file when the actual file write starts.

The data archival requirement per day is 60 Gbytes.
The data is raw binary, 10-bit (this I compact
to 30/32 bits: 3 words == 1 longword).
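
For reference, that 10-bit-to-longword compaction can be sketched in C as below. The exact bit order used in the application is an assumption (first sample in the low bits here); the top 2 bits of each longword stay zero.

```c
#include <stdint.h>

/* Pack three 10-bit samples into one 32-bit longword, low bits first. */
uint32_t pack3(uint16_t a, uint16_t b, uint16_t c)
{
    return (uint32_t)(a & 0x3FF)
         | ((uint32_t)(b & 0x3FF) << 10)
         | ((uint32_t)(c & 0x3FF) << 20);
}

/* Extract sample i (0, 1 or 2) from a packed longword. */
uint16_t unpack3(uint32_t w, int i)
{
    return (uint16_t)((w >> (10 * i)) & 0x3FF);
}
```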

Now the issue for me is to reduce the write time to the disk file, as I see an I/O bottleneck rather than a CPU availability problem in the system.
Also, if I can compress the data further,
then the 60 GB/day will come down, reducing the frequency of several other operations.

Here I agree with Joseph H. that compressing will result in variable-length records, which as such contradicts the present method of direct-access file writes with fixed record length and the "block size" that I use.
But I can modify that operation by adding another step that compresses the file records after re-reading the archive file
(and using the indexed/keyed access as suggested by J. Huber).
I went through the RMS section, but felt that
the "compress" has more to do with ASCII-type data.

In the meantime I am following up on the suggestions that you all have given.
Thanks again.

Wim Van den Wyngaert
Honored Contributor

Re: File/Record compression for direct access files

Did you try playing with SET RMS_DEFAULT/BLOCK_COUNT?
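
For example (values illustrative; this sets the process-wide RMS defaults):

```
$ SET RMS_DEFAULT/BLOCK_COUNT=32/BUFFER_COUNT=4
$ SHOW RMS_DEFAULT
```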

Wim
Wim Van den Wyngaert
Honored Contributor

Re: File/Record compression for direct access files

To avoid allocation requests, do you use INITIALSIZE=size in the OPEN? (As you know the maximum size in advance, there is no need to ask for more or less.)

Wim
Wim Van den Wyngaert
Honored Contributor

Re: File/Record compression for direct access files

Not clear if deferred write is allowed. If you are allowed to lose some data on a crash, try BUFFERED='YES' (at OPEN).

Maybe post your exact OPEN, WRITE etc. statements?

Wim