Operating System - OpenVMS
File/Record compression for direct access files

 
debu_17
Occasional Visitor

File/Record compression for direct access files

I generate a large number of direct-access files with large record sizes of 25,000+ bytes per record.
I am looking for a suitable data compression module which can compress binary data, and which can be called from my existing C/Fortran code.

Thanks in advance for any help.
17 REPLIES
John Gillings
Honored Contributor

Re: File/Record compression for direct access files

debu,

If by "direct access" you mean RMS indexed files, why not use RMS record compression?

It's built into RMS and transparent to your code.
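As a sketch, an FDL fragment along these lines turns data record compression on (the key position, length, and record format here are placeholders for your actual layout); apply it with CREATE/FDL or CONVERT/FDL:

```
FILE
        ORGANIZATION            indexed

RECORD
        FORMAT                  variable

KEY 0
        SEG0_LENGTH             4
        SEG0_POSITION           0
        DATA_RECORD_COMPRESSION yes
```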
A crucible of informative mistakes
Joseph Huber_1
Honored Contributor

Re: File/Record compression for direct access files

If the files are indexed-sequential, then follow John's advice.

Otherwise, in Fortran a file opened in direct access mode must have fixed-length records.
So compressing the records (to save space?) is a contradiction to direct access.

To compress records you could use the ZLIB routines, and write out to sequential, variable-length records.
But of course you lose the direct access feature, i.e. the program logic handling the files has to be changed.
http://www.mpp.mpg.de/~huber
Joseph Huber_1
Honored Contributor

Re: File/Record compression for direct access files

Adding to my reply above:

If you want to convert the existing method of writing fixed-length records with direct access,
then (in Fortran) the approach would be:
recordtype='VARIABLE'
organization = 'INDEXED'
access = 'KEYED'

Then use an integer at position 0 as the index (= record number).
The rest of the record (starting after the key, at position 4) would be the real content.
So the programs just have to change the direct-access read/write into a keyed read/write using the (integer) record number as the key.

As far as I remember, the default in Fortran for indexed files is "data record compression", so there is nothing else to do to compress.
If the RMS compression turns out to be too weak, you can change the FDL to no data compression (via USEROPEN), and compress/decompress the data portion (not the leading key!) using zlib.
In both cases it is probably advisable to add a record length in front of the data portion.
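A sketch of that record layout in C (all names hypothetical): a 4-byte record-number key at position 0, then a 4-byte length field, then the (possibly compressed) data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build one keyed record: [int32 record number][int32 data length][data...].
   The record number at offset 0 serves as the RMS key; the explicit length
   is useful once the data portion is compressed. Returns the total record
   length, or 0 if it does not fit in the buffer. */
size_t build_keyed_record(unsigned char *rec, size_t cap, int32_t recnum,
                          const unsigned char *data, int32_t len)
{
    if (cap < 8 + (size_t)len)
        return 0;
    memcpy(rec,     &recnum, 4);         /* key at position 0 */
    memcpy(rec + 4, &len,    4);         /* original (uncompressed) length */
    memcpy(rec + 8, data, (size_t)len);  /* data portion */
    return 8 + (size_t)len;
}
```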
http://www.mpp.mpg.de/~huber
Hoff
Honored Contributor

Re: File/Record compression for direct access files

Suitability is something best determined with some additional details; without those, we're just going to toss options at you: zlib, dcx, etc. Google will find various other options here.

For a better-targeted response to your particular question, please consider posting some background on the application and problem statement, the range of the data sizes, the applicable performance requirements, and the particular hardware involved. We might be able to find a faster way to do these ~50-block writes for you.

And welcome to ITRC.

Hein van den Heuvel
Honored Contributor

Re: File/Record compression for direct access files

I'll have to largely echo what has already been indicated, notably by Joseph, but I can provide some details and corrections.

1) Fortran direct access is mapped onto RMS relative files. While those support variable-length records (needed for private compression algorithms), the records are actually stored in fixed-length 'cells', nixing the typical gains.

2) RMS indexed-file data compression MAY be a fine and simple-to-implement alternative, but its compression method is very limited. For data, all it does is replace repeating character sequences: a run of 5 - 255 repeating bytes is stored in 4 bytes. That may or may not help for the targeted data. Also note that indexed files add 11 - 13 bytes of record overhead plus 15 bytes of bucket overhead, versus 1 byte of record overhead for relative files. Not too much given 25,000-byte records, but still...

3) RMS indexed files do allow true variable-length records, which may be all that is needed. But the BUCKETS holding those records will still need to be big enough to hold the largest possible record, or 50 blocks in your case. Compression does not help there unless you specify maximum-record-size = 0 and a bucket size that reflects the largest compressed record.

4A) What problem are you really trying to solve? Storage space needs? Memory usage? Access speed?
4B) How will the data be used? Sequentially, true direct access? Sharing?

hth,
Hein.
debu_17
Occasional Visitor

Re: File/Record compression for direct access files

I am very much encouraged by the response and warm welcome to the forum.
I am very thankful to
John Gillings, Joseph Huber, Hoff, and Hein for the very detailed answers/solutions.

First I will introduce myself briefly.

I have worked with OVMS in a real-time environment for the last several years.
The application is RT lossless reception and RT processing of continuous satellite baseband and telemetry data.
The hardware environment is a DS25, dual CPU, 4 GB RAM, SCSI-2 disk pack.
The data acquisition interface is a "DRQ-3B", DMA; the ingest rate is 1 Mbyte/sec continuous.

The question I asked was for the application
(which is tightly coupled with the RT acquisition processes) which writes the "Data Archive".
The data is continuous in nature, in what are called "format chunks".
The time available for this application to complete is about 200 msec per "chunk"; the size is about 500 Kbytes.
I split the "chunk" into 20 records of 25000 bytes each (fixed), as this is the nearest I get to a multiple of the 512-byte disk block.

The write operation is done using the Fortran OPEN statement. Using some logic, the file is always "preopened" to avoid opening the file when the actual file write starts.

The data archival requirement per day is 60 GBytes.
Data is raw binary, 10-bit (this I compact
to 30/32 bits, 3 words == 1 longword).
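That 10-bit compaction (three 10-bit words per 32-bit longword, so 30 of the 32 bits are used) can be sketched in C as follows; the function names are illustrative, not from the actual application:

```c
#include <assert.h>
#include <stdint.h>

/* Pack three 10-bit samples into one 32-bit longword (bits 30-31 unused). */
uint32_t pack3x10(uint16_t a, uint16_t b, uint16_t c)
{
    return ((uint32_t)(a & 0x3FF))
         | ((uint32_t)(b & 0x3FF) << 10)
         | ((uint32_t)(c & 0x3FF) << 20);
}

/* Unpack sample i (0..2) from a packed longword. */
uint16_t unpack10(uint32_t w, int i)
{
    return (uint16_t)((w >> (10 * i)) & 0x3FF);
}
```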

Now the issue for me is to reduce the write time to the disk file, as I see more of an I/O bottleneck than a CPU availability problem in the system.
Also, if I can compress the data further,
then the 60 GB/day will come down, reducing the frequency of several other operations.

Here I agree with Joseph H. that compressing will result in variable-length records, which is contradictory to the present method of direct-access file writes with the fixed record length and "block size" that I use.
But that operation I can modify by creating another step, compressing the file records after re-reading the archive file
(and using the indexed/keyed access as suggested by J. Huber).
I went through the RMS section, but felt that
the "compress" has more to do with ASCII-type data.

In the meantime I am following up on the suggestions that you all have given.
Thanks again.


Wim Van den Wyngaert
Honored Contributor

Re: File/Record compression for direct access files

Did you try playing with set rms/block ?

Wim
Wim Van den Wyngaert
Honored Contributor

Re: File/Record compression for direct access files

To avoid allocation requests, do you use INITIALSIZE=size in the OPEN? (As you know the maximum size in advance, there is no need to ask for more or less.)

Wim
Wim Van den Wyngaert
Honored Contributor

Re: File/Record compression for direct access files

Not clear if deferred write is allowed. If you are allowed to lose some data on a crash, try BUFFERED='YES' (at OPEN).

Maybe post your exact OPEN, WRITE, etc. statements?

Wim
Joseph Huber_1
Honored Contributor

Re: File/Record compression for direct access files


I have the feeling that the data writing is really sequential in nature, so a direct access method is not necessary for this step.
A sequential, variable-length, unformatted write of compressed data will therefore gain a lot in speed, depending on the 'compressibility' of the data.
In the less time-critical (?) phase of reading/analyzing the data later, you have to pay for the missing direct access, of course.

To see how well the data really compresses, I would simply run gzip on a whole file and look at the compression percentage.
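A quick way to run that test on a POSIX box (gzip ports also exist for VMS); the file name and contents here are only stand-ins for one of the real archive files:

```shell
# Stand-in for one ~500 Kbyte archive chunk; substitute a real data file
# (zero-filled data is a best case, real telemetry will compress less).
head -c 500000 /dev/zero > chunk.dat

# Compress a copy and compare sizes; the ratio approximates what the
# same algorithm (deflate, as used by zlib) would achieve per record.
gzip -c chunk.dat > chunk.dat.gz
ls -l chunk.dat chunk.dat.gz
```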

Finally, a possibility to achieve both goals, speed of writing and direct access to individual records:
avoid record I/O altogether, and do block-mode I/O.
Reserve some blocks at the beginning of the file (depending on the maximum number of records/measurements to store) for a directory of block numbers. Then write each variable-length (compressed) measurement into the next integral number of blocks in the file, remembering the starting block number in the "directory". At file close time, write the directory blocks to the file blocks reserved at the beginning of the file.
This way the reading/analyzing programs can randomly access a certain record almost as easily as with a Fortran direct-access read.
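A sketch of the bookkeeping for that scheme in C (all names and sizes hypothetical): reserve directory slots up front, then record each record's starting block and block count as it is claimed; the directory itself is written to the reserved blocks at close time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIR_SLOTS 4096   /* max records per file; sized for the reserved blocks */

typedef struct {
    uint32_t start_vbn[DIR_SLOTS];  /* starting virtual block of each record */
    uint16_t nblocks[DIR_SLOTS];    /* whole 512-byte blocks that record uses */
    uint32_t count;                 /* records claimed so far */
    uint32_t next_vbn;              /* next free block after the directory */
} block_dir;

/* Claim space for one record of byte_len bytes; returns its starting block,
   or 0 when the directory is full. */
uint32_t dir_claim(block_dir *d, uint32_t byte_len)
{
    if (d->count >= DIR_SLOTS)
        return 0;
    uint32_t need = (byte_len + 511) / 512;   /* round up to whole blocks */
    uint32_t vbn  = d->next_vbn;
    d->start_vbn[d->count] = vbn;
    d->nblocks[d->count]   = (uint16_t)need;
    d->count++;
    d->next_vbn += need;
    return vbn;
}
```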
If you do not want to program this block I/O in Fortran, packages to do block I/O can be found on the Freeware and VMS SIG disks ( http://mvb.saic.com ): look for *BIO*, or
http://wwwvms.mppmu.mpg.de/vmssig/src/for/lib_routines/bio.for

or my own

http://wwwvms.mppmu.mpg.de/~huber/util/bio.for

http://www.mpp.mpg.de/~huber
Robert Gezelter
Honored Contributor

Re: File/Record compression for direct access files

Debu,

Using multi-buffering and deferred write is an option, particularly if the data arrives in a "bursty fashion". It also helps deal with the inevitable unpredictability of disk access, where there are many reasons for delays, besides file open and extension requests.

This may be an application that cannot be run in a single synchronous thread, and may be a good candidate for an event-based restructuring. Additionally, compression algorithms have a tendency to be CPU intensive, and many off-the-shelf compression schemes are not coded with performance as their primary goal.

Having worked with various realtime data feeds many times in the past, there are many options, not all of which are applicable in each situation.

For a starting point, your recent posting says that over 60GB a day is the data rate. Is this continuous, or is it some number of kilo/megabytes per minute? The time behavior of the data feed does make a HUGE difference.

- Bob Gezelter, http://www.rlgsc.com
Hein van den Heuvel
Honored Contributor

Re: File/Record compression for direct access files


>> Rt acq. processes) which writes the "Data Archive". The data is continuous in nature,

So it does not sound like direct access is truly needed. Just write the data as it comes, maybe leaving a breadcrumb behind for a (time) index table of sorts at the beginning of the file (using RFAs?), or in a parallel file.

>> The time available for this application to complete is about 200 msec. for a "chunk" . size is about 500kbytes.

So that's like a walk in the park! Too easy... as long as no crazy extends/fragmentation is happening.

>> I split the "chunk" to 20 records, of 25000bytes each(fixed),as this is the nearest i get to a disk block multiple of 512 bytes.

PLEASE do yourself a favor and forget about the 512 thingie. Honestly. Trying to exploit that, and not getting it exactly right, is the single biggest 'fake tuning' error I see being made over and over.

Write as much as allowed. If using RMS (through the Fortran RTL) then 32K is the max; use that. RMS will deal with crossing block boundaries. Let RMS do its multibuffer thing and WRITE BEHIND, and be happy. 4 - 10 buffers of 96, 124 or 127 blocks each should do the trick. Try each if possible. 127 is likely best; 124 makes it even, which helps certain disk I/O subsystems; and 96 makes work for the XFC slightly easier IF writes start at block 1.


>> The write operation is done using the Fortran OPEN statement. Using some logic, the file is always "preopened" to avoid opening the file when the actual file write starts.

IF there is a serious I/O issue, then you may want to consider QIOs, or even IO_PERFORM, because they allow for larger I/O sizes, but it sounds like allowing RMS to do its thing is all you need.

>> Now the issue for me is to reduce the write time to disk file, as i find more I/O bottleneck, rather than CPU availability in the system.

Right. So make the IO LARGE.
Please check the file with DIR/FULL... is it relative or sequential? If it is relative, make sure the bucket size is 63 and the record size at most 63*512 - 1. For sequential files, pick any record size larger than 4K and just try: SET RMS/SEQ/BLOCK=127
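In DCL that would look something like the following sketch (process-level RMS defaults; the counts are the ones suggested above and are worth benchmarking rather than taking as given):

```
$ SET RMS_DEFAULT /SEQUENTIAL /BLOCK_COUNT=127 /BUFFER_COUNT=8
$ SHOW RMS_DEFAULT
```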

>> Also, if I can compress the data further, then the 60 GB/day will come down, reducing the frequency of several other operations.

So do some test compressions, as Joseph suggests. Try several algorithms, as you appear to have non-standard data.

>> Here I agree with Joseph H. that compressing will result in variable-length records,

Ayup.

>> I went through the RMS section, but felt that the "compress" has more to do with ASCII-type data.

Agreed. It does not sound like RMS record compression will help one bit (sic).

Bob>> Using multi-buffering and deferred write

Semantics alert... I'm sure Bob meant WRITE BEHIND.
The RMS DEFERRED WRITE option only works in a SHARED context with full locking, so you don't want that (CPU) overhead. And furthermore it just fills buffers and postpones; it does not initiate I/O behind your back. So it literally does not help at all for continuous streams.

Good luck!
Hein.
Wim Van den Wyngaert
Honored Contributor

Re: File/Record compression for direct access files

Maybe also disable the XFC: do a SET FILE /CACHING_ATTRIBUTE=NO_CACHING on the directory the files are in. This disables XFC for all new files created in the directory. There is no need to dirty the XFC with large files you're not going to read again, and you'll get better performance for the files that do need the XFC.

And defragment your disks if that's not done already, to avoid head movements.

Wim
Robert Gezelter
Honored Contributor

Re: File/Record compression for direct access files

Hein,

Bob did mean "Write Behind", that is, deferring the write as opposed to running synchronously with blocking waits.

While this type of situation IMHO is best implemented using reader and writer threads, RMS write behind gives a reasonable facsimile by avoiding most scheduling blocks.

- Bob Gezelter, http://www.rlgsc.com
Hoff
Honored Contributor

Re: File/Record compression for direct access files

Random comments...

Deja vu; I worked in the background some years ago with some folks on a DR-based application structured very similarly to what you describe. They had been upgrading their application for some years, and were still receiving buckets of RT data. (When I started with them, their receive data rates were a sizable chunk of the available backplane speeds on the boxes they were then using.)

First off, I'd run some tests and see how much "slop" I had to work with here.

I'd probably not be initially looking to use RMS or the file-system calls here; not as the primary design center. There's a whole lot more around processing the data that's key to what's going on than the file calls. RMS or XQP or $qio calls are a very small fraction of the whole of the application here. (Fussy and important, yes, but...)

Compression can potentially help here if it's the I/O bandwidth in from or out to the spindle that's the limit. Given your comments, I'm going to assume you're secondary to the "real time" data receiver process(es); you're relatively asynchronous to that processing, and have some slop in how fast you can get data out. (If your code is primary, then what I'm referring to here would change.)

If I/O bandwidth is the limit, I'd look to upgrade the gear before spending a whole lot on re-coding or up-rating the software for the box. Striping and faster I/O devices would be the obvious path; controller RAID-0 or RAID-10, for instance. Or faster or SSD storage; some of the PCIe SSD devices, for instance. (You may or likely have already looked at this. And your box doesn't have PCIe; it's using the older and slower PCI-X bus.)

Here, if faster I/O doesn't cover your speeds and feeds, each record gets a record header and the compressed data; basically you pick your compression, and you write your buffers out of a ring buffer "behind" your compression. (If this is the same application I worked with, this is what I suggested last time.) You need to track your compression buffers, your headers, and your in-flight and next write locations. This code is fussy. (If you choose RMS, you can avoid this, as RMS happily deals with block spans and all that dreck for you. If you choose block I/O or $qio, you'll likely end up here.)
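A bare-bones sketch of that ring-buffer bookkeeping in C (slot count, sizes, and names are illustrative; real code also needs interlocking between the compressor and writer threads, omitted here):

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS      8       /* in-flight buffers between compressor and writer */
#define SLOT_BYTES 32000   /* each holds one header plus one compressed record */

typedef struct {
    uint32_t seq;          /* record sequence number */
    uint32_t clen;         /* compressed byte count that follows the header */
} rec_header;

typedef struct {
    unsigned char buf[SLOTS][SLOT_BYTES];
    uint32_t      len[SLOTS];
    uint32_t      head;    /* producer (compressor) fills at head */
    uint32_t      tail;    /* consumer (writer) drains at tail */
} ring;

/* Producer: index of the next free slot, or -1 if the writer has fallen behind. */
int ring_acquire(const ring *r)
{
    return (r->head - r->tail >= SLOTS) ? -1 : (int)(r->head % SLOTS);
}

/* Producer: publish the filled slot. */
void ring_commit(ring *r, uint32_t nbytes)
{
    r->len[r->head % SLOTS] = nbytes;
    r->head++;
}

/* Consumer: index of the next slot to write out, or -1 when empty. */
int ring_drain(const ring *r)
{
    return (r->tail == r->head) ? -1 : (int)(r->tail % SLOTS);
}

/* Consumer: mark the slot's write as complete. */
void ring_release(ring *r)
{
    r->tail++;
}
```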

If software, this can be done in the application, or you could conceivably use a compressing pseudo-device: the application tosses block data at something that looks like a disk, and the pseudo-device and its ACP compress it. (This is a gonzo-scale, software-based, kernel-mode approach, but it allows you to entirely separate your application code from your compression. There are other and easier ways to get modular compression, obviously. But this one looks like a disk to any application referencing the data store.)

If you have CPU cycles to spare now, you probably won't when you bring software compression on-line, regardless of whether you use application code, a shareable image, or a pseudo-device/ACP approach. You might also look at using Macro64 and the EV56+ instruction set if the compilers won't generate these for you and you can't find existing code; an instruction subset or two was added for multimedia-related work. (This is if you really want to tie your code to Alpha and need serious go-fast.)

The stuff I'm working with now tends to generate 8 GB per hour per channel; that's a continuous and uncompressed rate. And the gear I work with can deal with multiple streams. (This bandwidth is not an unusual application in the present day, either.)

Another option here is semi-custom or dedicated hardware compression; where you off-load the problem to another widget. Depending on your data (and your lossless or lossy requirements) and your speeds and feeds, you might be able to utilize some of the available hardware. eg: JPEG2000 lossless.

If you go for compression, read up on sliding-window lossless LZ1 (LZ77) compression as one potential option, and look at the other available lossless compression schemes. There are folks that do this compression in hardware (or firmware), too. And yes, it's entirely feasible to implement it in software.

And if you don't really need compression, definitely don't go there. Not until you're off-line.

This is headed toward a more detailed design and some prototypes; toward a look at the disk spiral transfer rates and any planned changes in speeds and feeds, and at the availability of a test system and test data or test (or teed) data feeds.

Stephen Hoffman
HoffmanLabs LLC

Phil.Howell
Honored Contributor

Re: File/Record compression for direct access files

Is there any scope for either buying faster disks or incorporating a RAID controller?
Otherwise look at increasing the RMS block and buffer counts, and use write-behind.
Phil
debu_17
Occasional Visitor

Re: File/Record compression for direct access files

I was away from the computer for some time, so I could not reply.
Many thanks for all the suggestions from all of you.
I am going through each in detail, and I have saved the thread so far to a local PC disk.
I will try out some of the suggestions.

Only one point here: the data rate is not a "walk in the park", as the hardware does not have any memory. Even at this low data rate by today's burst standards, it needs to be understood that the acquisition task handles continuous data, i.e. at the word level the assured response time required is about 1.9 usec from the server. The acquisition task works well; by avoiding OVMS internals and using QIOs, ASTs, and my home-made flagging system (stored in system-wide global pages), it works 24x365.

I will get back after studying the options suggested by all of you.

thanks again.