splitting up files with unknown record type

 
Peter Hofman
Frequent Advisor

splitting up files with unknown record type

Hi all,
I am wondering what options there are to split a file into smaller parts.
We have a script that collects files from different locations (nodes/disks) and puts all those files together in a single file, using APPEND. But since that does not work with the original file organisation, the record type is set to unknown or unformatted. The append will then work. Usually a maximum file size is configured in the script, but sometimes this is forgotten and the result is a file too large to handle.

So my question is: how can I split it up again if the organisation is unformatted?
(I will get back to you on the exact file attributes. I do not have the info at hand right now.)
John Gillings
Honored Contributor

Re: splitting up files with unknown record type

Peter,
I'm not sure I understand your question. Obviously you must have some criteria that you use to decide where the file gets split. Typically, I'd expect the split point to be at a record boundary. But if the file doesn't have record structure, how can that work?

Depending on the contents of the file, you might be able to use SET FILE to change the attributes to some "standard" format. You can then have a program read the file, interpret the data and split it up any way you like. Choosing the "best" format depends on your data.

Please post the exact file attributes of the files you're dealing with, and describe how you know where to split the data. Anything is possible!
A crucible of informative mistakes
Hein van den Heuvel
Honored Contributor

Re: splitting up files with unknown record type

Ditto.

I'm a little surprised that you opened this topic with half a story: no exact commands, no numbers for what is 'too large to handle', no exact error message showing why a normal append did not work, and no indication of what might be in the individual files that could set them apart.

For further help, show us the head and tail of a typical file, the DIR/FULL output for an aggregate file, and a DUMP of a block or two in an attached text file. Stuff like that.

Also, VMS itself has no notion of 'too large to handle', so please help us understand why your application has that notion.
When those files are not too large, how do you use them? TYPE/PAGE, EDIT, or an application only?

VMS has no native 'split' command, but it is trivial to write one in DCL or Perl that counts records and/or bytes and hunts for record endings. If the file has no structure, then I'd grab Perl with 'binmode'. Or I'd change the file attributes to fixed-512 and use DCL reads to read 512-byte chunks and look for structure within those chunks.

Cheers,
Hein.

Peter Hofman
Frequent Advisor

Re: splitting up files with unknown record type

John and Hein,

You're both right. However, I entered it when I was at home. I remembered somebody asking me this question, but I did not have a chance to look into it, and I did not want to forget to enter the topic or find a solution. I think I should have sent an e-mail to my e-mail address at work.

Anyway, since I have entered the topic, we might as well continue.

More info:
1) This is how the script does the append:

$ rfm=F$FILE_ATTRIBUTES(in_file, "RFM")
$ SET FILE/ATTRIBUTE=(RFM:UDF) 'in_file'
$ IF F$SEARCH(outfile).NES."" THEN -
$ SET FILE/ATTRIBUTE=(RFM:UDF) 'outfile'
$ APPEND /LOG /NEW_VERSION 'in_file' 'outfile'
$ IF $STATUS
$ THEN
$ DELETE /LOG 'in_file'
$ ENDIF
$ SET FILE/ATTRIBUTE=(RFM:'rfm') 'outfile'


2) File organisation of in and output files:

File organization: Sequential
Shelved state: Online
Caching attribute: Writethrough
File attributes: Allocation: 0, Extend: 0, Global buffer count: 0, No version limit, Contiguous best try
Record format: Stream_LF, maximum 0 bytes, longest 0 bytes
Record attributes: Carriage return carriage control
RMS attributes: None
Journaling enabled: None
File protection: System:RWED, Owner:RWED, Group:RE, World:
Access Cntrl List: None
Client attributes: None

3) "Too large to handle" is in terms of the application that has to deal with the file. If the file is too large, the application runs out of memory.



Of course I should have gathered all the info before entering this topic. It is clear to me now that without knowing how the file is built up, it is impossible to split.
The data in the file is ASN.1 encoded, so I'll probably have to write something to split the file. It probably cannot be done with some simple scripting.

I am always interested in DCL examples for splitting files (or anything else).

Cheers,
Peter
Antoniov.
Honored Contributor

Re: splitting up files with unknown record type

Peter,
I don't understand the scope of your application, but it seems you are building your own library of files. Have you thought of using a library instead of APPEND?
Look at this example:
$ LIBR/CREA/TEXT infile.TLB
$ LIBR/INS infile.TLB outfile1
$ LIBR/INS infile.TLB outfile2
Now infile contains 2 files.
You can split (extract) typing:
$ LIBR/EXTR=outfile1 infile.TLB

This is only an example and perhaps it can't help you, but it might be an idea worth working out.

Antonio Vigliotti
Antonio Maria Vigliotti
Peter Hofman
Frequent Advisor

Re: splitting up files with unknown record type

Nope,
it is not a library.
The process that creates the files has certain settings that have influence on the file size. The settings are either a number of ASN.1 encoded records in the file or the time the file is allowed to be open. In either case the file is closed and a new file is created.
There is more than one process doing that. All the files created by those processes are collected from different nodes/disks. The script that does the collecting can be configured to append files (to prevent, e.g., the disk index from running full).
After that the files are transferred to a system that processes the data in it.

Peter
Antoniov.
Honored Contributor

Re: splitting up files with unknown record type

Hi Peter,
do you need to read a single file that contains all the included (appended) files?
If not, a library would work fine for you.

Antonio Vigliotti
Antonio Maria Vigliotti
Bojan Nemec
Honored Contributor

Re: splitting up files with unknown record type

Peter,

If the record formats of the files are not equal, you get a corrupted file from the append command, because SET FILE /ATTRIBUTES does not change the file structure; only the file attributes are changed.

I made a small test:

I created a small sequential file with VARIABLE record format:
$ CREATE A.TMP
aaaaaaaaaaaa
bbbbbbbbbbbb


Then I created another one with the CONVERT utility:

$ CONVERT/FDL=STREAM.FDL A.TMP B.TMP

Content of STREAM.FDL:

IDENT "16-JUL-2004 09:45:19 OpenVMS FDL Editor"

SYSTEM
SOURCE "OpenVMS"

FILE
ALLOCATION 0
BEST_TRY_CONTIGUOUS yes
EXTENSION 0
ORGANIZATION sequential

RECORD
BLOCK_SPAN yes
CARRIAGE_CONTROL carriage_return
FORMAT stream
SIZE 0


After that I appended the files with your script, with in_file="B.TMP" and outfile="A.TMP".

The resulting file was corrupted, with this content:


aaaaaaaaaaaa
bbbbbbbbbbbbaaaaaaaaaaaa
bbbbbbbbbbbb


So I think that you must modify the append script! First you must find out which organisation and record format is needed after the append, then create an FDL for it and convert all files with this FDL before appending.

Bojan Nemec
Peter Hofman
Frequent Advisor

Re: splitting up files with unknown record type

Both input and output have already got the same file attributes. Both are STREAM_LF. So no conversion is needed.

I think I cannot split the file with an existing VMS command, nor with a DCL script.

I think I will write a little program that understands ASN.1 and can find the end of an ASN.1 record.
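
A minimal sketch of what "finding the end of an ASN.1 record" could look like in C. It assumes things the thread does not confirm: that the data is BER/DER encoded with definite lengths, and that the records are simply top-level elements stored back to back. The function names are made up for illustration.

```c
#include <stddef.h>

/* Return the total encoded size (tag + length + value) of the BER element
   starting at p, or 0 if it is truncated or uses a form not handled here. */
size_t ber_element_size(const unsigned char *p, size_t avail)
{
    size_t pos = 0;
    size_t len = 0;
    unsigned char first;

    if (avail < 2) return 0;

    /* Identifier octets: low five bits all set means a multi-byte tag. */
    if ((p[pos++] & 0x1F) == 0x1F) {
        do {
            if (pos >= avail) return 0;
        } while (p[pos++] & 0x80);
    }

    if (pos >= avail) return 0;
    first = p[pos++];

    if (first < 0x80) {
        len = first;                  /* short form: length in one octet */
    } else if (first == 0x80) {
        return 0;                     /* indefinite length: not handled */
    } else {
        size_t n = first & 0x7F;      /* long form: n length octets follow */
        if (n > sizeof(size_t) || n > avail - pos) return 0;
        while (n-- > 0) len = (len << 8) | p[pos++];
    }

    if (len > avail - pos) return 0;  /* value runs past the buffer */
    return pos + len;
}

/* Walk the top-level elements in buf and return the largest offset that is
   both on an element boundary and not beyond limit. */
size_t ber_split_offset(const unsigned char *buf, size_t size, size_t limit)
{
    size_t off = 0;

    while (off < size) {
        size_t n = ber_element_size(buf + off, size - off);
        if (n == 0 || off + n > limit) break;
        off += n;
    }
    return off;
}
```

With a file (or a large chunk of it) read into memory, `ber_split_offset(buf, size, limit)` gives the number of bytes that can go into the current part without cutting a record in half. If the real files use indefinite lengths or wrap the records in an outer element, the walk would have to be adapted.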
Antoniov.
Honored Contributor

Re: splitting up files with unknown record type

Peter,
you posted that the input file and the output file have the same format (Stream_LF); why do you set them to UDF before the append? Your target file then still remains UDF, not Stream_LF; I guess that if you set your target file to Stream_LF you can read and split it (using an appropriate DCL procedure).

Antonio Vigliotti
Antonio Maria Vigliotti
Wim Van den Wyngaert
Honored Contributor

Re: splitting up files with unknown record type

Bojan,

I don't think so because he said it was stream_lf.

You can ftp the file to unix and split it over there.

You could use DCL to split it (but create the output file with an FDL for Stream_LF, because the default is VFC):
$ i=-1
$ nam=0
$ open/read inp 'p1'
$r:
$ i=i+1
$ read/end=e inp rec
$ j=(i/10)*10
$ if j .eq. i
$ then
$ if f$tr("outp") .nes. "" then close outp
$ nam=nam+1
$ open/write outp 'p2'_'nam'
$ endif
$ write outp rec
$ goto r
$e:
$ close inp
$ close outp

Wim
Wim
Bojan Nemec
Honored Contributor

Re: splitting up files with unknown record type

Peter,

Sorry, I missed that all the files are STREAM_LF. I also think that you must write a program.

If your files have records that are too long (more than 32,767 bytes), you will have problems reading them. I had the same problem reading XML files that are not formatted for humans (no LF).


Here are some pieces of such a program in C:

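#include <rms.h>      /* FAB, RAB, cc$rms_fab, cc$rms_rab, RMS$_EOF */
#include <starlet.h>  /* sys$open, sys$connect, sys$read, sys$exit */
#include <string.h>   /* strlen, memcpy */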
.
.
.
struct FAB infab;
struct RAB inrab;
int stat;
char buffer[1024];
char * filename;
.
.
.
.
infab = cc$rms_fab;
infab.fab$l_fna = filename;
infab.fab$b_fns = strlen(filename);
infab.fab$b_fac = FAB$M_BIO|FAB$M_GET;
inrab = cc$rms_rab;
inrab.rab$l_fab = &infab;
inrab.rab$l_bkt = 0;
inrab.rab$l_ubf = buffer;
inrab.rab$w_usz = 1024;

stat = sys$open (&infab);
if (!(stat & 1)) sys$exit (stat);
stat = sys$connect (&inrab);
if (!(stat & 1)) sys$exit (stat);

for (;;)
{
    stat = sys$read (&inrab);
    if (!(stat & 1))
    {
        if (stat == RMS$_EOF)
        {
            /* End of file */
            break;
        } else {
            sys$exit (stat);
        }
    } else {
        /*
        Process data in buffer.
        The length of the buffer is in inrab.rab$w_rsz
        For example if you want to copy the buffer to another buffer

        memcpy (newbuffer , buffer , inrab.rab$w_rsz);
        */
    }
}

For more on programming with RMS look at:
http://h71000.www7.hp.com/doc/731FINAL/4523/4523PRO.HTML

Bojan Nemec
Wim Van den Wyngaert
Honored Contributor

Re: splitting up files with unknown record type

http://wwwvms.mppmu.mpg.de/vmssig/archive/S/
and search for split. I am unable to get it over here but maybe it is what you are looking for.

Wim
Wim
Hein van den Heuvel
Honored Contributor

Re: splitting up files with unknown record type

Back to the basics... I'd like to focus on:

" But since that is not working with the original file organisation, the record type is set to unknown or unformatted. "

Are you sure that did not work, or was someone confused by the warning messages?
Witness log below.

The really scary thing to mix/append through this 'UDF' lie is stream and variable-length (the VMS default) files. This is because variable-length record files have meta-data in the file (the record length), and you turned that into user data. Mixing various stream formats, and the file 'print attributes', is sort of OK, as you can sort it out later (it is a little tricky to terminate on LF, CR, or CR+LF, but it can be sorted out).
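
To make the meta-data point concrete, here is a small C sketch that walks the raw bytes of a variable-length sequential file. It assumes the usual on-disk layout (a 2-byte little-endian byte count in front of each record, with records starting on even byte boundaries) and ignores any per-block markers; the function name is made up for illustration.

```c
#include <stddef.h>

/* Decode one variable-length record at offset off in a raw image of the
   file. Stores the data length in *len and returns the offset of the next
   record, or 0 when no complete record starts at off. */
size_t var_next_record(const unsigned char *buf, size_t size,
                       size_t off, size_t *len)
{
    size_t count;

    if (off > size || size - off < 2) return 0;

    /* 2-byte little-endian byte count stored in front of the data */
    count = (size_t)buf[off] | ((size_t)buf[off + 1] << 8);

    if (count > 32767) return 0;           /* not a valid record length */
    if (count > size - off - 2) return 0;  /* record runs past the buffer */

    *len = count;
    /* records start on even byte boundaries: skip a pad byte if needed */
    return off + 2 + count + (count & 1);
}
```

Once such a file is relabelled as stream or UDF, those count and pad bytes are read as if they were part of the data, which is why mixing variable-length and stream files this way gives garbage.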

In conclusion... you might not have a problem on the append side.
Next reply on the splitting.

Hein.


$ cre/fdl=nl: tmp_var.tmp
$ cre/fdl=tt: tmp_lf.tmp
record; format stream_lf;
$ appen/log tt: tmp_var.tmp
aap
noot
%APPEND-S-APPENDED, TNA78: appended to U$1:[HEIN]TMP_VAR.TMP;1 (2 records)
$ appen/log tt: tmp_lf.tmp
mies
teun
%APPEND-S-APPENDED, TNA78: appended to U$1:[HEIN]TMP_LF.TMP;1 (2 records)
$ cre/fdl=nl: tmp.tmp
$ append tmp_var.tmp tmp.tmp/log
%APPEND-S-APPENDED, U$1:[HEIN]TMP_VAR.TMP;1 appended to U$1:[HEIN]TMP.TMP;11 (2 records)
$ append tmp_lf.tmp tmp.tmp/log
%APPEND-W-INCOMPAT, U$1:[HEIN]TMP_LF.TMP;1 (input) and U$1:[HEIN]TMP.TMP;11 (output) have incompatible attributes
%APPEND-S-APPENDED, U$1:[HEIN]TMP_LF.TMP;1 appended to U$1:[HEIN]TMP.TMP;11 (2 records)
$ type tmp.tmp
aap
noot
mies
teun

So there was this "%APPEND-W-INCOMPAT".
But it did work as expected!

Wim Van den Wyngaert
Honored Contributor

Re: splitting up files with unknown record type

Nicely done, Hein!

You can also use
$ convert/append inputf outputf
That will read inputf and convert the records to the format of outputf and then append them.

Wim
Wim
Hein van den Heuvel
Honored Contributor

Re: splitting up files with unknown record type


From what I gather so far, in the end you just have Stream_LF records in a Stream_LF file, or at least you could have, if the first file is Stream_LF and you accept the warnings from APPEND.

So that should make it straightforward to read records in the file. If the individual records are in the (low) hundreds of bytes, then you can use DCL to split.
Be sure to pre-allocate the new output files and/or use large extents.

Supposedly you have lots of data, and thus I'd recommend a little home-grown program in the language of your choice (C, BASIC, Perl...). If you write a program, you can optimize the split a lot by using TRUNCATE after splitting:
1) read until limit
2) remember RFA
3) create next file
4) start copying
5) close next file
6) reposition input using RFA (quick!)
7) truncate original (quick!)
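
The steps above can be sketched in portable C. This is not the RMS version: a plain byte offset stands in for the RFA and the POSIX `truncate()` call stands in for the RMS truncate service, so treat it as an illustration of the idea only. The function name and file names are made up.

```c
#define _XOPEN_SOURCE 700   /* ask for the POSIX declaration of truncate() */
#include <stdio.h>
#include <unistd.h>

/* Copy everything from offset to end-of-file in path into tail_path, then
   cut path back to offset bytes. Returns 0 on success, -1 on any error. */
int split_tail(const char *path, const char *tail_path, long offset)
{
    char buf[4096];
    size_t n;
    int ok = 1;
    FILE *in;
    FILE *out;

    in = fopen(path, "rb");
    if (in == NULL) return -1;
    out = fopen(tail_path, "wb");              /* create next file */
    if (out == NULL) { fclose(in); return -1; }

    /* reposition to the remembered split point and copy the rest */
    if (fseek(in, offset, SEEK_SET) != 0) ok = 0;
    while (ok && (n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) ok = 0;
    }
    if (ferror(in)) ok = 0;

    fclose(in);
    if (fclose(out) != 0) ok = 0;              /* close next file */

    /* truncate the original only after the tail is safely written */
    if (ok && truncate(path, (off_t)offset) != 0) ok = 0;
    return ok ? 0 : -1;
}
```

Calling `split_tail("big.dat", "part_2.dat", offset)` leaves the first `offset` bytes in the original and moves the remainder to the new file; the offset has to sit on a record boundary for the parts to be usable.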


Somehow... and only you know this so far... you need to be able to tell the start of a new 'file' within the file.

You need to know the absolute max for split file sizes (OUT_MAX), and it is probably handy to know the size of the biggest appended files(ADD_MAX).

In DCL you might do something like (UNTESTED, NOT EVEN SYNTAX CHECKED!)

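$! UNTESTED sketch. OUT_MAX_BYTE, ADD_MAX_BYTE, OUT_MAX_LINE and ADD_MAX_LINE
$! must be defined beforehand. MARKER is a placeholder symbol holding whatever
$! text identifies the start of a new appended file.
$ MAX_BYTE = OUT_MAX_BYTE - ADD_MAX_BYTE
$ MAX_LINE = OUT_MAX_LINE - ADD_MAX_LINE
$ file = 1
$ need_a_break = 0
$ OPEN/READ in input.dat
$
$NEW_FILE_LOOP:
$ CLOSE/NOLOG out
$ CREATE/FDL=SYS$INPUT out_'file'.dat
FILE; ALLOCATION 10000; EXTENSION 5000;
RECORD; FORMAT STREAM_LF
$
$ OPEN/APPEND out out_'file'.dat
$ lines = 0
$ bytes = 0
$! Write the record that triggered the switch to this new file
$ IF need_a_break THEN WRITE/SYMBOL out record
$ file = file + 1
$ need_a_break = 0
$
$READ_LOOP:
$ READ/END=DONE in record
$ lines = lines + 1
$ bytes = bytes + F$LENGTH(record)
$ IF lines .GT. MAX_LINE .OR. bytes .GT. MAX_BYTE THEN need_a_break = 1
$ IF need_a_break
$ THEN
$   IF F$EXTRACT(0, F$LENGTH(marker), record) .EQS. marker THEN GOTO NEW_FILE_LOOP
$ ENDIF
$ WRITE/SYMBOL out record
$ GOTO READ_LOOP
$
$DONE:
$ CLOSE in
$ CLOSE/NOLOG out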


In perl (again, just a brain dump: UNTESTED) that might be

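# UNTESTED sketch. $OUT_MAX_* and $ADD_MAX_* must be set beforehand.
# The input file name "input.dat" is assumed (taken from the DCL version).
$MAX_BYTE = $OUT_MAX_BYTE - $ADD_MAX_BYTE;
$MAX_LINE = $OUT_MAX_LINE - $ADD_MAX_LINE;
$file = 1;
open (IN,"<input.dat") || die "input";
open (OU,">output_$file.dat") || die "output";

while (<IN>) {
    $bytes += length($_);
    # Below the limits: keep writing to the current output file
    if ($lines++ < $MAX_LINE && $bytes < $MAX_BYTE) {
        print OU;
        next;
    }
    # Over the limits: switch files at the next start-of-file marker
    if (/looks like a new file/) {
        close (OU);
        $file++;
        open (OU,">output_$file.dat") || die "output";
        $lines = $bytes = 0;
    }
    # Written either to the new file or, if no marker yet, to the current one
    print OU;
}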

hth,
Hein.








Peter Hofman
Frequent Advisor

Re: splitting up files with unknown record type

My original question (submitted from home) was based on what I remembered of the script (which I had not looked at for weeks) and on assumptions that turned out to be wrong when I got to work today.

The best way to split would be a small program instead of a script, especially since the file cannot be split just anywhere because of the ASN.1 encoded records in it. I don't want to end up with half records at the end or beginning of the parts.
A program is also the best solution, since the amount of data in the files is megabytes, and not just hundreds of bytes.
It would be a good exercise for me anyway to create a program like that.

Thanks all for your help.
Martin P.J. Zinser
Honored Contributor

Re: splitting up files with unknown record type

Hi Peter,

at least at some point in time Digital had it then (TM). Check

http://h18000.www1.hp.com/info/SP3290/SP3290PF.PDF

for ASN.1 support and toolsets to make writing your program easier. I am not exactly sure if you can get this product still...

Greetings, Martin
Hein van den Heuvel
Honored Contributor

Re: splitting up files with unknown record type

Martin,
Interesting find!

Peter,

"A program is also the best solution, since the amount of data in the files is megabytes, and not just hundreds of bytes.
It would be a good exercise for me anyway to create a program like that."

Sounds right by me. I don't know whether you ever tried 'perl', but even if you did not you may want to consider it for jobs like this.
My example outlined a couple replies back should be pretty close to what you need.
Just replace that "looks like a new file" with the correct regular expression to trigger on the start of a chunk of ASN.1. That, and fix my errors, because I did not try to run it.

With kind regards,
Hein