Script question - comm limits

SOLVED
Raynald Boucher
Super Advisor

Hello all.

I used "comm -23 file_a file_b" successfully with files containing around 4500 records.

The statement fails however with our production files containing 1.5 million entries each.

Are there some published limitations for this process?

Thanks

RayB
11 REPLIES
Tingli
Esteemed Contributor

Re: Script question - comm limits

man comm will give you the file size limit.
Hein van den Heuvel
Honored Contributor

Re: Script question - comm limits


Hmmm,

Does the man page really indicate a memory limit?

There is none mentioned in:
http://docs.hp.com/en/B2355-90689/comm.1.html

There is also no reason to expect a memory limitation. It just loops and compares.

The algorithm surely only ever has to look at 1 record from each file, as they have to be in order.

For each pair of records it can determine what to do: print in column 1, 2 or 3, and read next from file 1, read next from file 2, or read next from both.
No reason to remember anything really.
(Okay, maybe an extra line buffer to deal with duplicate input records.)

fwiw,
Hein (not online right now)
TwoProc
Honored Contributor

Re: Script question - comm limits

Just a guess - but the man page says that it assumes "file1 and file2 are ordered ...".

Have you run "sort" on both file1 and file2 before you ran the "comm" command on them?

We are the people our parents warned us about --Jimmy Buffett
Dennis Handly
Acclaimed Contributor

Re: Script question - comm limits

>TwoProc: but the man page says that it assumes "file1 and file2 are ordered ...".

If they are not sorted, you get unpredictable results. Since Raynald didn't explain what "failed" meant, that could be it.
Raynald Boucher
Super Advisor

Re: Script question - comm limits

Hello all,
For supplementary information, both files contain single-field entries of 2 to 7 digits followed by ".doc" (e.g. 23456.doc) and have been sorted using "sort -n".
Their list looks like 12.doc, 14.doc, 123.doc, 137.doc, 1234.doc etc

"comm -23 file_a file_b" returns lines which exist in both files.
It works perfectly on smaller files.

LC_COLLATE and LANG variables are not set.

RayB
Hakki Aydin Ucar
Honored Contributor

Re: Script question - comm limits

"comm -23 file_a file_b" returns lines which exist in both files.
->>

correction:

Options 1, 2, or 3 suppress printing of the corresponding column.
Thus comm -12 prints only the lines common to the two files, so you have to use
comm -12 file_a file_b to get the lines present in both files.
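
A minimal sketch of the three column-suppression combinations, assuming two tiny stand-in files that are already in byte order. The file contents and variable names are made up for illustration, and LC_ALL=C is pinned only so that the example does not depend on the local collation settings.

```shell
# Two small inputs, already in byte (LC_ALL=C) order. Hypothetical data.
tmp=$(mktemp -d)
printf '%s\n' 12.doc 123.doc 14.doc > "$tmp/file_a"
printf '%s\n' 123.doc 137.doc       > "$tmp/file_b"

# -23 suppresses columns 2 and 3: only lines unique to file_a remain.
only_a=$(LC_ALL=C comm -23 "$tmp/file_a" "$tmp/file_b")
# -13 suppresses columns 1 and 3: only lines unique to file_b remain.
only_b=$(LC_ALL=C comm -13 "$tmp/file_a" "$tmp/file_b")
# -12 suppresses columns 1 and 2: only lines present in both remain.
both=$(LC_ALL=C comm -12 "$tmp/file_a" "$tmp/file_b")

printf 'only in file_a:\n%s\n' "$only_a"
printf 'only in file_b:\n%s\n' "$only_b"
printf 'in both:\n%s\n' "$both"
rm -rf "$tmp"
```

So "comm -23" is the right call for "in file_a but not in file_b", and "comm -12" is the one for "in both".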
Dennis Handly
Acclaimed Contributor
Solution

Re: Script question - comm limits

>have been sorted using "sort -n". Their list looks like 12.doc, 14.doc, 123.doc, 137.doc, 1234.doc etc

This is NOT sorted. comm(1) requires them to be in collating order by chars. I.e. no sort options. This is the "proper" ASCII order:
12.doc 123.doc 1234.doc 137.doc 14.doc
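
The difference between the two orders can be reproduced with the five sample names from the thread. This is only a sketch: LC_ALL=C is set explicitly as an assumption so that plain "sort" compares bytes regardless of the environment, and the variable names are illustrative.

```shell
# The sample names from the thread, one per line.
list='12.doc
14.doc
123.doc
137.doc
1234.doc'

# Numeric sort: compares the leading number of each line.
numeric=$(printf '%s\n' "$list" | LC_ALL=C sort -n | paste -s -d ' ' -)
# Plain sort in the C locale: compares the lines byte by byte.
bytewise=$(printf '%s\n' "$list" | LC_ALL=C sort | paste -s -d ' ' -)

echo "sort -n: $numeric"
echo "sort   : $bytewise"
```

The byte order puts "12.doc" ahead of "123.doc" because "." compares lower than any digit in ASCII.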

>Hakki: correction:

You read it wrong. RayB was saying that since he got lines that existed in both files, it was broken.
TwoProc
Honored Contributor

Re: Script question - comm limits

Raynald, I think Dennis is saying to have them sorted, but without the "-n" argument to the sort commands.

I realize that it worked on smaller files, but I was thinking that the heuristic would have a whole lot less to keep tabs on if the input is sorted than if it is not. That is, what if line 1 matches line 1.5 million? The program may try to keep those two lines, and everything in between, in memory, and die. Whereas, if the sort was done before, the start and end pointers in both files that correlate to active memory locations would be much closer together, and the working set thus possibly much smaller.

This would be somewhat analogous to the reason one uses the "-depth" option in find commands for large directory trees: you keep fewer file tree locations in memory in an unresolved state. Basically, keeping fewer things to process at any one point in time.

So, I was just thinking that "it might be" that the memory error is just a case of the program having too many "open issues" in process, and not enough closed ones, on the really large files. In fact, it may be that *this* is the only reason why it would like the data sorted up front - just so that the algorithm has fewer "balls in the air" at any one time, and therefore won't run out of working space. My guess is that this theory is probably correct, since you've indicated that the program works on small files, even though, according to Dennis, you've probably performed the sorts wrong. That is, the program only needs the data sorted to reduce the amount of workspace it needs to resolve the issues in one pass; otherwise, it would probably have to make multiple passes over the files with some creative block swapping to approach a result with huge files.

Just thinking about how I would have written the comm program: the first thing I'd want is a guarantee that the input is sorted in some standard way that I could count on in advance. Otherwise, I could envision memory handling getting out of control on large sets.

Try it alpha sorted as Dennis suggested and see. Maybe, maybe not - it's a possibility!
We are the people our parents warned us about --Jimmy Buffett
Dennis Handly
Acclaimed Contributor

Re: Script question - comm limits

>TwoProc: I think Dennis is saying to have them sorted, but without the "-n" argument to the sort commands.

Didn't I say that? :-)

>I was thinking that the heuristic would have a whole lot less to keep tabs on if the input is sorted than if it is not.

If you don't want it sorted, you use diff(1).
The algorithm is very simple, as Hein says: you look at two records and write the one that compares lower, or both if equal.

>the algorithm has fewer "balls in the air" at any one time

Both balls are on the table in plain sight.
There are two FILE* buffers and two line buffers of LINE_MAX.
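
The two-record loop can be sketched in plain shell. This is not the real comm(1) source, just an illustration of the idea for the "-23" case. The function name comm_23_sketch is made up, and it assumes that both inputs are in byte order and that every line ends with a newline. Only one line per file ($a and $b) is ever held.

```shell
# comm_23_sketch FILE1 FILE2: print the lines found only in FILE1.
# Hypothetical helper; both inputs must already be in byte (LC_ALL=C) order.
comm_23_sketch() {
    {
        have_a=1
        have_b=1
        IFS= read -r a <&3 || have_a=0
        IFS= read -r b <&4 || have_b=0
        while [ "$have_a" -eq 1 ]; do
            if [ "$have_b" -eq 0 ]; then
                # FILE2 is exhausted: everything left in FILE1 is unique.
                printf '%s\n' "$a"
                IFS= read -r a <&3 || have_a=0
            elif [ "$a" = "$b" ]; then
                # Line is in both files: print nothing, advance both.
                IFS= read -r a <&3 || have_a=0
                IFS= read -r b <&4 || have_b=0
            elif LC_ALL=C expr "x$a" \< "x$b" >/dev/null; then
                # FILE1 line sorts lower: it cannot appear later in FILE2.
                printf '%s\n' "$a"
                IFS= read -r a <&3 || have_a=0
            else
                # FILE2 line sorts lower: skip it.
                IFS= read -r b <&4 || have_b=0
            fi
        done
    } 3<"$1" 4<"$2"
}
```

Called as "comm_23_sketch file_a file_b", it needs the same ordering as comm does: if a line in file_a sorts lower than the current line of file_b, the loop assumes it cannot show up further down in file_b, which is exactly what breaks with "sort -n" input.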

>Try it alpha sorted as Dennis suggested and see. Maybe, maybe not - it's a possibility!

("No! Try not. Do, or do not. There is no try." :-)
RayB gave the file content and they failed the "sort -c file" check.
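
A sketch of that check as a guard in front of comm, using the sample names from the thread. The helper name is_comm_ready and the file names are made up, and LC_ALL=C is an assumption to force byte-order comparison.

```shell
tmp=$(mktemp -d)
# The list as RayB had it (sort -n order), and the same list re-sorted.
printf '%s\n' 12.doc 14.doc 123.doc 137.doc 1234.doc > "$tmp/numeric.txt"
LC_ALL=C sort "$tmp/numeric.txt" > "$tmp/bytewise.txt"

# Hypothetical helper: succeeds only if the file is already in byte order.
is_comm_ready() {
    LC_ALL=C sort -c "$1" 2>/dev/null
}

if is_comm_ready "$tmp/numeric.txt"; then numeric_ok=yes; else numeric_ok=no; fi
if is_comm_ready "$tmp/bytewise.txt"; then bytewise_ok=yes; else bytewise_ok=no; fi

echo "numeric.txt ready for comm:  $numeric_ok"
echo "bytewise.txt ready for comm: $bytewise_ok"
rm -rf "$tmp"
```

"sort -c" only reads the file and sets its exit status, so it is cheap to run before comm even on 1.5 million lines.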


Hein van den Heuvel
Honored Contributor

Re: Script question - comm limits

I had an existing Perl script to compare lines, based on a key value first, and the whole line next, printing matching lines.

It can easily be adapted to use different key functions, or different outputs (non-matching)

Here it is, using the first sequence of numbers on a line as keys.

--------------- comm_12_numeric.pl ---------
#
# Look for matching lines based on a key value.
# Both inputs must be in numeric order on that key (e.g. sort -n).
#
# Open files
#
$name = shift @ARGV or die "Must provide first filename";
open F1, "<$name" or die "Could not read file $name";
$name = shift @ARGV or die "Must provide second filename";
open F2, "<$name" or die "Could not read file $name";


my ($f1, $f2, $k1, $k2);

# Read a line from F1 into global $f1, and return its key value.
# Exits the script at end of file.
sub k1() {
    $f1 = <F1>;
    exit unless defined ($f1);
    $f1 =~ m/^(\d+)/;
    return $1;
}

# Read a line from F2 into global $f2, and return its key value.
# Exits the script at end of file.
sub k2() {
    $f2 = <F2>;
    exit unless defined ($f2);
    $f2 =~ m/^(\d+)/;
    return $1;
}

# Prime both keys, then advance whichever side has the lower key.
$k1 = &k1;
$k2 = &k2;

while ( 1 ) {
    if ($k1 == $k2) {
        print $f1 if ($f1 eq $f2);
        $k1 = &k1;
        $k2 = &k2;
    } else {
        if ($k1 > $k2) {
            $k2 = &k2 while $k1 > $k2;
        } else {
            $k1 = &k1 while $k2 > $k1;
        }
    }
}
-----------------


For the sake of completeness, here is a Perl equivalent of 'comm -12', printing matching lines ordered using the whole line. Note how only 2 string variables and 2 file variables are used.

---------------- comm_12_text.pl ---------
#
# Open files
#
$name = shift @ARGV or die "Must provide first filename";
open F1, "<$name" or die "Could not read file $name";
$name = shift @ARGV or die "Must provide second filename";
open F2, "<$name" or die "Could not read file $name";

# Only one line from each file is held at a time.
my $f1 = <F1>;
my $f2 = <F2>;
while (defined ($f1) && defined ($f2)) {
    if ($f1 eq $f2) {
        print $f1;
        $f1 = <F1>;
        $f2 = <F2>;
    } else {
        if ($f1 gt $f2) {
            $f2 = <F2> while defined ($f2) && $f1 gt $f2;
        } else {
            $f1 = <F1> while defined ($f1) && $f2 gt $f1;
        }
    }
}
----------

Cheers,
Hein.
Raynald Boucher
Super Advisor

Re: Script question - comm limits

Hello all,

The solution is to use sort with no options on your source files.
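
As a sketch of that fix, with small stand-in lists in place of the production files: the contents and the ".sorted" file names are made up for illustration, and LC_ALL=C is pinned as an assumption so that sort and comm agree on byte order whatever the environment is.

```shell
tmp=$(mktemp -d)
# Stand-ins for the production lists, in "sort -n" order as in the thread.
printf '%s\n' 12.doc 14.doc 123.doc 137.doc 1234.doc > "$tmp/file_a"
printf '%s\n' 14.doc 123.doc 1234.doc                > "$tmp/file_b"

# Re-sort with no options so the lines are in collating (byte) order.
LC_ALL=C sort "$tmp/file_a" > "$tmp/file_a.sorted"
LC_ALL=C sort "$tmp/file_b" > "$tmp/file_b.sorted"

# Lines that exist in file_a but not in file_b.
only_a=$(LC_ALL=C comm -23 "$tmp/file_a.sorted" "$tmp/file_b.sorted")
printf '%s\n' "$only_a"
rm -rf "$tmp"
```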

To confirm,
1- I resorted the files and it worked properly.
2- I took a subset of my original files (first 2000 lines presorted with sort -n) to test with reduced numbers; comm failed.

comm succeeded with my test data because all my test data starts with the same string.
I extracted all entries matching '211[0-9]*.doc' to reduce numbers but that made the numeric sort match the alpha sort.
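
A small sketch of why a subset like that can hide the problem. It assumes the extracted entries all had the same number of digits, which is not stated above; the sample names and the helper orders_agree are made up. With equal widths the numeric and byte orders coincide, while with mixed widths they do not.

```shell
# Hypothetical samples: same digit count vs. mixed digit counts.
same_width='2110100.doc
2110002.doc
2110010.doc
2110001.doc'
mixed_width='2111.doc
21100.doc
2110.doc'

# Succeeds when "sort -n" and plain byte-order sort give the same result.
orders_agree() {
    numeric=$(printf '%s\n' "$1" | LC_ALL=C sort -n)
    bytewise=$(printf '%s\n' "$1" | LC_ALL=C sort)
    [ "$numeric" = "$bytewise" ]
}

if orders_agree "$same_width"; then same_result=agree; else same_result=differ; fi
if orders_agree "$mixed_width"; then mixed_result=agree; else mixed_result=differ; fi

echo "same width:  $same_result"
echo "mixed width: $mixed_result"
```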

Thanks all

RayB