Operating System - HP-UX

comparing files - not sorted

 
u856100
Frequent Advisor


hi,

I am trying to compare two large files (hence trying bdiff), but I think I am asking for a bit too much.

The files I want to compare are just lists of reference numbers, but because both are sorted, one extra record shifts everything after it and gives an inaccurate picture, e.g.:

file 1   file 2
------   ------
a1       a1
b1       b1
c1       b2
d1       c1
e1       d1
         e1

I want to find out which records exist in one file but not the other, regardless of order. If I use bdiff, four discrepancies will be flagged, when all I want to know is that b2 exists in one file and not the other. Is this possible, or am I asking too much? The actual files have 3 million entries.

thanks a lot
john
chicken or egg first?
Thierry Poels_1
Honored Contributor

Re: comparing files - not sorted

Hi,

3 million entries?
I would start thinking of loading this stuff into a database, create the necessary indexes and query the database for the required results.

good luck,
Thierry.
All unix flavours are exactly the same . . . . . . . . . . for end users anyway.
H.Merijn Brand (procura
Honored Contributor

Re: comparing files - not sorted

perl using a tied hash.

warning, not tested.

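#!/usr/bin/perl

use strict;
use warnings;

use DB_File;

# Keep the record counts in an on-disk hash so the keys need not all fit in memory.
# Remove any stale "f1_tie" from an earlier run first, or old keys will be reported.
tie my %f1, "DB_File", "f1_tie" or die "Cannot tie f1_tie: $!";

# Pass 1: count every record in f1.
@ARGV = ("f1");
while (<>) {
    chomp;
    $f1{$_}++;
}

# Pass 2: report each f2 record as common (=) or only in f2 (>).
@ARGV = ("f2");
while (<>) {
    chomp;
    if (exists $f1{$_}) {
        print "= $_ $f1{$_}\n";
        $f1{$_} = 0;    # mark as seen in f2
    }
    else {
        print "> $_\n";
    }
}

# Whatever still has a non-zero count was only in f1 (<).
# each() walks the tied hash without building the full key list.
while (my ($rec, $count) = each %f1) {
    $count or next;
    print "< $rec $count\n";
}
untie %f1;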
Enjoy, Have FUN! H.Merijn
John Palmer
Honored Contributor

Re: comparing files - not sorted

Hi,

If you do a unique sort of both files, 'comm' will tell you which records only appear in one file and not the other.

man comm

Regards,
John
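A minimal sketch of the sort-then-comm approach suggested above, using the sample records from the original post. The file names (file1, file2, and the .sorted copies) are assumptions for illustration, and LC_ALL=C is set only so that sort and comm agree on collation order.

```shell
# Sample data from the original post (file names are assumptions).
printf '%s\n' a1 b1 c1 d1 e1 > file1
printf '%s\n' a1 b1 b2 c1 d1 e1 > file2

# comm needs sorted input; -u also drops duplicate records.
LC_ALL=C sort -u file1 > file1.sorted
LC_ALL=C sort -u file2 > file2.sorted

# -23 keeps only column 1 (records only in file1),
# -13 keeps only column 2 (records only in file2).
only_in_file1=$(LC_ALL=C comm -23 file1.sorted file2.sorted)
only_in_file2=$(LC_ALL=C comm -13 file1.sorted file2.sorted)

echo "only in file1: ${only_in_file1:-<none>}"
echo "only in file2: ${only_in_file2:-<none>}"

# -3 shows both kinds of difference at once: records only in file2
# are indented by one tab, so this prints a tab followed by b2.
LC_ALL=C comm -3 file1.sorted file2.sorted
```

For the sample data only b2 is reported, which is the single discrepancy the original post was after. On very large inputs sort typically spills to temporary files, so the input does not have to fit in memory, but enough scratch disk space is needed.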
u856100
Frequent Advisor

Re: comparing files - not sorted

Hi John,

I've just tried a quick test using

file1   file2
-----   -----
a       c
b       d
c       b
a
a

When I run

$ comm -12 file1 file2

it gives me: c

But obviously a, b, and c appear in both files.

bit confused!

cheers
John

chicken or egg first?
John Palmer
Honored Contributor

Re: comparing files - not sorted

As stated in the man page, both files should be sorted, yours are not!

If you sort your example files, then run comm -12, you'll get the correct answer (b and c, a isn't in your second file!).

Actually, from your original post, you should be using comm -23 to list records in file1 and not file2.

Regards,
John
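The point about sorting can be checked with a short sketch on the test data. It assumes file1 holds a, b, c, a, a and file2 holds c, d, b, which is the reading consistent with the reply above; the .sorted file names are made up for illustration.

```shell
# Test data as read from the thread (an assumption, see above).
printf '%s\n' a b c a a > file1
printf '%s\n' c d b > file2

# Sort both files first; -u collapses the repeated "a".
LC_ALL=C sort -u file1 > file1.sorted   # a, b, c
LC_ALL=C sort -u file2 > file2.sorted   # b, c, d

# -12 suppresses columns 1 and 2, leaving records present in both files.
in_both=$(LC_ALL=C comm -12 file1.sorted file2.sorted)

# -23 suppresses columns 2 and 3, leaving records only in file1.
only_in_file1=$(LC_ALL=C comm -23 file1.sorted file2.sorted)

printf 'in both files:\n%s\n' "$in_both"
printf 'only in file1:\n%s\n' "$only_in_file1"
```

With sorted input, comm -12 reports b and c, and comm -23 reports a.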