Operating System - HP-UX

how can I compare lines by words. ...

 
someone_4
Honored Contributor


Hi everyone -

Is there a way to compare two files line by line?

For example I have filea.

dpd0
dpd6 mdmdr tgtd
dpd6 sd0r dpd6a fld0r

and fileb:

dpd0
dpd6 tgtd mdmdr1
dpd6 sd0r fld0r dpd6a

I have tried diff and sdiff, but the problem is that they report line 3 as different even though it is the same.

Both versions of line 3 contain the same words, just in a different order. The diff commands see them as different, and I would like them to be treated as the same.

Line 2 really is different, and for lines like that I would just like the line and its line number, so I can compare them against a spreadsheet that I have.

Of course the real files may be bigger, with longer lines and a different number of words per line. I have come to the conclusion that the shell may not do this very well, and I am having a hard time poking around with Perl.

Thanks
Richard
Hein van den Heuvel
Honored Contributor

Re: how can I compare lines by words. ...


This sounds much like the problem I helped with a few days ago:

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=998340

At the very least it will give you a good starting point for a perl script.

Hein.
someone_4
Honored Contributor

Re: how can I compare lines by words. ...

hmmmm

I know there are a couple of us here working on this. Because I asked around for help and no one has gotten back to me.

Let me take a look at that.

Richard
Hein van den Heuvel
Honored Contributor
Solution

Re: how can I compare lines by words. ...

Here is a working solution... but with assumptions.

It assumes the files are basically the same, and that only the contents of individual lines may vary.

You may need to sort the files first, and/or build in logic to read individual file records until they line up again.

It helps to know more about the files to do so.
For example... is that first word sort of a key, indicative of a zone in the file?

$ type 1.tmp
dpd0
dpd6 mdmdr tgtd
dpd6 sd0r dpd6a fld0r

$ type 2.tmp
dpd0
dpd6 tgtd mdmdr1
dpd6 sd0r fld0r dpd6a

$ perl tmp.pl 1.tmp 2.tmp
word mdmdr1 not found on line 2 in 2.tmp
word mdmdr not found on line 2 in 1.tmp
Files are NOT the same
$
$ type tmp.pl
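# Compare two files line by line, ignoring the order of the words on each line.
$f1 = shift @ARGV or die "Must provide first file name";
$f2 = shift @ARGV or die "Must provide second file name";
open F1, "<$f1" or die "Could not read file $f1";
open F2, "<$f2" or die "Could not read file $f2";
while (<F1>) {
    $line++;
    chomp;
    # Collect the words on this line of the first file.
    undef %words;
    foreach $word (split) {
        $words{$word}++;
    }
    # Read the corresponding line of the second file.
    $_ = <F2>;

    foreach $word (split) {
        if ($words{$word}) {
            # Word is on both lines: cross it off.
            delete $words{$word};
        } else {
            # Word is on the second file's line but not on the first file's.
            print "word $word not found on line $line in $f2\n";
            $not = " NOT";
        }
    }
    # Whatever is left was only on the first file's line.
    foreach $word (keys %words) {
        print "word $word not found on line $line in $f1\n";
        $not = " NOT";
    }
}
print "Files are${not} the same\n";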
$
Hein van den Heuvel
Honored Contributor

Re: how can I compare lines by words. ...

Afterthoughts...

If performance is a concern, the script above should probably first check for a simple match between the two lines before splitting them into words.
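One way to picture that fast path, sketched in POSIX shell rather than Perl: compare the two lines as plain strings first, and only sort the words when the strings differ. The names `words_sorted` and `word_diff` and the output format are made up for this sketch, and like the script above it assumes both files have the same number of lines (extra lines in the second file are ignored).

```shell
# Hypothetical helper: print the words of "$1" one per line, sorted.
# Runs in a subshell (note the parentheses) so that "set" does not leak out.
words_sorted() (
    set -f            # disable globbing while the shell splits the line
    set -- $1         # unquoted on purpose: split the line on whitespace
    if [ $# -gt 0 ]; then
        printf '%s\n' "$@" | LC_ALL=C sort
    fi
)

# Hypothetical driver: for every line number N where the two files do not
# contain the same words, print "N: <line of file 1> | <line of file 2>".
word_diff() {
    n=0
    while IFS= read -r a <&3 || [ -n "$a" ]; do
        IFS= read -r b <&4 || :
        n=$((n + 1))
        # Fast path: identical strings need no word splitting at all.
        [ "$a" = "$b" ] && continue
        # Slow path: same words in a different order still count as equal.
        [ "$(words_sorted "$a")" = "$(words_sorted "$b")" ] && continue
        printf '%s: %s | %s\n' "$n" "$a" "$b"
    done 3<"$1" 4<"$2"
}
```

Called as `word_diff filea fileb`, this only pays for the sort on lines that are not already byte-for-byte equal.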

Depending on the exact needs and data attributes, I would probably (re)write it as a 'diff' post-filter:

# diff f1 f2 | perl diff-word-compare.pl

diff would then do the bulk of the work, and the Perl filter would drop the sections that diff reports as different but that contain the same words, and so count as the same for the intended usage.
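A rough sketch of such a post-filter, written here as a shell function around awk instead of Perl. diff-word-compare.pl was never posted in this thread, so this is only a guess at the idea, not that script. It assumes diff's normal output format (headers such as `2,3c2,3`, then `< ` and `> ` lines); the function name `diff_word_filter` and the `line N:` report format are invented for the example. Change hunks with equally many old and new lines are compared pair by pair and only the pairs with different words are printed; every other hunk is passed through untouched.

```shell
# Hypothetical filter: reads normal-format diff output on stdin.
diff_word_filter() {
    awk '
    # Return the words of s joined in sorted order (insertion sort,
    # forced string comparison so numeric-looking words sort as text).
    function canon(s,    n, w, i, j, v, out) {
        n = split(s, w, " ")
        for (i = 2; i <= n; i++) {
            v = w[i]
            for (j = i - 1; j >= 1 && (w[j] "") > (v ""); j--) w[j + 1] = w[j]
            w[j + 1] = v
        }
        out = ""
        for (i = 1; i <= n; i++) out = out " " w[i]
        return out
    }
    # Emit the hunk collected so far, then reset the buffers.
    function emit(    i) {
        if (kind == "c" && nl == nr) {
            # Line numbers refer to the first file.
            for (i = 1; i <= nl; i++)
                if (canon(left[i]) != canon(right[i]))
                    printf "line %d:\n< %s\n> %s\n", first + i - 1, left[i], right[i]
        } else {
            for (i = 1; i <= nraw; i++) print raw[i]
        }
        nl = nr = nraw = 0
    }
    # Hunk header: finish the previous hunk and note start line and kind.
    /^[0-9]+(,[0-9]+)?[acd][0-9]+(,[0-9]+)?$/ {
        emit()
        first = $0 + 0
        kind = $0
        gsub(/[0-9,]/, "", kind)
    }
    { raw[++nraw] = $0 }
    /^< / { left[++nl] = substr($0, 3) }
    /^> / { right[++nr] = substr($0, 3) }
    END { emit() }
    '
}
```

It would be used as `diff filea fileb | diff_word_filter`. Pairing old and new lines by position inside a hunk is a simplification; a hunk where lines were also inserted or deleted is simply printed as diff produced it.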

fwiw,
Hein.
someone_4
Honored Contributor

Re: how can I compare lines by words. ...

Hi

would you mind taking this conversation offline?

Can you send me an email

richard@rleon.net
Cem Tugrul
Esteemed Contributor

Re: how can I compare lines by words. ...

Richard,
Once upon a time I had a similar need: I had approximately 2500 files and wanted to find out whether my directory contained any duplicates (files with the same contents). I found a script like this:
#more find_dup.pl
#!/opt/perl64/bin/perl
use strict;
use warnings;

use Digest::MD5 qw( md5_hex );
use Digest::SHA1 qw( sha1_hex );
use Cwd qw( getcwd );
use File::Find;

my @dir = @ARGV && -d $ARGV[0] ? @ARGV : getcwd;
my %arr;
find (sub {
    -f or return;
    local $/;
    open my $p, "< $_" or die "$_: $!\n";
    my $f = <$p>;
    my $sum = md5_hex ($f) . sha1_hex ($f);
    if (exists $arr{$sum}) {
        print "File $File::Find::name is the same as file $arr{$sum}\n";
        # unlink $_;
        return;
    }
    $arr{$sum} = $File::Find::name;
}, @dir);

Hope it helps!!!

Good Luck,
Our greatest duty in this life is to help others. And please, if you can't
Tor-Arne Nostdal
Trusted Contributor

Re: how can I compare lines by words. ...

Hi Richard
What you need is to sort the words on each line, then compare the lines.

I have made a POSIX shell script that might solve it for you.

#!/bin/sh

#Input files
FILEA="./filea"
FILEB="./fileb"

# Simple check to see if the files are identical
diff $FILEA $FILEB >/dev/null && {
    print "Files are identical"
    exit 0
}

# Use the larger of the two line counts as the loop limit
LINES_A="`cat $FILEA |wc -l`"
LINES_B="`cat $FILEB |wc -l`"
typeset -i LINES_A LINES_B
[[ $LINES_B -gt $LINES_A ]] && let LINES_A=$LINES_B

# Open files for read
exec 3<$FILEA
exec 4<$FILEB

let LINE=0
until [[ "$LINE" = "$LINES_A" ]]
do
    let LINE=${LINE}+1
    read -u3 LINEA
    read -u4 LINEB
    # Put one word per line, sort, and join the words back into one line
    SORTA="`echo $LINEA |tr ' ' '\012'| sort | tr '\012' ' '`"
    SORTB="`echo $LINEB |tr ' ' '\012'| sort | tr '\012' ' '`"
    print "Matching line : $LINE - \c"
    [[ "$SORTA" = "$SORTB" ]] && print "EQUAL" || print "DIFFER"
done

/Tor-Arne
I'm trying to become President of the state I'm in...
H.Merijn Brand (procura
Honored Contributor

Re: how can I compare lines by words. ...

# sort a line on words:
$line = join " " => sort split m/ / => $line, -1;

or in one blow

"@{[sort split/ /,$linea]}" eq "@{[sort split/ /,$lineb]}" and print "Lines are equal\n";

Enjoy, Have FUN! H.Merijn
Tor-Arne Nostdal
Trusted Contributor

Re: how can I compare lines by words. ...

If you can't sort this out ;-) try:
http://www.perl.com/doc/FMTEYEWTK/sort.html

F = Far
M = More
T = Than
E = Everything
Y = You
E = Ever
W = Want
T = To
K = Know

- about sort

/Tor-Arne
someone_4
Honored Contributor

Re: how can I compare lines by words. ...

Great help :)

Thanks everyone