Operating System - HP-UX

Compare 2 files and remove duplicates- PERL

 
Anand_30
Regular Advisor

Hi,

I have 2 files, a and b. I need to compare them and remove from file 'b' every entry that also appears in file 'a', using a Perl script.

Can anyone please help me do this?

Thanks
Anand
10 REPLIES
A. Clay Stephenson
Acclaimed Contributor

Re: Compare 2 files and remove duplicates- PERL

Before you decide that Perl is the answer, why don't you try doing a "man join" and see if a couple of ideas inside your own head don't collide.
If it ain't broke, I can fix that.
Hein van den Heuvel
Honored Contributor

Re: Compare 2 files and remove duplicates- PERL

That task can be accomplished with standard grep -f, or with perl.

You really want to ask yourself lots of questions about the quality and quantity of the data:
- Megabytes or Gigabytes?
- Identifiable key field?
- Sorted?
- Any performance considerations?
- Once only, or repeatable and thus in need of serious error handling?

Anyway. With the terse question provided, I believe the answer is:

$ cat > a
aap
noot
mies
teun
$ cat > b
noot
vuur
kees
mies
$ grep -v -f b a
aap
teun
$ grep -v -f a b
vuur
kees

$ perl -e 'open A,shift; foreach (<A>){$a{$_}++}; open B,shift; foreach (<B>){print unless $a{$_}}' b a
aap
teun

$ perl -e 'open A,shift; foreach (<A>){$a{$_}++}; open B,shift; foreach (<B>){print unless $a{$_}}' a b
vuur
kees
$

Perl formatted:

open A,shift;           # open file A, using first element from @ARGV
foreach (<A>) {         # loop over file A
  $a{$_}++;             # use each record as key in associative array, incrementing the element (making it true)
}                       # end loop
open B,shift;           # open next file (needs an 'or die')
foreach (<B>) {         # loop over next file
  print unless $a{$_}   # print... unless array element with key from A is true (exists)
}                       # done

Cheers,
Hein van den Heuvel
HvdH Performance Consulting

Dennis Handly
Acclaimed Contributor

Re: Compare 2 files and remove duplicates- PERL

As Hein says, it depends on the sizes. If "a" and "b" are large and you don't care if you sort the file "b", you can sort both and use:
$ comm -13 a b

I just realized, if "b" has 2 lines that are exactly the same as "a", only one will be removed from the above command.

If "a" is small, "grep -vxf a b" would work.
(Hein should have used -x to match the whole line, when excluding.)
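
The duplicate-line caveat is easy to see on a tiny made-up data set (the file contents below are invented for illustration). A sketch, assuming a sort and comm that agree on collation, which is forced here with LC_ALL=C:

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL              # same collation for sort and comm
tmp=$(mktemp -d)
printf 'noot\naap\n'        > "$tmp/a"   # "noot" occurs once in a
printf 'noot\nkees\nnoot\n' > "$tmp/b"   # "noot" occurs twice in b

sort "$tmp/a" > "$tmp/a.sorted"
sort "$tmp/b" > "$tmp/b.sorted"

# comm pairs lines one-to-one, so the second "noot" in b survives.
comm_result=$(comm -13 "$tmp/a.sorted" "$tmp/b.sorted")
# grep drops every line of b that matches a whole line of a.
grep_result=$(grep -vxf "$tmp/a" "$tmp/b")

printf 'comm -13:\n%s\n' "$comm_result"
printf 'grep -vxf:\n%s\n' "$grep_result"
rm -rf "$tmp"
```

If the leftover duplicate matters, running the sorted files through sort -u first makes comm behave like the grep variant.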
Dennis Handly
Acclaimed Contributor

Re: Compare 2 files and remove duplicates- PERL

>(Hein should have used -x ...

Wait a minute. You were the one chiding people to be careful here with grep. :-)
I just copied that logic.
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1113590
Hein van den Heuvel
Honored Contributor

Re: Compare 2 files and remove duplicates- PERL

Good catch Dennis.

In my defense, although the data is not described at all, there is a suggestion that this is table-ish data, or something like a profile where full lines of data can be expected. But indeed, without a -x any bad input in file a, let's say just the letter 'a', can wipe out whole sections of file b.
I was thinking in perl terms, where $_ holds the data with its terminator, forcing whole-line compares.
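
A small sketch of that failure mode, with invented file contents: file a holds only the single letter 'a', and file b holds a few words plus that same letter on a line of its own.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf 'a\n'                   > "$tmp/a"
printf 'aap\nkaas\nvuur\na\n'  > "$tmp/b"

# Without -x the pattern "a" matches anywhere in a line,
# so every line of b containing an "a" is dropped.
without_x=$(grep -v -f "$tmp/a" "$tmp/b")

# With -x the pattern must match the whole line,
# so only the line that is exactly "a" is dropped.
with_x=$(grep -vxf "$tmp/a" "$tmp/b")

printf 'without -x:\n%s\n' "$without_x"
printf 'with -x:\n%s\n' "$with_x"
rm -rf "$tmp"
```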

You are also correct on the BNAME/SUFFIX confusion in the other topic you point to.

Cheers,
Hein.
Anand_30
Regular Advisor

Re: Compare 2 files and remove duplicates- PERL

Thanks for all the responses.

The 2 files are actually very large, each containing about 80 MB of data. Both files contain some IDs. The data is not sorted.

I want to know the IDs that are in the first file but not present in the second file.

Thanks,
Anand
Hein van den Heuvel
Honored Contributor

Re: Compare 2 files and remove duplicates- PERL

Did you try the simple grep -vxf?

[the -x is optional in this case. Right Dennis? :-) :-]

80MB would be a little more than I'd like to feed perl to remember, but it should work.

The best solution is probably to simply sort each file and use 'comm'. See man comm

#comm -23 a.sorted b.sorted

Hein.
Hein van den Heuvel
Honored Contributor

Re: Compare 2 files and remove duplicates- PERL

oops, no luck. That should be comm -13
moral:
- read the man page carefully
- trust but verify on a small file set
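
In the spirit of verifying on a small file set, here is a sketch that runs both comm variants on the aap/noot/mies sample from earlier in the thread. comm -23 prints the lines found only in the first file and comm -13 the lines found only in the second, so which one is wanted depends on which of the two questions asked in this thread is being answered. The .sorted file names follow the post above; LC_ALL=C is an assumption made so that sort and comm agree on ordering.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)
printf 'aap\nnoot\nmies\nteun\n' | sort > "$tmp/a.sorted"
printf 'noot\nvuur\nkees\nmies\n' | sort > "$tmp/b.sorted"

# -23 suppresses columns 2 and 3: what remains is unique to the first file.
only_in_a=$(comm -23 "$tmp/a.sorted" "$tmp/b.sorted")
# -13 suppresses columns 1 and 3: what remains is unique to the second file.
only_in_b=$(comm -13 "$tmp/a.sorted" "$tmp/b.sorted")

printf 'only in a:\n%s\n' "$only_in_a"
printf 'only in b:\n%s\n' "$only_in_b"
rm -rf "$tmp"
```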

Hein.
Peter Nikitka
Honored Contributor

Re: Compare 2 files and remove duplicates- PERL

Hi,

that's good:
>>
Both the files contain some IDs.
<<
You forgot to describe
- how to identify an ID
- if it's sufficient to extract the IDs of one file only
- if an ID is only part of a line or a whole line
- if checking vice versa is required
- what to do with lines NOT containing an ID

Assuming an ID is a string
IDnnn (n=0..9)
you can extract lines containing such an ID via
grep 'ID[0-9][0-9][0-9]' file_a
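
Going one step further under the same assumed IDnnn format, a sketch that pulls out just the ID from each line and lists the IDs present in the first file but missing from the second. The sample lines, the file_b name, and the extract_ids helper are all made up for illustration; lines without an ID are silently skipped, and only the last ID on a line is kept, which may or may not be what the real data needs.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)
printf 'user ID001 active\nuser ID002 idle\nno id on this line\n' > "$tmp/file_a"
printf 'ID002 seen\nID003 seen\n' > "$tmp/file_b"

# Hypothetical helper: print the sorted, de-duplicated IDs found in a file.
extract_ids() {
    sed -n 's/.*\(ID[0-9][0-9][0-9]\).*/\1/p' "$1" | sort -u
}

extract_ids "$tmp/file_a" > "$tmp/a.ids"
extract_ids "$tmp/file_b" > "$tmp/b.ids"

# IDs that occur in file_a but not in file_b.
missing=$(comm -23 "$tmp/a.ids" "$tmp/b.ids")
echo "$missing"
rm -rf "$tmp"
```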

This is just a start - but do the five parts above of your homework first.

mfG Peter
The Universe is a pretty big place, it's bigger than anything anyone has ever dreamed of before. So if it's just us, seems like an awful waste of space, right? Jodie Foster in "Contact"