Compare 2 files and remove duplicates- PERL

Anand_30 · ‎03-29-2007

Hi,

I have 2 files, a and b. I need to compare these 2 files and remove every entry from file 'b' that also appears in file 'a', using a Perl script.

Can anyone please help me do this?

Thanks
Anand

A. Clay Stephenson · ‎03-29-2007

Before you decide that Perl is the answer, why don't you try doing a "man join" and see if a couple of ideas inside your own head don't collide.

If it ain't broke, I can fix that.

Hein van den Heuvel · ‎03-29-2007

That task can be accomplished with standard grep -f, or with perl.

You really want to ask yourself lots of questions about the quality and quantity of the data:
- Megabytes or gigabytes?
- Identifiable key field?
- Sorted?
- Any performance considerations?
- Once only, or repeatable and thus in need of serious error handling?

Anyway. With the terse question provided, I believe the answer is:

$ cat > a
aap
noot
mies
teun
$ cat > b
noot
vuur
kees
mies
$ grep -v -f b a
aap
teun
$ grep -v -f a b
vuur
kees

$ perl -e 'open A,shift; foreach (<A>){$a{$_}++}; open B,shift; foreach (<B>){print unless $a{$_}}' b a
aap
teun

$ perl -e 'open A,shift; foreach (<A>){$a{$_}++}; open B,shift; foreach (<B>){print unless $a{$_}}' a b
vuur
kees
$

Perl formatted:

open A,shift;          # open file A, using first element from @ARGV
foreach (<A>) {        # loop over file A
  $a{$_}++;            # use each record as key in associative array, incrementing the element (making it true)
}                      # end loop
open B,shift;          # open next file (needs an 'or die')
foreach (<B>) {        # loop over next file
  print unless $a{$_}  # print... unless array element with key from A is true (exists)
}                      # done
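
The same idea can be written as a small script with the 'or die' checks added. This is only a sketch under a few assumptions: the sub names load_lines and lines_not_in and the script name in the usage comment are made up for illustration, and lines are chomped so that a final line without a newline still compares equal.

```perl
#!/usr/bin/perl
# Sketch: print the lines of one file that do not occur in another.
use strict;
use warnings;

# Return a hash ref whose keys are the lines read from $fh (newline removed).
sub load_lines {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;
        $seen{$line} = 1;
    }
    return \%seen;
}

# Return the lines from $fh that are not keys of %$seen, in original order.
sub lines_not_in {
    my ($fh, $seen) = @_;
    my @out;
    while (my $line = <$fh>) {
        chomp $line;
        push @out, $line unless $seen->{$line};
    }
    return @out;
}

# Usage (hypothetical script name): perl notin.pl EXCLUDE_FILE INPUT_FILE
if (@ARGV == 2) {
    my ($exclude, $input) = @ARGV;
    open my $ex, '<', $exclude or die "Cannot open $exclude: $!";
    my $seen = load_lines($ex);
    close $ex;
    open my $in, '<', $input or die "Cannot open $input: $!";
    print "$_\n" for lines_not_in($in, $seen);
    close $in;
}
```

Called with the files a and b from the example above (a first, then b), it should print the lines of b that are not in a. Only the first file is held in memory; the second is streamed.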

Cheers,
Hein van den Heuvel
HvdH Performance Consulting

Dennis Handly · ‎03-29-2007

As Hein says, it depends on the sizes. If "a" and "b" are large and you don't care if you sort the file "b", you can sort both and use:
$ comm -13 a b

I just realized: if "b" has 2 identical lines that match a single line in "a", only one of them will be removed by the above command.

If "a" is small, "grep -vxf a b" would work.
(Hein should have used -x to match the whole line, when excluding.)

Dennis Handly · ‎03-29-2007

>(Hein should have used -x ...

Wait a minute. You were the one chiding people to be careful here with grep. :-)
I just copied that logic.
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1113590

Hein van den Heuvel · ‎03-29-2007

Good catch Dennis.

In my defense, although the data is not described at all, there is a suggestion that this is table-ish data, or something like a profile, where full lines of data can be expected. But indeed, without a -x, any bad input in file a, let's say just the letter 'a', can wipe out whole sections of file b.
I was thinking in perl terms where $_ has the data with terminator, forcing whole-line compares.
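
To make the difference concrete, here is a small Perl sketch contrasting the two behaviours. It is an illustration only: the sub names and sample data are invented, and the substring version uses fixed strings via index(), whereas grep -f without -F treats each line of the pattern file as a regular expression.

```perl
use strict;
use warnings;

my @patterns = ('a');                        # a stray one-letter entry in file a
my @lines    = qw(aap noot mies vuur kees);  # sample contents of file b

# Substring semantics (roughly grep -v -f without -x):
# drop a line if any pattern occurs anywhere inside it.
sub keep_substring {
    my ($patterns, $lines) = @_;
    my @kept;
    LINE: for my $line (@$lines) {
        for my $pat (@$patterns) {
            next LINE if index($line, $pat) >= 0;
        }
        push @kept, $line;
    }
    return @kept;
}

# Whole-line semantics (grep -vx, or a hash lookup on the full line):
# drop a line only if it is identical to some pattern.
sub keep_whole_line {
    my ($patterns, $lines) = @_;
    my %seen = map { $_ => 1 } @$patterns;
    return grep { !$seen{$_} } @$lines;
}

print "substring:  ", join(' ', keep_substring(\@patterns, \@lines)), "\n";
print "whole line: ", join(' ', keep_whole_line(\@patterns, \@lines)), "\n";
```

With this sample data the substring version drops "aap" because it contains an 'a', while the whole-line version keeps all five lines.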

You are also correct on the BNAME/SUFFIX confusion in the other topic you point to.

Cheers,
Hein.

Anand_30 · ‎03-30-2007

Thanks for all the responses.

The 2 files are actually very large each containing about 80 MB of data. Both the files contain some IDs. The data is not sorted.

I want to know the IDs that are in the first file but not present in the second file.

Thanks,
Anand

Hein van den Heuvel · ‎03-30-2007

Did you try the simple grep -vxf?

[the -x is optional in this case. Right Dennis? :-) :-]

80MB would be a little more than I'd like to feed perl to remember, but it should work.

The best solution is probably to simply sort each file and use 'comm'. See man comm

#comm -23 a.sorted b.sorted

Hein.

Hein van den Heuvel · ‎03-30-2007

oops, no luck. That should be comm -13
moral:
- read the man page carefully
- trust but verify on a small file set
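
For anyone who wants to stay in Perl but avoid holding 80 MB in a hash, the sort-then-compare idea can be sketched as a merge over two already sorted inputs. The sketch reports lines that occur only in the first input, which is the requirement as Anand restated it. Assumptions: both inputs are sorted in plain byte order (what sort gives under LC_ALL=C), and the names next_line and only_in_first are invented for illustration.

```perl
use strict;
use warnings;

# Read one line from $fh and strip its newline; returns undef at end of file.
sub next_line {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# Walk two sorted inputs in step and call $emit->($line) for every line
# that occurs only in the first input. Only one line per file is in memory.
sub only_in_first {
    my ($fh1, $fh2, $emit) = @_;
    my $l1 = next_line($fh1);
    my $l2 = next_line($fh2);
    while (defined $l1) {
        if (!defined $l2 || $l1 lt $l2) {
            $emit->($l1);             # nothing left in input 2 can match
            $l1 = next_line($fh1);
        }
        elsif ($l1 gt $l2) {
            $l2 = next_line($fh2);    # input 2 is behind; advance it
        }
        else {
            $l1 = next_line($fh1);    # same line in both: skip this pair
            $l2 = next_line($fh2);
        }
    }
    return;
}

# Usage (hypothetical script name): perl onlyfirst.pl a.sorted b.sorted
if (@ARGV == 2) {
    open my $f1, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $f2, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    only_in_first($f1, $f2, sub { print "$_[0]\n" });
}
```

Equal lines are paired off one to one, so a line that appears twice in the first input and once in the second is still reported once. That is the duplicate behaviour Dennis pointed out earlier in the thread.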

Hein.

Peter Nikitka · ‎03-30-2007

Hi,

that's good:
>>
Both the files contain some IDs.
<<
You forgot to describe
- how to identify an ID
- if it's sufficient to extract the IDs of one file only
- if an ID is only part of a line or a whole line
- if checking vice versa is required
- what to do with lines NOT containing an ID

Assuming an ID is a string
IDnnn (n=0..9)
you can extract lines containing such an ID via
grep 'ID[0-9][0-9][0-9]' file_a

This is just a start - but do the five parts above of your homework first.
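
Under the same IDnnn assumption, a Perl sketch of the next step could collect the IDs from each file and report those found only in the first. The ID format is just the guess from above, the sub names are invented, and lines without an ID are silently ignored, which may or may not be what is wanted.

```perl
use strict;
use warnings;

# Collect every IDnnn token (ID followed by three digits) found in $fh.
# An ID may appear anywhere in a line; lines without one contribute nothing.
sub collect_ids {
    my ($fh) = @_;
    my %ids;
    while (my $line = <$fh>) {
        $ids{$_} = 1 for $line =~ /ID[0-9]{3}/g;
    }
    return \%ids;
}

# Return the IDs that are keys of %$first but not of %$second, sorted.
sub ids_only_in_first {
    my ($first, $second) = @_;
    my @only = grep { !exists $second->{$_} } keys %$first;
    return sort @only;
}

# Usage (hypothetical script name): perl idsdiff.pl file_a file_b
if (@ARGV == 2) {
    open my $fa, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $fb, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    print "$_\n" for ids_only_in_first(collect_ids($fa), collect_ids($fb));
}
```

Like the grep pattern, the regular expression is not anchored, so a longer token such as ID1234 would be recorded as ID123. Tighten the pattern once the real ID format is known.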

mfG Peter

The Universe is a pretty big place, it's bigger than anything anyone has ever dreamed of before. So if it's just us, seems like an awful waste of space, right? Jodie Foster in "Contact"
