Operating System - HP-UX

Finding the same content in two files.

 
SOLVED
sudhapage
Regular Advisor

Hi all,

I want to find (or list) the content that two files have in common.

Example: there are two files, Test1 and Test2. If both files contain the same word, such as "example", then running a command like grep, diff, or cmp should output "example" along with every other word the two files share.

Which command will give me this?

* It should also work on Solaris.

Regards,
Sudhakaran.K
18 REPLIES
James R. Ferguson
Acclaimed Contributor

Re: Finding the same content in two files.

Hi:

By "word" I presume that you mean a whitespace (space, tab or newline) delimited string of characters.

If that's the case, you can create a simple Perl or awk script using hashes (associative arrays) to hold every word of each file. Depending upon what it is that you want to report, process the hashes accordingly.
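
As a rough illustration of the hash idea, here is a minimal awk sketch wrapped in a shell function. The name common_words is made up for this example, and it assumes a POSIX awk (on Solaris that would be nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk). It also assumes the first file is not empty, since the NR == FNR test cannot tell the two files apart otherwise.

```shell
#!/bin/sh
# Sketch only: common_words is a hypothetical helper name, not from this thread.
# Prints each whitespace-delimited word of file2 that also occurs in file1.
common_words() {
    awk '
        # First file: remember every word as a key of the array "seen".
        NR == FNR { for (i = 1; i <= NF; i++) seen[$i] = 1; next }
        # Second file: print each word that was seen, but only once.
        {
            for (i = 1; i <= NF; i++)
                if (($i in seen) && !($i in printed)) {
                    print $i
                    printed[$i] = 1
                }
        }
    ' "$1" "$2"
}
```

Invoked as "common_words file1 file2", it lists the shared words in the order they first appear in file2.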

Regards!

...JRF...
sudhapage
Regular Advisor

Re: Finding the same content in two files.

Hi James,

I want the output without spaces, newlines, or tabs.

If the two files share any content, it should be shown.

Regards,
Sudhakaran.K
Sandman!
Honored Contributor

Re: Finding the same content in two files.

Are the common words you want complete words, words embedded within other words or strings, or both? For example, the word "for" occurs as a separate word on line 1 and as part of the string "therefore" on line 2:

now is the time for all
therefore
Peter Godron
Honored Contributor

Re: Finding the same content in two files.

Hi,
the way I read the request is:
1. Create an index of all the individual words in file1
2. Create an index of all the individual words in file2
3. Compare the two indexes and report on what words are common or not found

If this is not correct, could you please supply some example file1 and file2 and the expected output.
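
The three steps above could be sketched roughly as follows, assuming "word" means a space- or tab-delimited string and that tr, sort and comm behave as POSIX describes. The names word_index and common_index are invented for this sketch.

```shell
#!/bin/sh
# Sketch only: word_index and common_index are hypothetical helper names.

# Steps 1 and 2: build a sorted, duplicate-free index of the words in one file.
word_index() {
    # Turn every space or tab into a newline (-s squeezes runs of newlines),
    # drop any empty line that is left over, then sort and remove duplicates.
    tr -s ' \t' '\n\n' < "$1" | sed '/^$/d' | sort -u
}

# Step 3: compare the two indexes and print the words common to both files.
common_index() {
    idx=${TMPDIR:-/tmp}/index.$$
    mkdir "$idx" || return 1
    word_index "$1" > "$idx/one"
    word_index "$2" > "$idx/two"
    comm -12 "$idx/one" "$idx/two"
    rm -rf "$idx"
}
```

Replacing "comm -12" with "comm -23" would instead report the words found only in the first file.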
Christian Tremblay
Trusted Contributor

Re: Finding the same content in two files.

Take a look at the comm command; it may do what you need quite simply.
(The files have to be sorted, though.)

comm - select or reject lines common to two sorted files

For details, see man comm.

Chris
Peter Nikitka
Honored Contributor

Re: Finding the same content in two files.

Hi,

Here is an example using 'comm'; 'same' is taken here to mean 'same line'.

sort test1 >test1.sorted
sort test2 | comm -12 test1.sorted -

You will get all common lines.
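
To show what the -12 does, here is a small sketch that prints all three of comm's columns separately. With no options comm prints lines only in the first file, lines only in the second, and lines in both; each digit suppresses one of those columns. The function name compare_lines and the temporary file names are made up for this example.

```shell
#!/bin/sh
# Sketch only: compare_lines is a hypothetical helper name.
compare_lines() {
    s1=${TMPDIR:-/tmp}/cmp1.$$
    s2=${TMPDIR:-/tmp}/cmp2.$$
    # comm needs both inputs sorted
    sort "$1" > "$s1"
    sort "$2" > "$s2"
    echo "both:"
    comm -12 "$s1" "$s2"      # suppress columns 1 and 2, keep common lines
    echo "only in $1:"
    comm -23 "$s1" "$s2"      # suppress columns 2 and 3
    echo "only in $2:"
    comm -13 "$s1" "$s2"      # suppress columns 1 and 3
    rm -f "$s1" "$s2"
}
```

Run it as "compare_lines test1 test2".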

mfG Peter
The Universe is a pretty big place, it's bigger than anything anyone has ever dreamed of before. So if it's just us, seems like an awful waste of space, right? Jodie Foster in "Contact"
sudhapage
Regular Advisor

Re: Finding the same content in two files.

Hi

Let me explain exactly what I want.

_________________________________________

File "a" contents:

apple
orange
egg
_________________________________________

File "b" contents:

sun
orange
space
_________________________________________

My output should be: "orange"
_________________________________________

So I want to see all the words that match between the two files.

Regards,
Sudhakaran.K
Andrew Cowan
Honored Contributor

Re: Finding the same content in two files.

Could you not just sort both files and then run them through diff?

sort file1 > f1
sort file2 > f2
diff f1 f2

You could also use "sort -u" to remove duplicates.
sudhapage
Regular Advisor

Re: Finding the same content in two files.

Hi Andrew,

Using the diff command after sorting does not work.

I am not getting the common words.

Regards,
Sudhakaran.K

Peter Nikitka
Honored Contributor
Solution

Re: Finding the same content in two files.

Hi,

my previous solution will show exactly the requested output!

mfG Peter
sudhapage
Regular Advisor

Re: Finding the same content in two files.

Hi Peter,

Yes, it works. I want to use this command in our working environment. The example scenario is:

The first file contains the word "solaris".

The second file contains the words "install solaris".

In this scenario your command does not work!

It does not display the word "solaris" in the output.

Please suggest.

Regards,
Sudhakaran.K
Peter Godron
Honored Contributor

Re: Finding the same content in two files.

Hi,
based on my previous post:
$ cat a.pl
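#!/usr/bin/perl
# Print every word of the input, one word per line
while (<>) {
    @words = split(/\W+/);
    foreach (@words) {
        # split returns an empty first field when a line starts with a non-word character
        print "$_\n" if length;
    }
}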


$ cat a.sh
#!/usr/bin/sh
# Assumes a.pl is executable and sits in the current directory
# Create the index for file a
./a.pl a > a.out
# Remove any duplicates from index a
sort -uo a.sor a.out
rm a.out
# Create the index for file b
./a.pl b > b.out
# Remove any duplicates from index b
sort -uo b.sor b.out
rm b.out
# Print data that is common to both files
comm -12 a.sor b.sor
Arturo Galbiati
Esteemed Contributor

Re: Finding the same content in two files.

Hi,
this runs fine on my HP-UX 11i:

/tmp/> cat a
apple
orange
egg
/tmp/> cat b
sun
orange
space
/tmp/> grep -f a b
orange

This command searches file b for every word listed in file a and shows the lines that match.
I suggest using the biggest file as the first file. See man grep for further info.
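
One thing to watch: each line of file a is used as a regular expression and may match anywhere in a line of b, so "for" would also match "therefore". A possible tightening, assuming a POSIX grep with the -F and -x options, is sketched below. The function name common_whole_lines is invented for the example.

```shell
#!/bin/sh
# Sketch only: common_whole_lines is a hypothetical helper name.
# Prints the lines of file2 that are exactly equal to some line of file1.
common_whole_lines() {
    # -F  treat each pattern as a fixed string, not a regular expression
    # -x  accept a match only if it covers the whole line
    # -f  read the patterns, one per line, from the named file
    grep -Fx -f "$1" "$2"
}
```

Called as "common_whole_lines a b", it prints nothing and returns a non-zero status when the files share no line.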

HTH,
Art
James R. Ferguson
Acclaimed Contributor

Re: Finding the same content in two files.

Hi:

If you want to print each whitespace-delimited string on its own line, do this, for example:

# perl -nla -e 'print for (@F)' /etc/hosts

Splitting on non-"word" characters with '\W' breaks up things like IP addresses, etc. Compare the above output to:

# perl -nl -e '@F=split(/\W+/);print for (@F)' /etc/hosts
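
The same contrast can be sketched without Perl, assuming a POSIX tr. The two function names are made up, and the sample line is only an illustration of a hosts-style entry.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Split standard input on spaces and tabs, one string per output line.
split_on_whitespace() {
    tr -s ' \t' '\n\n' | sed '/^$/d'
}

# Split standard input on anything that is not a letter, digit or underscore,
# which is roughly what Perl's \W matches for ASCII text.
split_on_nonword() {
    tr -cs 'A-Za-z0-9_' '\n' | sed '/^$/d'
}

# Illustrative input line, shown with both splitters
sample='127.0.0.1 localhost loopback'
echo "$sample" | split_on_whitespace   # keeps 127.0.0.1 in one piece
echo "$sample" | split_on_nonword      # breaks the address at every dot
```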

Regards!

...JRF...
Hein van den Heuvel
Honored Contributor

Re: Finding the same content in two files.


Try this:

>perl -ne 'chomp; foreach (split) {if ($test) {print "$_\n" if delete $w{$_}} else {$w{$_}++}}; $test++ if eof' x.txt y.txt


In slow motion...


# perl -ne '           Start perl, looping over the input(s), using the next string as the program
#   chomp;             Drop the newline
#   foreach (split) {  Loop over the whitespace-separated words of each input line
#   if ($test) {       $test is set once the first file has reached EOF
#   print "$_\n" if    Print the word, but only if...
#   delete $w{$_}      ...deleting it returned a value, so it was present in the first file
#   } else {           EOF not seen yet
#   $w{$_}++}};        Remember each word in hash %w
#   $test++ if eof'    Switch gears when EOF is seen
#   x.txt y.txt        Sample input files

hth,
Hein.

Sandman!
Honored Contributor

Re: Finding the same content in two files.

Try the script below and invoke as:

# common.sh file1 file2

=======================================================
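#!/usr/bin/sh
# Writes the words of file1 that also occur in file2 to the file "cwords"

trap 'rm -f tmpfile' 0

while read line
do
  # Split the current line of file1 into one word per line
  echo "$line" | awk '{for(i=1;i<=NF;i++) print $i}' > tmpfile
  while read word
  do
    # Match the word at the start of a line or after a space, followed by
    # a space, an optional trailing period, or the end of the line
    grep -E "(^| )$word( |\.?\$)" "$2" > /dev/null
    if [ $? -eq 0 ]; then
      echo "$word" >> awords
    fi
  done < tmpfile
done < "$1"

if [ -s "awords" ]; then
  sort -uk1,1 awords > cwords && rm awords
fi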
=======================================================
Hein van den Heuvel
Honored Contributor

Re: Finding the same content in two files.

Hi Sandman,

It sure looks like your solution would work from a functional perspective, but from a performance perspective it is just horrible!
The old n-square problem at its best.
- awk launched for every line in the first file.
- grep launched for every word in the first file.
- The second file read completely, over and over again, for every word in the first file.
Yikes!

Please review Peter's solution, which nicely reduces each input to unique words first and then uses a simple tool to read and compare each sorted word list once.

Cheers!
Hein.


Peter Godron
Honored Contributor

Re: Finding the same content in two files.

Hi fellow posters,
I think we should give Sudhakaran some time to try out the various solutions and report back with his comments.