Finding the same content in two files.

sudhapage · ‎10-19-2006

Hi all,

I want to find or list the content that two files have in common.

Ex: There are two files, Test1 and Test2, and both contain the same word, such as "example". If I run a command such as grep, diff, or cmp, it should display "example" along with any other words the two files have in common.

Which command will give me this information?

* It should also work on Solaris.

Regards,
Sudhakaran.K

James R. Ferguson · ‎10-19-2006

Hi:

By "word" I presume that you mean a whitespace (space, tab or newline) delimited string of characters.

If that's the case, you can create a simple Perl or awk script that uses hashes (associative arrays) to hold every word of each file. Then process the hashes according to what you want to report.
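
A minimal sketch of the hash idea, written here in awk. The function name common_words and the sample files are only illustrative, and mktemp is assumed to be available. On Solaris the old /usr/bin/awk may not accept this syntax, so nawk or /usr/xpg4/bin/awk may be needed instead.

```shell
#!/bin/sh
# common_words FILE1 FILE2 (hypothetical helper name)
# Print each whitespace-delimited word of FILE2 that also occurs in FILE1,
# once, in order of first appearance in FILE2.
common_words() {
    awk '
        # First file: remember every word as a key of the array "seen".
        # (NR == FNR is only true for the first file, provided it is not empty.)
        NR == FNR { for (i = 1; i <= NF; i++) seen[$i] = 1; next }
        # Second file: print a word the first time it is found in "seen".
        {
            for (i = 1; i <= NF; i++)
                if (($i in seen) && !($i in printed)) {
                    printed[$i] = 1
                    print $i
                }
        }
    ' "$1" "$2"
}

# Example with two throwaway files
dir=$(mktemp -d)
printf 'solaris\n' > "$dir/test1"
printf 'install solaris\n' > "$dir/test2"
common_words "$dir/test1" "$dir/test2"   # prints: solaris
rm -rf "$dir"
```

Each file is read only once, so the cost grows with the total number of words rather than with the product of the two file sizes.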

Regards!

...JRF...

sudhapage · ‎10-19-2006

Hi James,

I want the output without the spaces, newlines, and tabs.

If the files have any content in common, it should be shown.

Regards,
Sudhakaran.K

Sandman! · ‎10-19-2006

Are the common words you want complete words, strings embedded within other words, or both? For example, "for" occurs as a separate word on line 1 and as part of the string "therefore" on line 2:

now is the time for all
therefore

Peter Godron · ‎10-19-2006

Hi,
the way I read the request is:
1. Create an index of all the individual words in file1
2. Create an index of all the individual words in file2
3. Compare the two indexes and report on what words are common or not found

If this is not correct, could you please supply some example file1 and file2 and the expected output.
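
The three steps above can be sketched with standard tools. This is only one possible reading of the request: the helper names word_index and common_words are made up for illustration, "word" is taken to mean a whitespace-delimited string, and mktemp is assumed to be available.

```shell
#!/bin/sh
# word_index FILE (hypothetical helper)
# Steps 1 and 2: one whitespace-delimited word per line, sorted, duplicates removed.
word_index() {
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$1" | sort -u
}

# common_words FILE1 FILE2 (hypothetical helper)
# Step 3: compare the two indexes and print the words found in both.
common_words() {
    idx1=$(mktemp)
    idx2=$(mktemp)
    word_index "$1" > "$idx1"
    word_index "$2" > "$idx2"
    # -12 suppresses the columns of words unique to either index
    comm -12 "$idx1" "$idx2"
    rm -f "$idx1" "$idx2"
}

# Example with two throwaway files
dir=$(mktemp -d)
printf 'apple\norange\negg\n' > "$dir/a"
printf 'sun\norange\nspace\n' > "$dir/b"
common_words "$dir/a" "$dir/b"   # prints: orange
rm -rf "$dir"
```

Using comm -23 or comm -13 instead of comm -12 would report the words found in only one of the two files.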

Christian Tremblay · ‎10-19-2006

Take a look at the comm command; it may do what you need quite simply.
(The files have to be sorted, though.)

comm - select or reject lines common to two sorted files

For details, see man comm.

Chris

Peter Nikitka · ‎10-19-2006

Hi,

Here is an example using 'comm'; 'same' is taken here to mean 'same line'.

sort test1 >test1.sorted
sort test2 | comm -12 test1.sorted -

You will get all common lines.

mfG Peter

The Universe is a pretty big place, it's bigger than anything anyone has ever dreamed of before. So if it's just us, seems like an awful waste of space, right? Jodie Foster in "Contact"

sudhapage · ‎10-19-2006

Hi

Here is an explanation of exactly what I want.

_________________________________________

File "a" contents:

apple
orange
egg
_________________________________________

File "b" contents:

sun
orange
space
_________________________________________

My output should be: "orange"
_________________________________________

So I wish to see all the words that match between the two files.

Regards,
Sudhakaran.K

Andrew Cowan · ‎10-19-2006

Could you not just sort both files and then run them through diff?

sort file1 > f1
sort file2 > f2
diff f1 f2

You could also use "sort -u" to remove duplicates.
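
For comparison, a small sketch of what diff and comm each report on sorted copies of the sample files a and b from this thread. The temporary directory and the .sorted file names are just for illustration, and mktemp is assumed to be available.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf 'apple\norange\negg\n' > "$dir/a"
printf 'sun\norange\nspace\n' > "$dir/b"
sort "$dir/a" > "$dir/a.sorted"
sort "$dir/b" > "$dir/b.sorted"

# diff lists the lines that differ ("<" only in a, ">" only in b);
# the shared line "orange" does not appear in its output.
# diff exits non-zero when the files differ, hence the "|| true".
diff "$dir/a.sorted" "$dir/b.sorted" || true

# comm -12 suppresses columns 1 and 2 (lines unique to either file),
# leaving only the lines common to both.
comm -12 "$dir/a.sorted" "$dir/b.sorted"   # prints: orange

rm -rf "$dir"
```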

sudhapage · ‎10-19-2006

Hi Andrew,

Using the diff command after sorting does not work.

It does not show me the words that are the same.

Regards,
Sudhakaran.K

Peter Nikitka · ‎10-19-2006

Hi,

my previous solution will show exactly the requested output!

mfG Peter


sudhapage · ‎10-19-2006

Hi peter,

Yes, it's working. I want to use this command in my working environment. An example scenario:

The first file contains the word "solaris".

The second file contains the words "install solaris".

In this scenario your command does not work!

It does not display the word "solaris" in the output.

Please suggest.

Regards,
Sudhakaran.K

Peter Godron · ‎10-19-2006

Hi,
based on my previous post:
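$ cat a.pl
#!/usr/bin/perl
# Print every word of the input, one per line
while (<>) {
    # Split the line on runs of non-word characters
    @words = split(/\W+/);
    foreach (@words) {
        # A line that starts with a non-word character yields an empty first field
        print "$_\n" if length;
    }
}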

$ cat a.sh
#!/usr/bin/sh
# Create the index for file a
./a.pl a > a.out
# Remove any duplicates from index a
sort -uo a.sor a.out
rm a.out
# Create the index for file b
./a.pl b > b.out
# Remove any duplicates out of index b
sort -uo b.sor b.out
rm b.out
# Print data that is common to both files
comm -12 a.sor b.sor

Arturo Galbiati · ‎10-19-2006

Hi,
this runs fine on my HP-UX 11i:

/tmp/> cat a
apple
orange
egg
/tmp/> cat b
sun
orange
space
/tmp/> grep -f a b
orange

This command searches file b for every pattern listed in file a (one per line) and prints the matching lines of b.
I suggest using the biggest file as the first file. See man grep for further info.
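
One caveat, sketched below on made-up sample data: each line of the pattern file is treated as a regular expression that may match anywhere in a line, and grep prints whole lines rather than words. The -F and -x options shown are standard POSIX grep options; on Solaris they may require /usr/xpg4/bin/grep rather than /usr/bin/grep. mktemp is assumed to be available.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf 'for\n' > "$dir/a"
printf 'therefore\nnow is the time for all\n' > "$dir/b"

# "for" matches inside "therefore" as well as the separate word "for",
# so both lines of b are printed in full.
grep -f "$dir/a" "$dir/b"

# -F treats the patterns as fixed strings and -x requires a whole-line match.
# Nothing is printed here because no line of b is exactly "for"
# (grep then exits non-zero, hence the "|| true").
grep -Fxf "$dir/a" "$dir/b" || true

rm -rf "$dir"
```

With one word per line in both files, as in the apple/orange/egg example, grep -Fxf behaves like the whole-line comparison done by comm -12, without requiring sorted input.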

HTH,
Art

James R. Ferguson · ‎10-20-2006

Hi:

If you want to create output of each whitespace delimited string, do this, for example:

# perl -nla -e 'print for (@F)' /etc/hosts

Splitting on non-"word" characters with '\W' decomposes things like IP addresses, etc. Compare the above output to:

# perl -nl -e '@F=split(/\W+/);print for (@F)' /etc/hosts

Regards!

...JRF...

Hein van den Heuvel · ‎10-20-2006

Try this:

>perl -ne 'chomp; foreach (split) {if ($test) {print "$_\n" if delete $w{$_}} else {$w{$_}++}}; $test++ if eof' x.txt y.txt

In slow motion...

#>perl -ne '                 Start perl looping over input(s) using next string as program
#   chomp;                   Drop newline
#   foreach (split) {        Loop over words in each input line, split by whitespace
#     if ($test) {           $test is set when first file gives EOF
#       print "$_\n" if      Print the word but only if...
#         delete $w{$_}      There was one deleted, thus present
#     } else {               Not seen EOF yet
#       $w{$_}++}};          Remember each word in hash %w
#   $test++ if eof'          Switch gears when EOF seen
# x.txt y.txt                Sample input files.

hth,
Hein.

Sandman! · ‎10-20-2006

Try the script below and invoke as:

# common.sh file1 file2

=======================================================
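#!/usr/bin/sh
# Writes the words of file1 that also occur in file2 to the file "cwords".
# Note: each word is used as a regular expression, so words containing
# characters such as . or * may match more than intended.

trap 'rm -f tmpfile' 0
rm -f awords

while read line
do
    # Break the line into one word per line
    echo "$line" | awk '{for(i=1;i<=NF;i++) print $i}' > tmpfile
    while read word
    do
        # Match the word at the start of a line, at the end of a line, or
        # between spaces, optionally followed by a period
        grep -E "(^| )$word[.]?( |\$)" "$2" > /dev/null
        if [ $? -eq 0 ]; then
            echo "$word" >> awords
        fi
    done < tmpfile
done < "$1"

if [ -s "awords" ]; then
    sort -u awords > cwords && rm awords
fi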
=======================================================

Hein van den Heuvel · ‎10-22-2006

Hi Sandman,

It sure looks like your solution would work from a functional perspective, but from a performance perspective it is just horrible!
The old n-square problem at its best.
- awk launched for every line in the first file.
- grep launched for every word in the first file.
- The second file read completely, over and over again, for every word in the first file.
Yikes!

Please review Peter Godron's solution, which nicely reduces each input to unique words first and then uses a simple tool to read and compare each sorted word list once.

Cheers!
Hein.

Peter Godron · ‎10-22-2006

Hi fellow posters,
I think we should give Sudhakaran some time to try out the various solutions and report back with his comments.
