
Script Help

 
Cem Tugrul
Esteemed Contributor

Script Help

Hi forum,

Let's say I have a directory which includes thousands of files, and I want to
compare each file's contents with the others, one by one, and find the
duplicate files (files with identical contents/records).

Help....
Our greatest duty in this life is to help others. And please, if you can't
26 REPLIES
Steven E. Protter
Exalted Contributor

Re: Script Help

The command is probably diff

diff file1 file2

You can build a script to read file lists and create diff output.

Do you need help setting up such a looping script?

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Rodney Hills
Honored Contributor

Re: Script Help

I might do the following-

cksum * | sort

This runs cksum on all the files and then sorts by checksum value. Files with the same contents sort together with the same checksum value.

HTH

-- Rod Hills
There be dragons...
Fred Ruffet
Honored Contributor

Re: Script Help

Do you mean you want to suppress duplicate files, or duplicate lines across different files?

Case 1 corresponds to what SEP says (the diff solution).

In case 2, you could cat all the files through the sort and uniq commands and get one file with unrepeated records.
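
A minimal sketch of case 2, assuming the files are plain text in the current directory (the output path is just an example, and lives outside the directory so it isn't picked up by the glob):

cat * | sort | uniq > /tmp/unique_records.out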

Regards,

Fred
--

"Reality is just a point of view." (P. K. D.)
Ivajlo Yanakiev
Respected Contributor

Re: Script Help

You need a loop:

for i in *
do
    for n in *
    do
        # skip comparing a file with itself
        [ "$i" = "$n" ] && continue
        diff "$i" "$n" >> /tmp/whatever
    done
done

(Each pair is still compared twice, and identical files produce no diff output, so the pairs that write nothing are the matches.)



Cem Tugrul
Esteemed Contributor

Re: Script Help

Hi forum,
Thanks for all the answers...
Yes, I need help setting up such a looping script, urgently...
Please help...

And Fred, I need to compare the contents (records) of all the
files and find out "Ohh, these are the same files"...
But my file names are different, so maybe
the best approach is file size...
Our greatest duty in this life is to help others. And please, if you can't
Rodney Hills
Honored Contributor

Re: Script Help

If you are looking for files that are the same, what about my "cksum" solution?

It would be better than checking file size.

The "diff" solutions others have given show how files differ.

A little more explanation of what you have and why you are looking for "sameness" would help.

-- Rod Hills
There be dragons...
H.Merijn Brand (procura)
Honored Contributor
Solution

Re: Script Help

Would my answer in this thread be the start of your solution?

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=749983

You can extend it to report duplicates like:

use Digest::MD5 qw( md5_hex );
use Digest::SHA1 qw( sha1_hex );
use File::Find;

my %arr;
find (sub {
    -f or return;          # only look at plain files
    local $/;              # slurp mode: read each file in one go
    open my $p, "< $_" or die "$_: $!\n";
    my $f = <$p>;
    # combine two digests so accidental collisions are negligible
    my $sum = md5_hex ($f) . sha1_hex ($f);
    if (exists $arr{$sum}) {
        print "File $File::Find::name is the same as file $arr{$sum}\n";
        # unlink $_;       # uncomment to delete the duplicate
        return;
    }
    $arr{$sum} = $File::Find::name;
}, ".");

Enjoy, Have FUN! H.Merijn
Fred Ruffet
Honored Contributor

Re: Script Help

I think Rodney's solution is very good. Using diff between every combination of two files would make you parse each file a huge number of times, whereas running cksum once per file and working on a checksum file parses each file only once.
It should look like this:
cksum * > cksum.tmp
sort cksum.tmp > cksum.out
Then you can look at cksum.out. If two consecutive lines have the same checksum, they are the same file.
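
To report the duplicate pairs automatically, a small awk sketch over cksum.out could be (assuming the usual "checksum size filename" cksum output and no spaces in the file names):

awk '$1 == c && $2 == s { print f " and " $3 " appear identical" }
     { c = $1; s = $2; f = $3 }' cksum.out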

Regards,

Fred
--

"Reality is just a point of view." (P. K. D.)
Cem Tugrul
Esteemed Contributor

Re: Script Help

Hi,

Let's say i have a directory as below;

-rw------- 1 cemt bsp 6 Dec 3 08:17 a.txt
-rw------- 1 cemt bsp 6 Dec 3 08:17 b.txt
-rw------- 1 cemt bsp 6 Dec 3 08:18 c.txt
-rw------- 1 cemt bsp 9 Dec 3 08:22 d.txt
-rw------- 1 cemt bsp 6 Dec 3 08:22 e.txt

Now I try to find out which files are the same. If you are a magician, you can easily
say that "a.txt" and "c.txt" are the same file!
Why?
Before we even cat these 5 files we can ignore "d.txt", because its size is different from the others.
So let's cat each file:

$ cat a.txt
11111
$ cat b.txt
22222
$ cat c.txt
11111
$ cat e.txt
33333

And we decide that "a.txt" and "c.txt" are the same (repeated) file... is that clear?

Now, I have more than 2000 files and need to find the repeated files in a directory.
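
Applying Rodney's cksum approach to these five files (the checksum values below are illustrative, not real output):

$ cksum * | sort
1190697967 6 a.txt
1190697967 6 c.txt
2380520371 6 b.txt
3057824311 6 e.txt
4072937409 9 d.txt

a.txt and c.txt share a checksum (and size), so they are the duplicates.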


Our greatest duty in this life is to help others. And please, if you can't