Operating System - HP-UX

Re: comparing using diff or something else

 
u856100
Frequent Advisor

comparing using diff or something else

Chaps

I am trying to decide what method to use to compare two rather large files.

The two files have roughly 2.5 million records in each and each record consists of about 10 fields of approximately 30 characters each.

I want to attempt to 'diff' these files (or compare them in another way), and produce a log of the discrepancies.

Any ideas?

thanks a million
John
chicken or egg first?
7 REPLIES
Leif Halvarsson_2
Honored Contributor

Re: comparing using diff or something else

Hi

Perhaps "comm" will do the job. The files need to be sorted first. comm can then report
- Lines common to both files
- Lines only in the first file
- Lines only in the second file
in any combination. Have a look at "man comm".
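
To make the three categories concrete, here is a minimal Python sketch of the same idea. It is not comm itself, only an illustration of how one pass over two sorted inputs yields the three groups. The function name comm_like and its tags are made up for this example, and it assumes both inputs are already sorted in plain character order (for instance with LC_ALL=C sort).

```python
from typing import Iterable, Iterator, Tuple


def comm_like(a: Iterable[str], b: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (tag, line) pairs, where tag is 'only_a', 'only_b' or 'both'.

    Hypothetical helper: both inputs must already be sorted in the same
    plain character order, otherwise the result is meaningless.
    """
    it_a, it_b = iter(a), iter(b)
    line_a, line_b = next(it_a, None), next(it_b, None)

    # Walk both inputs in step, always advancing the side with the smaller line.
    while line_a is not None and line_b is not None:
        if line_a == line_b:
            yield ("both", line_a)
            line_a, line_b = next(it_a, None), next(it_b, None)
        elif line_a < line_b:
            yield ("only_a", line_a)
            line_a = next(it_a, None)
        else:
            yield ("only_b", line_b)
            line_b = next(it_b, None)

    # Whatever is left on either side has no partner in the other input.
    while line_a is not None:
        yield ("only_a", line_a)
        line_a = next(it_a, None)
    while line_b is not None:
        yield ("only_b", line_b)
        line_b = next(it_b, None)
```

Because it only holds one line from each input at a time, the file size should not matter much; the cost is in the sorting beforehand.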
James R. Ferguson
Acclaimed Contributor

Re: comparing using diff or something else

Hi John:

Given that the files are very large, you probably will need 'bdiff' which is 'diff' for "b"ig files. You might also look at 'cmp'.
See the man pages for more information on each of the above.

Regards!

...JRF...
u856100
Frequent Advisor

Re: comparing using diff or something else

thanks for your answers chaps,

for a file that is not going to grow beyond 2.5 million records, are cmp and comm suitable?

it's just that I prefer these two methods over diff.

thanks again
John
chicken or egg first?
James R. Ferguson
Acclaimed Contributor

Re: comparing using diff or something else

Hi (again) John:

In answer to your last question regarding the suitability of 'comm' and 'cmp' for million-record files, my advice is simply to try it.

"Your-mileage-may-vary" always applies. I have no experience with these utilities and files this large.

It is noteworthy, however, that 'bdiff', 'cmp' and 'comm' are described as being capable of handling largefiles. See the section "Text Processing Commands" in the "Large Files White Paper":

http://docs.hp.com/hpux/onlinedocs/os/lgfiles4.pdf

Regards!

...JRF...
A. Clay Stephenson
Acclaimed Contributor

Re: comparing using diff or something else

You failed to mention one very important aspect of the problem. Are you comparing textual data and are the records linefeed separated? If those conditions are true then bdiff is probably the weapon of choice but if this is binary data then the task becomes more difficult and may actually require a custom script (e.g. Perl) to analyze the deltas in some meaningful way.
If it ain't broke, I can fix that.
u856100
Frequent Advisor

Re: comparing using diff or something else

Hi Clay,

the characters are textual data that are pipe delimited. The data is address data, i.e.

131|real street|Richmond|London|UK ....etc

so based on this, you are suggesting bdiff is the man/woman for the job?

Bummer, I don't get on with diff

thanks a bunch for your help guys!

John
chicken or egg first?
Brian Kinney
Frequent Advisor
Solution

Re: comparing using diff or something else

One of the most painful parts about large diffs is that the files are unsorted. I highly recommend that you first massage your data so any "keys" like account numbers are your first field (awk would be a good choice for rearranging the fields), then run a sort on both files.
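
As a rough illustration of that preparation step, here is a small Python sketch. The function name rekey_and_sort and the key_index parameter are invented for this example, and which field is the real key in the address data is an assumption, not something stated in the thread. It also sorts in memory, which may not be practical for 2.5 million records; for files that size, awk plus sort(1) is probably the safer route.

```python
from typing import Iterable, List


def rekey_and_sort(lines: Iterable[str], key_index: int, sep: str = "|") -> List[str]:
    """Move the field at key_index to the front of each record, then sort by it.

    Hypothetical helper: key_index is whichever field uniquely identifies
    a record. Sorting is by the key as a string, so numeric keys need
    zero padding to sort in numeric order.
    """
    rekeyed = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        key = fields.pop(key_index)          # take the key out of its old position
        rekeyed.append(sep.join([key] + fields))

    # Sort on the key field only, so the order matches a later key-by-key compare.
    rekeyed.sort(key=lambda record: record.split(sep, 1)[0])
    return rekeyed
```

For example, rekey_and_sort(open("a.txt"), 1) would treat the second field as the key; a.txt is only a placeholder name.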

After the data is prepared, you may want to try writing a Perl script to get a more useful answer than bdiff or comm can give. If comm or bdiff is good enough, then ignore the rest of this message.

It *appears* your data is from some flat file database or spreadsheet, and you are looking for what has changed within a listing OR what has been removed from either list. Here's the PSEUDOCODE of what to accomplish. I'll use the word "key" to mean something like an account number, a value that is unique to each record.

read from a
read from b
repeat
    if a == b
        print a to MATCHED file
        read from a
        read from b
    else
        # Is this a modified entry? If so, print
        # out both a and b entries for my review.
        #
        # This is the only reason to write a
        # script instead of using bdiff or comm
        # if you don't need this specific data,
        # don't bother with the scripting.
        #
        if key[a] == key[b]
            print "WAS " a to MODIFIED file
            print "NOW " b to MODIFIED file
            read from a
            read from b
        else
            # non matching keys, so move on
            if key[a] < key[b]
                print a to NOT_IN_B file
                read from a
            else
                print b to NOT_IN_A file
                read from b
            endif
        endif
    endif
until (end of a) or (end of b)
while not (end of a)
    print a to NOT_IN_B file
    read from a
endwhile
while not (end of b)
    print b to NOT_IN_A file
    read from b
endwhile
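
For anyone who would rather start from something runnable, here is one possible Python rendering of the pseudocode above. It is a sketch, not a tested production script. The name compare_by_key is made up, and it assumes the key is the first pipe-delimited field, keys are unique within each file, both inputs are sorted by key in plain string order, and records carry no trailing newline.

```python
from typing import Iterable, Iterator, Optional, Tuple

Result = Tuple[str, Optional[str], Optional[str]]


def compare_by_key(a: Iterable[str], b: Iterable[str], sep: str = "|") -> Iterator[Result]:
    """Yield (tag, record_a, record_b) for two inputs sorted by their first field.

    Tags follow the pseudocode: MATCHED, MODIFIED, NOT_IN_B, NOT_IN_A.
    Hypothetical helper; the first field is assumed to be the unique key.
    """
    it_a, it_b = iter(a), iter(b)
    rec_a, rec_b = next(it_a, None), next(it_b, None)

    while rec_a is not None and rec_b is not None:
        key_a = rec_a.split(sep, 1)[0]
        key_b = rec_b.split(sep, 1)[0]
        if key_a == key_b:
            # Same key: either identical, or the record changed between files.
            tag = "MATCHED" if rec_a == rec_b else "MODIFIED"
            yield (tag, rec_a, rec_b)
            rec_a, rec_b = next(it_a, None), next(it_b, None)
        elif key_a < key_b:
            yield ("NOT_IN_B", rec_a, None)
            rec_a = next(it_a, None)
        else:
            yield ("NOT_IN_A", None, rec_b)
            rec_b = next(it_b, None)

    # Drain whichever input still has records.
    while rec_a is not None:
        yield ("NOT_IN_B", rec_a, None)
        rec_a = next(it_a, None)
    while rec_b is not None:
        yield ("NOT_IN_A", None, rec_b)
        rec_b = next(it_b, None)
```

The caller decides where each tag goes, for example writing "WAS " and "NOW " lines to a MODIFIED file. When reading from real files, strip the newline from each line first so the last record compares cleanly.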

"Any sufficiently advanced technology can be indistinguishable from magic" Arthur C. Clarke. My corollary - "Any advanced technology can be crushed with a sufficiently large enough rock."