topic Re: comparing using diff or something else in Operating System - HP-UX

comparing using diff or something else

u856100 — Tue, 17 Sep 2002 12:09:29 GMT

Chaps

I am trying to decide what method to use to compare two rather large files.

The two files have roughly 2.5 million records in each and each record consists of about 10 fields of approximately 30 characters each.

I want to attempt to 'diff' these files (or compare them in another way), and produce a log of the discrepancies etc.

Any ideas?

thanks a million
John

Re: comparing using diff or something else

Leif Halvarsson_2 — Tue, 17 Sep 2002 12:22:37 GMT

Hi

Perhaps "comm" will do the job. The files need to be sorted. Then comm can report
- Lines common to both files
- Lines only in the first file
- Lines only in the second file
in any combination. Have a look at "man comm".
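
To make the three categories concrete, here is a minimal Python sketch of the kind of merge comm performs on two sorted inputs. It is an illustration only, not comm's actual implementation; the name comm_like is made up for this example, and it assumes both inputs are already sorted in the same order that Python's string comparison uses (roughly what sorting in the C locale gives).

```python
def comm_like(lines_a, lines_b):
    """Classify lines from two sorted iterables into comm's three groups.

    Hypothetical helper for illustration. Assumes both inputs are sorted
    in Python's default string order.
    """
    only_a, only_b, common = [], [], []
    it_a, it_b = iter(lines_a), iter(lines_b)
    a, b = next(it_a, None), next(it_b, None)

    # Walk both inputs in step, always advancing the side that is behind.
    while a is not None and b is not None:
        if a == b:
            common.append(a)
            a, b = next(it_a, None), next(it_b, None)
        elif a < b:
            only_a.append(a)
            a = next(it_a, None)
        else:
            only_b.append(b)
            b = next(it_b, None)

    # Whatever is left on one side has no partner on the other.
    while a is not None:
        only_a.append(a)
        a = next(it_a, None)
    while b is not None:
        only_b.append(b)
        b = next(it_b, None)

    return only_a, only_b, common
```

Because each input is read once from front to back, this approach does not need to hold both files in memory if the lists are replaced by writes to output files.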

Re: comparing using diff or something else

James R. Ferguson — Tue, 17 Sep 2002 12:24:05 GMT

Hi John:

Given that the files are very large, you probably will need 'bdiff' which is 'diff' for "b"ig files. You might also look at 'cmp'.
See the man pages for more information on each of the above.

Regards!

...JRF...

Re: comparing using diff or something else

u856100 — Tue, 17 Sep 2002 12:28:17 GMT

thanks for your answers chaps,

for a file that is not going to grow beyond 2.5 million records, are cmp and comm suitable?

it's just that I prefer these two methods over diff.

thanks again
John

Re: comparing using diff or something else

James R. Ferguson — Tue, 17 Sep 2002 12:45:20 GMT

Hi (again) John:

In answer to your last question regarding the suitability of 'comm' and 'cmp' for million-record files, my advice is simply to try it.

"Your mileage may vary" always applies. I have no experience with these utilities and files this large.

It is noteworthy, however, that 'bdiff', 'cmp' and 'comm' are described as being capable of handling large files. See the section "Text Processing Commands" in the "Large Files White Paper":

http://docs.hp.com/hpux/onlinedocs/os/lgfiles4.pdf

Regards!

...JRF...

Re: comparing using diff or something else

A. Clay Stephenson — Tue, 17 Sep 2002 12:51:42 GMT

You failed to mention one very important aspect of the problem. Are you comparing textual data and are the records linefeed separated? If those conditions are true then bdiff is probably the weapon of choice, but if this is binary data then the task becomes more difficult and may actually require a custom script (e.g. Perl) to analyze the deltas in some meaningful way.

Re: comparing using diff or something else

u856100 — Tue, 17 Sep 2002 13:01:41 GMT

Hi Clay,

the characters are textual data that are pipe delimited. The data is address data, i.e.

131|real street|Richmond|London|UK ....etc

So based on this, you are suggesting bdiff is the man/woman for the job.

Bummer, I don't get on with diff

thanks a bunch for your help guys!

John

Re: comparing using diff or something else

Brian Kinney — Thu, 19 Sep 2002 12:41:05 GMT

One of the most painful parts about large diffs is that the files are unsorted. I highly recommend that you first massage your data so any "keys" like account numbers are your first field (awk would be a good choice to manipulate this data around), then run a sort on both files.
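
As a rough illustration of that preparation step, here is a small Python sketch that moves a key field to the front of each pipe-delimited record and then sorts. The function names and the key position are assumptions made up for this example; the thread does not say which field is the key. It sorts in memory, which is fine for a sketch, but for 2.5 million records an external sort(1) after the field shuffle is probably the more practical route.

```python
def key_first(record, key_index, sep="|"):
    """Return the record with the field at key_index moved to the front.

    Hypothetical helper: key_index is an assumption, since the real key
    column depends on the data.
    """
    fields = record.rstrip("\n").split(sep)
    key = fields.pop(key_index)
    return sep.join([key] + fields)


def prepare(records, key_index, sep="|"):
    """Move the key to the front of every record, then sort the result."""
    return sorted(key_first(r, key_index, sep) for r in records)
```

For example, prepare(lines, 1) would treat the second field of every line as the key. Both files have to be prepared with the same key position and the same sort order, otherwise a later merge-style comparison will report false differences.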

After the data is prepared, you may want to try writing a Perl script to get a more useful answer than bdiff or comm can give. If comm or bdiff is good enough, then ignore the rest of this message.

It *appears* your data is from some flat file database or spreadsheet, and you are looking for what has changed within a listing OR what has been removed from either list. Here's the PSEUDOCODE of what to accomplish. I'll use the word "key" as an account number, something that is unique to all records.

# "end of a" means the last read from a found no more records.
read from a
read from b
while not (end of a) and not (end of b)
    if a == b
        print a to MATCHED file
        read from a
        read from b
    else
        # Is this a modified entry? If so, print
        # out both a and b entries for my review.
        #
        # This is the only reason to write a
        # script instead of using bdiff or comm.
        # If you don't need this specific data,
        # don't bother with the scripting.
        #
        if key[a] == key[b]
            print "WAS " a to MODIFIED file
            print "NOW " b to MODIFIED file
            read from a
            read from b
        else
            # Non-matching keys, so move on.
            if key[a] < key[b]
                print a to NOT_IN_B file
                read from a
            else
                print b to NOT_IN_A file
                read from b
            endif
        endif
    endif
endwhile
while not (end of a)
    print a to NOT_IN_B file
    read from a
endwhile
while not (end of b)
    print b to NOT_IN_A file
    read from b
endwhile
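
For anyone who would rather start from something runnable, here is one possible Python rendering of the pseudocode above. It is a sketch under assumptions, not a tested production script: it assumes the key is the first pipe-delimited field, that keys are unique within each file, and that both inputs are sorted by that key in plain string order. The names key_of and compare_sorted are invented for this example. Instead of writing to four files it yields (tag, record) pairs, so the caller decides where each category goes.

```python
def key_of(record):
    # Assumption: the key is the first pipe-delimited field.
    return record.split("|", 1)[0]


def compare_sorted(recs_a, recs_b):
    """Yield (tag, record) pairs for two key-sorted record streams.

    Tags: MATCHED, WAS/NOW (same key, different content),
    NOT_IN_B (only in a), NOT_IN_A (only in b).
    """
    it_a, it_b = iter(recs_a), iter(recs_b)
    a, b = next(it_a, None), next(it_b, None)

    while a is not None and b is not None:
        if a == b:
            yield ("MATCHED", a)
            a, b = next(it_a, None), next(it_b, None)
        elif key_of(a) == key_of(b):
            # Same key, different content: a modified entry.
            yield ("WAS", a)
            yield ("NOW", b)
            a, b = next(it_a, None), next(it_b, None)
        elif key_of(a) < key_of(b):
            yield ("NOT_IN_B", a)
            a = next(it_a, None)
        else:
            yield ("NOT_IN_A", b)
            b = next(it_b, None)

    # Drain whichever input still has records.
    while a is not None:
        yield ("NOT_IN_B", a)
        a = next(it_a, None)
    while b is not None:
        yield ("NOT_IN_A", b)
        b = next(it_b, None)
```

Passing two open file objects (with trailing newlines stripped as needed) keeps memory use flat, since only one record from each file is held at a time. Note that the keys are compared as strings, so "131" sorts before "20"; the files must be sorted the same way for the results to be meaningful.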