Operating System - HP-UX
1834459 Members
2782 Online
110067 Solutions
New Discussion

Need a perl script for below mentioned requirement.

 
Swetha reddy
Occasional Contributor

Need a perl script for below mentioned requirement.

there are two ascii files say with same header (column values)

old ascii file:
column1||column2|column3|column4

new ascii file:

column1|column2|column3|column4

both the ascii files contains millions of records.

here i need to compare both the files ..

onething to say that old ascii file fourth column contains all the junk data values and in the new ascii file fourth column contains the proper data.

Here i need to compare both the files and give the comparison report saying that all of the other three columns values in both are matching or not. excluding the fourth column values as cannot compare the junk data with the proper data anyway in this scenario all of them will fail.

can we exclude the fourt column and test with the remaining three column values to compare both new and old ascii file.

Thanks in advance
Swetha





3 REPLIES 3
Hein van den Heuvel
Honored Contributor

Re: Need a perl script for below mentioned requirement.

q1) is the field seperator indeed a pipe "|"

q2) are the files pre-sorted.

q3) can one column, or column combination be treated as a (unique) key and the other(s) as data values?

q4) any chance one file would fit in memory?
If so, just read one file entirely and store into a hash, either keyed by the unique key columns or by the whole record and count occurences. Next read the other file and lookup in first. If found, subtrace count or delete hash entry. If not found report. At end of second file report remaining (non-zero) entries in hash from first file.


q5) ever checked out perl and (DBM) databases?
Read this for example, and google for more: http://www.unix.org.ua/orelly/perl/perlnut/ch12_01.htm

Here is what I would do though...

For each file, cut first 3 columns, pipe into sort and save.
Now 'diff', sdiff, comm or uniq the 3-columns-and-sorted work files to your hearts content.
Expirement with 100 or so records in each first of course.


Enjoy!
Hein.
Swetha reddy
Occasional Contributor

Re: Need a perl script for below mentioned requirement.

q1) is the field seperator indeed a pipe "|"

yes

q2) are the files pre-sorted.

yes

For each file, cut first 3 columns, pipe into sort and save.
Now 'diff', sdiff, comm or uniq the 3-columns-and-sorted work files to your hearts content.
Expirement with 100 or so records in each first of course.

if it is like 2 column. means 1,3 and 4 i need comparision.

example:

first file:

col1|col2|col3|col4
2|3|ad|2
2|2|addad|2
2|23|dsdsfd|3

second file :
col1|col2|col3|col4
2|3|3|2
2|2|3|2
2|23|3|3

on comparison it should not check the column 3 values. as the first file 3rd column value contains all the junk data and second one contains the real data. in this case if compare all of the rows get failed. we need to aviod this column.

Output should be like this:
Total number of records: XXX
Total number of matched records :xxx
Total number of unmatched records: xxx





Hein van den Heuvel
Honored Contributor

Re: Need a perl script for below mentioned requirement.

It's a good thing you showed that sample data.
For starters... that data is NOT sorted.
"2|3" does not come before "2|2" best I know.

Next, even if it were sorted, it is not 'obvious' that you intended numeric comparisons.
So the script not only has to pick apart the records, but compare based on field type, or normalize, and then compare.

I had a sample compare script floating around which I adapted to your needs.
See below, and attached.

Be sure to try it on small files first.

Good luck,

Hein.


use strict;
my ($f1, $f2, $key_1, $key_2, $common, $file_1, $file_2, $read_1, $read_2);
$f1 = @ARGV[0];
$f2 = @ARGV[1] or die "Must provide two files to compare";
open (F1, $f1) or die "Error open file 1: $f1";
open (F2, $f2) or die "Error open file 2: $f2";
my $high = "999999999|999999999|x|999999999";
; ; # skip headers
$common = $file_1 = $file_2 = 0;

while (1) {

# compare keys, and flag which files to read.

if ($key_1 eq $key_2) {
$common++;
$read_1 = $read_2 = 1;
} else {
if ($key_1 lt $key_2) {
print "F1: $key_1\n";
$file_1++;
$read_1++;
} else {
print "F2: $key_2\n";
$file_2++;
$read_2++;
}
}

# read from file with lower key value, or both if equal.

if ($read_1) {
$read_1 = 0;
$_ = ;
if (eof) {
last if eof F2;
$_ = $high;
}
# $key_1 = join (':',(split /\|/)[0,1,3]);
$key_1 = sprintf ("%09d:%09d:%09d",(split /\|/)[0,1,3]);
# print "1: $key_1\n";
}

if ($read_2) {
$read_2 = 0;
$_ = ;
if (eof) {
last if eof F1;
$_ = $high;
}
# $key_2 = join (':',(split /\|/)[0,1,3]);
$key_2 = sprintf ("%09d:%09d:%09d",(split /\|/)[0,1,3]);
# print "2: $key_2\n";
}
}

print "\n$common common records.\n";
print "$file_1 only in $f1\n" if $file_1;
print "$file_2 only in $f2\n" if $file_2;
print "No differences found\n" unless ($file_1 or $file_2);