topic Re: Script help. Perl perhaps ? in Operating System - Linux

Script help. Perl perhaps ?

Luke Morgan — Mon, 26 Jun 2006 05:22:15 GMT

Hi.
I have written the script below to do a simple lookup but now I need to do it on a much larger datafile and it will take an age.
I suspect perl is the way to go to make it faster, but I don't know any perl :(

Could someone please help me translate this script, to save my server days of processing?
Thanks in advance.

The script looks up each line of a data file, compares certain fields with fields from a master file, and outputs an id value from the master file along with certain fields from the data file.

masterfile=/tmp/masterfile
datafile=/tmp/datafile

for b in `cat $datafile`
do

compare=`echo $b | awk 'BEGIN{FS="|"}{data = $2$5$6;print data}END{}' `

for a in `cat $masterfile`
do
mastercompare=`echo $a | awk 'BEGIN{FS="|"}{line = $2$5$6;print line}END{}'`
if [ $mastercompare = $compare ]
then
id=`echo $a | awk 'BEGIN{FS="|"}{print $1}END{}'`
output=`echo $b | awk 'BEGIN{FS="|";OFS="|"}{print $3,$7}END{}'`
echo $id"|"$output >> luke.out
fi
done

done

Re: Script help. Perl perhaps ?

Peter Nikitka — Mon, 26 Jun 2006 05:59:32 GMT

Hi,

My suggestion reads the masterfile and the datafile only once each, storing the masterfile ids in an array keyed on the compare fields (untested). This should be MUCH faster:

awk -F'|' 'BEGIN { OFS="|";
while ((getline < "/tmp/masterfile") == 1) id[$2$5$6] = $1 }
{ line=$2$5$6; if(id[line]) print (id[line],$3,$7) }' /tmp/datafile >luke.out

Kind regards, Peter

Re: Script help. Perl perhaps ?

Peter Nikitka — Mon, 26 Jun 2006 06:01:45 GMT

Hi,

to be clean, close the masterfile after reading:

awk -F'|' 'BEGIN { OFS="|"
while ((getline < "/tmp/masterfile") == 1) id[$2$5$6] = $1
close ("/tmp/masterfile") }
{ line=$2$5$6; if(id[line]) print (id[line],$3,$7) }' /tmp/datafile >luke.out

Kind regards, Peter
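
For readers who want to try the approach from the two posts above, here is a minimal sketch on made-up data. The sample lines, the temporary directory and the names master, key and result are assumptions for illustration only, and the field layout is the one described in the thread. It differs from the posted one-liner in two ways: the key keeps the separator between the fields, and the lookup uses "key in id", which does not create array entries and still matches when an id is 0 or empty.

```shell
#!/bin/sh
# Hypothetical sample data: $1 is the id in the master file,
# $2/$5/$6 are the compare fields, $3/$7 are copied from the data file.
tmp=$(mktemp -d)
cat > "$tmp/masterfile" <<'EOF'
ID1|1234|x|x|56|A|x
ID2|4321|x|x|65|B|x
EOF
cat > "$tmp/datafile" <<'EOF'
r1|1234|foo|x|56|A|bar
r2|9999|nope|x|00|Z|nope
r3|4321|baz|x|65|B|qux
EOF

result=$(awk -F'|' -v master="$tmp/masterfile" 'BEGIN { OFS = "|"
    # Load the master file once: key = compare fields, value = id.
    while ((getline < master) == 1) id[$2 FS $5 FS $6] = $1
    close(master) }
{ key = $2 FS $5 FS $6
  # Membership test; does not add missing keys to the array.
  if (key in id) print id[key], $3, $7 }' "$tmp/datafile")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Each file is read exactly once and no extra process is started per line, which is where the speed-up over the nested shell loops comes from.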

Re: Script help. Perl perhaps ?

H.Merijn Brand (procura — Mon, 26 Jun 2006 06:06:48 GMT

Yes, in this case perl would be much faster.

--8<---
#!/usr/bin/perl

use strict;
use warnings;

my $masterfile = "/tmp/masterfile";
my $datafile   = "/tmp/datafile";

open my $out, ">", "luke.out" or die "luke.out: $!\n";

# Index the master file once, keyed on fields 2, 5 and 6.
my %master;
open my $mst, "<", $masterfile or die "$masterfile: $!\n";
while (<$mst>) {
    chomp;
    my @mst = split /\|/, $_;
    $master{join "|", @mst[1,4,5]} = [ @mst[0,2,6] ];
}
close $mst;

# Stream the data file and print the id for every matching line.
open my $dta, "<", $datafile or die "$datafile: $!\n";
while (<$dta>) {
    chomp;    # otherwise a trailing field 7 carries its newline into the output
    my @dta = split /\|/, $_;
    my $cmp = join "|", @dta[1,4,5];
    exists $master{$cmp} or next;

    my $id = $master{$cmp}[0];
    print $out "$id|$dta[2]|$dta[6]\n";
}
close $dta;
close $out;
-->8---

You could gain even more speed if you told us the format of the fields, so that the join "|" calls could be changed to pack.

Enjoy, Have FUN! H.Merijn

Re: Script help. Perl perhaps ?

H.Merijn Brand (procura — Mon, 26 Jun 2006 06:15:36 GMT

Peter, I like your awk solution, which is pretty close to what I do in perl, but yours would be safer if you included the separator in the key.

Since we were not told what the data looks like, your script would map both (12, 345, 6789) and (1, 23456, 789) to the same key.

Enjoy, Have FUN! H.Merijn
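
The collision described above is easy to reproduce. The sketch below puts the two triples from the post into fields 2, 5 and 6 of two made-up lines (the other fields and the variable names are placeholders) and counts the distinct keys each way.

```shell
#!/bin/sh
# The two triples from the post, as fields 2, 5 and 6 of '|'-separated lines.
input='a|12|x|x|345|6789
b|1|x|x|23456|789'

# Plain concatenation: both lines produce the key 123456789.
plain=$(printf '%s\n' "$input" |
  awk -F'|' '{ seen[$2 $5 $6] = 1 } END { n = 0; for (k in seen) n++; print n }')

# Separator kept in the key: 12|345|6789 and 1|23456|789 stay distinct.
joined=$(printf '%s\n' "$input" |
  awk -F'|' '{ seen[$2 FS $5 FS $6] = 1 } END { n = 0; for (k in seen) n++; print n }')

echo "distinct keys without separator: $plain"
echo "distinct keys with separator: $joined"
```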

Re: Script help. Perl perhaps ?

Luke Morgan — Mon, 26 Jun 2006 06:22:39 GMT

Thank you both very much for your suggestions.
I have implemented Peter's script and the difference in speed is astonishing!

FYI, the format of the data is this :
$2 is a four digit number
$5 is a two digit number
$6 is a single character

Thanks again

Luke
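
With the widths Luke gives, plain concatenation always yields a seven-character key, so it cannot collide provided every line really follows that layout. A minimal sketch of a check for that, on made-up sample lines (the data and the variable names are assumptions), lists the line numbers that do not match the stated format:

```shell
#!/bin/sh
# Hypothetical sample lines; only the second one matches the stated layout.
input='r1|123|foo|x|56|A|bar
r2|1234|foo|x|56|A|bar
r3|1234|foo|x|5|AB|bar'

# Print the number of every line where $2 is not four digits,
# $5 is not two digits, or $6 is not exactly one character.
bad=$(printf '%s\n' "$input" | awk -F'|' '
  $2 !~ /^[0-9][0-9][0-9][0-9]$/ || $5 !~ /^[0-9][0-9]$/ || length($6) != 1 { print NR }')

printf 'lines not matching the format:\n%s\n' "$bad"
```

If this prints no line numbers for the real files, the key without separator is safe for this data. Keeping the separator in the key costs little and removes the need for the check.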

Re: Script help. Perl perhaps ?

H.Merijn Brand (procura — Mon, 26 Jun 2006 06:27:34 GMT

Please bear in mind my last remark about the generated keys in the awk solution!

--8<--- perl with pack
#!/usr/bin/perl

use strict;
use warnings;

my $masterfile = "/tmp/masterfile";
my $datafile   = "/tmp/datafile";

open my $out, ">", "luke.out" or die "luke.out: $!\n";

# Index the master file once; the key packs fields 2 and 5 as shorts
# and field 6 as a single character.
my %master;
open my $mst, "<", $masterfile or die "$masterfile: $!\n";
while (<$mst>) {
    chomp;
    my @mst = split /\|/, $_;
    $master{pack "ssA", @mst[1,4,5]} = $mst[0];
}
close $mst;

# Stream the data file and print the id for every matching line.
open my $dta, "<", $datafile or die "$datafile: $!\n";
while (<$dta>) {
    chomp;    # otherwise a trailing field 7 carries its newline into the output
    my @dta = split /\|/, $_;
    my $cmp = pack "ssA", @dta[1,4,5];
    exists $master{$cmp} or next;

    my $id = $master{$cmp};
    print $out "$id|$dta[2]|$dta[6]\n";
}
close $dta;
close $out;
-->8---

Enjoy, Have FUN! H.Merijn

Re: Script help. Perl perhaps ?

Peter Nikitka — Mon, 26 Jun 2006 06:57:28 GMT

Hi,

Procura is totally correct in his remark. To apply it to my awk solution, simply add the field separator to the key:

awk -F'|' 'BEGIN { OFS="|"
while ((getline < "/tmp/masterfile") == 1) id[$2$FS$5$FS$6] = $1
close ("/tmp/masterfile") }
{ line=$2$FS$5$FS$6; if(id[line]) print (id[line],$3,$7) }' /tmp/datafile >luke.out

Kind regards, Peter

Re: Script help. Perl perhaps ?

Peter Nikitka — Mon, 26 Jun 2006 10:28:55 GMT

Oops,

I do not normally complain about additional dollars :-).
But here it works better to use

id[$2FS$5FS$6] = $1
instead of
id[$2$FS$5$FS$6] = $1

and likewise line=$2FS$5FS$6 in the main block.

Kind regards, Peter
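
The reason the extra dollar matters: in awk, FS is the separator string itself, while $FS is a field reference. With FS set to "|", the string converts to the number 0, so $FS means $0, the whole input line. A small sketch on a made-up line (the sample data and variable names are assumptions) shows the difference:

```shell
#!/bin/sh
line='a|b|c|d|e|f'

# FS is the separator string, so this builds the key b|e|f.
with_fs=$(printf '%s\n' "$line" | awk -F'|' '{ print $2 FS $5 FS $6 }')

# $FS is evaluated as $0 here, so the whole line is pasted
# into the key twice.
with_dollar_fs=$(printf '%s\n' "$line" | awk -F'|' '{ print $2 $FS $5 $FS $6 }')

echo "FS:  $with_fs"
echo "\$FS: $with_dollar_fs"
```

The $FS variant still produces matching keys only when the entire master line equals the entire data line, which is not what the lookup needs, and it makes every key far longer than necessary.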