
Luke Morgan
Frequent Advisor

Script help. Perl perhaps ?

Hi.
I have written the script below to do a simple lookup but now I need to do it on a much larger datafile and it will take an age.
I suspect perl is the way to go to make it faster, but I don't know any perl :(

Could someone help me to translate this script please to save my server days of processing?
Thanks in advance.

The script looks up each line of a data file, compares certain fields with fields from a master file, and outputs an id value from the master file along with certain fields from the data file.

masterfile=/tmp/masterfile
datafile=/tmp/datafile

for b in `cat $datafile`
do

compare=`echo $b | awk 'BEGIN{FS="|"}{data = $2$5$6;print data}END{}' `

for a in `cat $masterfile`
do
mastercompare=`echo $a | awk 'BEGIN{FS="|"}{line = $2$5$6;print line}END{}'`
if [ $mastercompare = $compare ]
then
id=`echo $a | awk 'BEGIN{FS="|"}{print $1}END{}'`
output=`echo $b | awk 'BEGIN{FS="|";OFS="|"}{print $3,$7}END{}'`
echo $id"|"$output >> luke.out
fi
done

done
Peter Nikitka
Honored Contributor
Solution

Re: Script help. Perl perhaps ?

Hi,

My suggestion reads the masterfile and the datafile only once each and keeps the master ids in an array indexed by the compare key (untested). This should be MUCH faster:

awk -F'|' 'BEGIN { OFS="|";
while ((getline < "/tmp/masterfile") == 1) id[$2$5$6] = $1 }
{ line=$2$5$6; if(id[line]) print (id[line],$3,$7) }' /tmp/datafile >luke.out

mfG Peter
The Universe is a pretty big place, it's bigger than anything anyone has ever dreamed of before. So if it's just us, seems like an awful waste of space, right? Jodie Foster in "Contact"
Peter Nikitka
Honored Contributor

Re: Script help. Perl perhaps ?

Hi,

to be clean, close the masterfile after reading:


awk -F'|' 'BEGIN { OFS="|"
while ((getline < "/tmp/masterfile") == 1) id[$2$5$6] = $1
close ("/tmp/masterfile") }
{ line=$2$5$6; if(id[line]) print (id[line],$3,$7) }' /tmp/datafile >luke.out
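
One possible refinement of the command above, offered as an untested-in-production sketch: the test `if(id[line])` is false when a matching id is `0` or empty, and every failed lookup adds an empty element to the array. Testing membership with awk's `in` operator avoids both. The function name `lookup` and passing the master path through `-v master=...` are choices made for this illustration only; the key is the same plain concatenation as in the post.

```shell
#!/bin/sh
# lookup MASTERFILE DATAFILE
# Prints "id|field3|field7" for every DATAFILE line whose fields 2, 5 and 6
# match a MASTERFILE line. "lookup" is a hypothetical helper name.
lookup() {
    awk -F'|' -v OFS='|' -v master="$1" '
        BEGIN {
            # Read the master file once and remember the id for each key.
            while ((getline < master) > 0)
                id[$2 $5 $6] = $1
            close(master)
        }
        {
            key = $2 $5 $6
            # "in" checks for existence only: an id of 0 still matches,
            # and a miss does not create a new array element.
            if (key in id)
                print id[key], $3, $7
        }
    ' "$2"
}
```

It would be called as `lookup /tmp/masterfile /tmp/datafile > luke.out`.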

mfG Peter
H.Merijn Brand (procura
Honored Contributor

Re: Script help. Perl perhaps ?

Yes, in this case perl would be dramatically faster.

--8<---
#!/usr/bin/perl

use strict;
use warnings;

my $masterfile = "/tmp/masterfile";
my $datafile = "/tmp/datafile";

open my $out, ">", "luke.out" or die "luke.out: $!\n";

my %master;
open my $mst, "<", $masterfile or die "$masterfile: $!\n";
while (<$mst>) {
chomp; # drop the newline so a key field at the end of the line compares cleanly
my @mst = split /\|/, $_;
$master{join "|", @mst[1,4,5]} = [ @mst[0,2,6] ];
}
close $mst;

open my $dta, "<", $datafile or die "$datafile: $!\n";
while (<$dta>) {
chomp; # drop the newline so field 7 does not carry it into the output
my @dta = split /\|/, $_;
my $cmp = join "|", @dta[1,4,5];
exists $master{$cmp} or next;

my $id = $master{$cmp}[0];
print $out "$id|$dta[2]|$dta[6]\n";
}
close $dta;
close $out;
-->8---

You could gain even more speed if you told us the format of the fields, so that the join "|" calls can be changed to pack.

Enjoy, Have FUN! H.Merijn
H.Merijn Brand (procura
Honored Contributor

Re: Script help. Perl perhaps ?

Peter, I like your awk solution, which is pretty close to what I do in perl, but yours would be safer if you included the separator in the key.

As we were not told what the data looks like, your script would map both (12, 345, 6789) and (1, 23456, 789) to the same key.
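
To make the collision concrete, here is a small sketch using the two example triples above. The helper name `make_key` is invented for this illustration; it only joins three values with a given separator, the way the awk key expression does.

```shell
#!/bin/sh
# make_key SEP A B C: join three values with SEP, as an awk key expression would.
# "make_key" is a hypothetical helper used only for this demonstration.
make_key() {
    awk -v sep="$1" -v a="$2" -v b="$3" -v c="$4" 'BEGIN { print a sep b sep c }'
}

plain_1=$(make_key "" 12 345 6789)
plain_2=$(make_key "" 1 23456 789)
sep_1=$(make_key "|" 12 345 6789)
sep_2=$(make_key "|" 1 23456 789)

# Without a separator both triples collapse to the same string.
[ "$plain_1" = "$plain_2" ] && echo "plain keys collide: $plain_1"
# With the separator in the key they stay distinct.
[ "$sep_1" != "$sep_2" ] && echo "separated keys differ: $sep_1 vs $sep_2"
```

If every key field has a fixed width, plain concatenation cannot collide, but including the separator costs next to nothing and stays correct if the format ever changes.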

Enjoy, Have FUN! H.Merijn
Luke Morgan
Frequent Advisor

Re: Script help. Perl perhaps ?

Thank you both very much for your suggestions.
I have implemented Peter's script and the difference in speed is astonishing!

FYI, the format of the data is this :
$2 is a four digit number
$5 is a two digit number
$6 is a single character

Thanks again

Luke
H.Merijn Brand (procura
Honored Contributor

Re: Script help. Perl perhaps ?

Please bear in mind that last remark from me regarding the generated keys in the awk solution!

--8<--- perl with pack
#!/usr/bin/perl

use strict;
use warnings;

my $masterfile = "/tmp/masterfile";
my $datafile = "/tmp/datafile";

open my $out, ">", "luke.out" or die "luke.out: $!\n";

my %master;
open my $mst, "<", $masterfile or die "$masterfile: $!\n";
while (<$mst>) {
my @mst = split /\|/, $_;
$master{pack "ssA", @mst[1,4,5]} = $mst[0];
}
close $mst;

open my $dta, "<", $datafile or die "$datafile: $!\n";
while (<$dta>) {
chomp; # drop the newline so field 7 does not carry it into the output
my @dta = split /\|/, $_;
my $cmp = pack "ssA", @dta[1,4,5];
exists $master{$cmp} or next;

my $id = $master{$cmp};
print $out "$id|$dta[2]|$dta[6]\n";
}
close $dta;
close $out;
-->8---

Enjoy, Have FUN! H.Merijn
Peter Nikitka
Honored Contributor

Re: Script help. Perl perhaps ?

Hi,

Procura is totally correct in his remark. To include this in my awk solution, simply add the field separator to the key:

awk -F'|' 'BEGIN { OFS="|"
while ((getline < "/tmp/masterfile") == 1) id[$2$FS$5$FS$6] = $1
close ("/tmp/masterfile") }
{ line=$2$FS$5$FS$6; if(id[line]) print (id[line],$3,$7) }' /tmp/datafile >luke.out

mfG Peter
Peter Nikitka
Honored Contributor

Re: Script help. Perl perhaps ?

Oops,

I do not normally complain about additional dollars :-).
But here it works better to use

id[$2FS$5FS$6] = $1
instead of
id[$2$FS$5$FS$6] = $1

The same applies to the line= assignment in the main block.
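
For anyone wondering why the extra dollars matter: in awk, `$FS` is a field reference whose index is the value of FS. With FS set to "|" that value converts to the number 0, so `$FS` means `$0`, the whole input line. The sketch below shows the difference on one made-up sample line; the helper names `good_key` and `bad_key` exist only for this illustration.

```shell
#!/bin/sh
# good_key LINE: key built with FS, the separator string itself.
good_key() { printf '%s\n' "$1" | awk -F'|' '{ print $2 FS $5 FS $6 }'; }
# bad_key LINE: key built with $FS, which awk evaluates as $0 (the whole line).
bad_key()  { printf '%s\n' "$1" | awk -F'|' '{ print $2 $FS $5 $FS $6 }'; }

line='7|1234|x|y|56|A|z'   # made-up sample record
good_key "$line"           # the three key fields joined by "|"
bad_key "$line"            # the key fields with the whole line pasted in twice
```

Because the `$FS` version embeds the entire line in the key, a master line and a data line would only match when the two lines are identical, so the lookup would usually find nothing.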

mfG Peter