Re: Script help. Perl perhaps ?

Luke Morgan · 06-25-2006

Hi.
I have written the script below to do a simple lookup, but now I need to run it on a much larger data file and it will take an age.
I suspect perl is the way to go to make it faster, but I don't know any perl :(

Could someone please help me translate this script, to save my server days of processing?
Thanks in advance.

The script reads each line of a data file, compares certain fields with the same fields from a master file, and outputs an id value from the master file along with certain fields from the data file.

masterfile=/tmp/masterfile
datafile=/tmp/datafile

for b in `cat $datafile`
do
    compare=`echo $b | awk 'BEGIN{FS="|"}{data = $2$5$6;print data}END{}'`

    for a in `cat $masterfile`
    do
        mastercompare=`echo $a | awk 'BEGIN{FS="|"}{line = $2$5$6;print line}END{}'`
        if [ $mastercompare = $compare ]
        then
            id=`echo $a | awk 'BEGIN{FS="|"}{print $1}END{}'`
            output=`echo $b | awk 'BEGIN{FS="|";OFS="|"}{print $3,$7}END{}'`
            echo $id"|"$output >> luke.out
        fi
    done
done

Peter Nikitka · 06-25-2006

Hi,

my suggestion reads the masterfile and the datafile only once each, and stores the master ids in an array keyed on the compared fields (untested). This should be MUCH faster:

awk -F'|' 'BEGIN { OFS="|";
    while ((getline < "/tmp/masterfile") == 1) id[$2$5$6] = $1 }
{ line=$2$5$6; if (line in id) print (id[line],$3,$7) }' /tmp/datafile >luke.out
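
To see the array lookup in action, here is a small self-contained sketch. The sample rows, the temporary directory and the `master` variable are made up for illustration, since the real data was not posted; the key is built the same way as in the one-liner above.

```shell
#!/bin/sh
# Hypothetical sample data: 7 pipe-separated fields per line,
# with the id in field 1 and the compared values in fields 2, 5 and 6.
tmp=$(mktemp -d)
cat > "$tmp/masterfile" <<'EOF'
ID1|1234|x|x|56|A|x
ID2|4321|x|x|65|B|x
EOF
cat > "$tmp/datafile" <<'EOF'
r1|1234|foo|x|56|A|bar
r2|9999|nope|x|00|Z|nope
r3|4321|baz|x|65|B|qux
EOF

# BEGIN loads the master file into an array keyed on fields 2, 5 and 6.
# The main rule then streams the data file once and prints the stored id
# plus fields 3 and 7 for every line whose key is present in the array.
result=$(awk -F'|' -v master="$tmp/masterfile" 'BEGIN { OFS = "|"
    while ((getline < master) > 0) id[$2 $5 $6] = $1
    close(master) }
{ key = $2 $5 $6; if (key in id) print id[key], $3, $7 }' "$tmp/datafile")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Each file is read exactly once, so the work grows with the sum of the two file sizes rather than with their product, as it does in the nested shell loops.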

mfG Peter

The Universe is a pretty big place, it's bigger than anything anyone has ever dreamed of before. So if it's just us, seems like an awful waste of space, right? Jodie Foster in "Contact"

Peter Nikitka · 06-25-2006

Hi,

to be clean, close the masterfile after reading:

awk -F'|' 'BEGIN { OFS="|"
    while ((getline < "/tmp/masterfile") == 1) id[$2$5$6] = $1
    close("/tmp/masterfile") }
{ line=$2$5$6; if (line in id) print (id[line],$3,$7) }' /tmp/datafile >luke.out

mfG Peter

H.Merijn Brand (procura) · 06-25-2006

Yes, in this case perl would be much faster.

--8<---
#!/usr/bin/perl

use strict;
use warnings;

my $masterfile = "/tmp/masterfile";
my $datafile   = "/tmp/datafile";

open my $out, ">", "luke.out" or die "luke.out: $!\n";

# Index the master file by fields 2, 5 and 6, with the separator in the key
my %master;
open my $mst, "<", $masterfile or die "$masterfile: $!\n";
while (<$mst>) {
    chomp;                              # keep the newline out of the last field
    my @mst = split /\|/, $_, -1;       # -1 keeps trailing empty fields
    $master{join "|", @mst[1,4,5]} = [ @mst[0,2,6] ];
}
close $mst;

# Read the data file once; print the id plus fields 3 and 7 for each match
open my $dta, "<", $datafile or die "$datafile: $!\n";
while (<$dta>) {
    chomp;
    my @dta = split /\|/, $_, -1;
    my $cmp = join "|", @dta[1,4,5];
    exists $master{$cmp} or next;

    my $id = $master{$cmp}[0];
    print $out "$id|$dta[2]|$dta[6]\n";
}
close $dta;
close $out;
-->8---

You could gain even more speed if you told us the format of the fields, so that the join "|" calls could be changed to pack.

Enjoy, Have FUN! H.Merijn

H.Merijn Brand (procura) · 06-25-2006

Peter, I like your awk solution, which is pretty close to what I do in perl, but yours would be safer if you included the separator in the key.

As we were not told what the data looks like, your script would map both (12, 345, 6789) and (1, 23456, 789) to the same key.
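
The collision can be reproduced with a short sketch. The two input rows below are invented so that fields 2, 5 and 6 hold exactly the two triples from the remark; everything else in them is filler.

```shell
#!/bin/sh
# Two hypothetical rows whose fields 2, 5 and 6 are
# (12, 345, 6789) and (1, 23456, 789).
rows() {
    printf '%s\n' 'a|12|x|x|345|6789' 'b|1|x|x|23456|789'
}

# Count distinct keys when the fields are simply concatenated ...
plain=$(rows | awk -F'|' '{ seen[$2 $5 $6]++ }
    END { n = 0; for (k in seen) n++; print n }')

# ... and when the field separator is kept between them.
joined=$(rows | awk -F'|' '{ seen[$2 FS $5 FS $6]++ }
    END { n = 0; for (k in seen) n++; print n }')

echo "distinct keys without separator: $plain"
echo "distinct keys with separator:    $joined"
```

Without the separator both rows collapse to the key 123456789, so only one distinct key is counted; with it, the keys stay apart as 12|345|6789 and 1|23456|789.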

Enjoy, Have FUN! H.Merijn

Luke Morgan · 06-25-2006

Thank you both very much for your suggestions.
I have implemented Peter's script and the difference in speed is astonishing!

FYI, the format of the data is this :
$2 is a four digit number
$5 is a two digit number
$6 is a single character
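
With fixed-width fields like these, plain concatenation cannot produce ambiguous keys, but only as long as every line really follows the format. A quick check along these lines could confirm that before relying on it. This is only a sketch: the two sample rows are invented, and the patterns assume that "number" means decimal digits only.

```shell
#!/bin/sh
# Report every line whose fields 2, 5 and 6 do not match the stated format:
# four digits, two digits, and exactly one character.
# The second sample row has a three-digit field 2 and should be flagged.
bad=$(printf '%s\n' 'r1|1234|foo|x|56|A|bar' 'r2|123|foo|x|56|A|bar' |
    awk -F'|' '$2 !~ /^[0-9][0-9][0-9][0-9]$/ ||
               $5 !~ /^[0-9][0-9]$/ ||
               length($6) != 1 { print NR ": " $0 }')

printf '%s\n' "$bad"
```

To run the same check on the real files, the printf pipeline would be replaced by the file name as an argument to awk.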

Thanks again

Luke

H.Merijn Brand (procura) · 06-25-2006

Please bear in mind my last remark regarding the generated keys in the awk solution!

--8<--- perl with pack
#!/usr/bin/perl

use strict;
use warnings;

my $masterfile = "/tmp/masterfile";
my $datafile   = "/tmp/datafile";

open my $out, ">", "luke.out" or die "luke.out: $!\n";

# Index the master file by a packed key: two 16-bit integers plus one character
my %master;
open my $mst, "<", $masterfile or die "$masterfile: $!\n";
while (<$mst>) {
    chomp;                              # keep the newline out of the last field
    my @mst = split /\|/, $_, -1;       # -1 keeps trailing empty fields
    $master{pack "ssA", @mst[1,4,5]} = $mst[0];
}
close $mst;

# Read the data file once; print the id plus fields 3 and 7 for each match
open my $dta, "<", $datafile or die "$datafile: $!\n";
while (<$dta>) {
    chomp;
    my @dta = split /\|/, $_, -1;
    my $cmp = pack "ssA", @dta[1,4,5];
    exists $master{$cmp} or next;

    my $id = $master{$cmp};
    print $out "$id|$dta[2]|$dta[6]\n";
}
close $dta;
close $out;
-->8---

Enjoy, Have FUN! H.Merijn

Peter Nikitka · 06-25-2006

Hi,

Procura is totally correct in his remark. To include this in my awk solution, simply add the field separator to the key:

awk -F'|' 'BEGIN { OFS="|"
while ((getline < "/tmp/masterfile") == 1) id[$2$FS$5$FS$6] = $1
close ("/tmp/masterfile") }
{ line=$2$FS$5$FS$6; if(id[line]) print (id[line],$3,$7) }' /tmp/datafile >luke.out

mfG Peter

Peter Nikitka · 06-26-2006

Oops,

I do not normally complain about additional dollars :-).
But here it works better to use

id[$2FS$5FS$6] = $1
instead of
id[$2$FS$5$FS$6] = $1

The same change applies to the line= assignment.
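
The difference between the two spellings can be checked with a short sketch. The sample line is invented; the point is only that awk reads `$FS` as "the field whose number is the value of FS", and since "|" converts to the number 0, that is `$0`, the whole input line.

```shell
#!/bin/sh
# One hypothetical input line with the compared values in fields 2 and 5.
line='ID1|1234|x|x|56|A|x'

# $FS is evaluated as $("|"), i.e. $0, so the whole line ends up in the key.
wrong=$(printf '%s\n' "$line" | awk -F'|' '{ print $2$FS$5 }')

# FS without the dollar sign is just the separator string itself.
right=$(printf '%s\n' "$line" | awk -F'|' '{ print $2 FS $5 }')

echo "with \$FS: $wrong"
echo "with FS:  $right"
```

With `$FS` in the key, a master line and a data line would only match if the complete lines were identical, so the lookup would silently find almost nothing.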

mfG Peter
