
Using perl to compare large lists ...

 
A. Daniel King_1
Super Advisor


Hi, folks.

Against large files, I get "Out of memory!" using the code below. The basic idea is, I want the items in list A which are not in list B for very large lists. Are there options outside of breaking down the files into smaller chunks?

Thanks, all.

#!/usr/bin/perl -w

open (FILE0,$ARGV[0]) || die "Problem with $ARGV[0]\n";
open (FILE1,$ARGV[1]) || die "Problem with $ARGV[1]\n";

while ( defined($item0 = <FILE0>) )
{
    seek FILE1, 0, 0;
    @x = grep { /$item0/ } <FILE1>;
    if ( $#x != 0 ) { print $item0 } # For items in argv0, but not in argv1.
}

close (FILE1);
close (FILE0);
Command-Line Junkie
H.Merijn Brand (procura
Honored Contributor

Re: Using perl to compare large lists ...

Open the smallest file first, put it in a hash, and use that when traversing the large file

If the small file is still too large, use a tied hash

#!/usr/bin/perl

use strict;
use warnings;
use DB_File;    # provides the tied on-disk hash

my $f1 = shift or die "usage: $0 file1 file ...\n";
open my $f, "< $f1" or die "$f1: $!\n";
tie my %f1, "DB_File", "/tmp/keys$$" or die "/tmp/keys$$: $!\n";
while (<$f>) {
    $f1{$_}++;
}
while (<>) { # The rest of the files
    if (exists $f1{$_}) {
        # This line is also in the first file
    }
    else {
        # This is not
    }
}
untie %f1;
unlink "/tmp/keys$$";
-->8---

Enjoy, Have FUN! H.Merijn
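For the original question (items in list A that are not in list B), the same hash idea can be turned around: load list B into the hash and stream list A past it. What follows is only a minimal sketch, assuming exact whole-line matches and that the lines of list B fit in memory. The sub name print_only_in_a is made up for this example. If list B does not fit, the plain hash could be swapped for the tied DB_File hash shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: print every line of $file_a that does not occur
# in $file_b. Assumes whole-line matches and that B fits in memory.
sub print_only_in_a {
    my ($file_a, $file_b, $out) = @_;

    # Load list B into a hash so each lookup is a single hash probe.
    my %in_b;
    open my $fh_b, '<', $file_b or die "$file_b: $!\n";
    while (my $line = <$fh_b>) {
        chomp $line;
        $in_b{$line} = 1;
    }
    close $fh_b;

    # Stream list A one line at a time; nothing from A is kept in memory.
    open my $fh_a, '<', $file_a or die "$file_a: $!\n";
    while (my $line = <$fh_a>) {
        chomp $line;
        print {$out} "$line\n" unless exists $in_b{$line};
    }
    close $fh_a;
    return;
}

# Run as: script.pl list_a list_b
print_only_in_a(@ARGV, \*STDOUT) if @ARGV == 2;
```

Unlike the grep-based loop in the question, this reads each file exactly once and never interprets a line as a regular expression.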
James R. Ferguson
Acclaimed Contributor

Re: Using perl to compare large lists ...

Hi Daniel:

Perhaps, instead of building a list whose elements are the full record, you could build a hash of the keys to the records you want to report. Then, walk the hash and print the items of interest using the keys to seek the full record.

Regards!

...JRF...
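A rough sketch of this key-plus-seek idea, under assumptions that are not in the thread: the key of a record is taken to be its first whitespace-delimited field, keys are unique within list A, and the sub names key_of and print_records_only_in_a are invented for illustration. Only the keys and byte offsets of list A are held in memory, so it helps only when records are much longer than their keys.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption for illustration: the key is the first whitespace-delimited field.
sub key_of {
    my ($line) = @_;
    my ($key) = $line =~ /^(\S+)/;
    return defined $key ? $key : '';
}

# Hypothetical helper: print the full records of A whose key is absent from B.
sub print_records_only_in_a {
    my ($file_a, $file_b, $out) = @_;

    # Pass 1: remember key => byte offset for each record of A.
    my %offset_of;
    open my $fh_a, '<', $file_a or die "$file_a: $!\n";
    while (1) {
        my $pos  = tell $fh_a;
        my $line = <$fh_a>;
        last unless defined $line;
        $offset_of{ key_of($line) } = $pos;   # a repeated key keeps the last offset
    }

    # Pass 2: drop every key that also appears in B.
    open my $fh_b, '<', $file_b or die "$file_b: $!\n";
    while (my $line = <$fh_b>) {
        delete $offset_of{ key_of($line) };
    }
    close $fh_b;

    # Pass 3: seek back into A and print the surviving records in file order.
    for my $pos (sort { $a <=> $b } values %offset_of) {
        seek $fh_a, $pos, 0 or die "seek $file_a: $!\n";
        my $line = <$fh_a>;
        print {$out} $line;
    }
    close $fh_a;
    return;
}

# Run as: script.pl list_a list_b
print_records_only_in_a(@ARGV, \*STDOUT) if @ARGV == 2;
```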
Peter Godron
Honored Contributor

Re: Using perl to compare large lists ...

Hi,
not a perl reply, but did you have a look at "comm" ?
H.Merijn Brand (procura
Honored Contributor
Solution

Re: Using perl to compare large lists ...

warning: though comm (and join) are great for this, they need sorted input files!
That is often a limitation in real life problems

Enjoy, Have FUN! H.Merijn
Arunvijai_4
Honored Contributor

Re: Using perl to compare large lists ...

Hello,

Check this link, it could be helpful

http://www.codecomments.com/archive234-2005-5-497414.html

-Arun
"A ship in the harbor is safe, but that is not what ships are built for"
Arturo Galbiati
Esteemed Contributor

Re: Using perl to compare large lists ...

Hi,
why not use:
grep -vf small_file big_file

of course file have to be sorted.

HTH,
Art
A. Daniel King_1
Super Advisor

Re: Using perl to compare large lists ...

A good attempt ...

grep: not enough memory
Command-Line Junkie
H.Merijn Brand (procura
Honored Contributor

Re: Using perl to compare large lists ...

Did you try my first suggestion?

Enjoy, Have FUN! H.Merijn
Hein van den Heuvel
Honored Contributor

Re: Using perl to compare large lists ...

Just thinking aloud...

First, Did you try Merijn's first suggestion? I see no 'points' to indicate a usefulness for the answer.

Secondly, is this a once-in-a-lifetime-never-mind-the-processing-time cleanup, or a job to be scheduled frequently?

Third, Really large lists don't just happen.
They have some meaning, often order, often some key, some 'objectness' or record size.
If you describe that to us, then we may be able to help better.


For example, you may be able to build an array of key values and seek addresses such as not to store the whole 'records/blobs' and then later go back.

If they are sorted you can do a classic some from the left, some from the right compare. See below.

How much difference do you expect, is still useful to report?

How about not breaking down the actual list in chunks, but just the processing?
Some algorithm where you read in N records (100,000?) from one file, then start reading the other file and deleting matches until fewer than M (10,000?) remain.
Now read N-M more from the first file, and continue reading the second file from where you were, or restart the second file from the start, until you are again down below M. Repeat until the end of the first file.
When at the end, make one more sweep of the second file. Admittedly this is a lot of handwaving, but something along those lines, the details heavily depending on what you know about the files.
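A minimal sketch of the chunked-processing idea, deliberately simplified: instead of topping the buffer up from N-M back to N, it loads a fresh chunk of at most N lines from the first file, sweeps the whole second file deleting matches, prints what is left, and repeats. The sub name print_only_in_a_chunked and the chunk size argument are assumptions made for this example, and exact whole-line matches are assumed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: print lines of A not found in B while holding at most
# $chunk_size lines of A in memory. B is re-read once per chunk.
sub print_only_in_a_chunked {
    my ($file_a, $file_b, $chunk_size, $out) = @_;

    open my $fh_a, '<', $file_a or die "$file_a: $!\n";
    open my $fh_b, '<', $file_b or die "$file_b: $!\n";

    until (eof $fh_a) {
        # Load the next chunk of A: line => position within the chunk.
        my %pending;
        my $n = 0;
        while ($n < $chunk_size) {
            my $line = <$fh_a>;
            last unless defined $line;
            chomp $line;
            $pending{$line} = $n unless exists $pending{$line};
            $n++;
        }

        # Sweep B from the start, deleting every match from the chunk.
        seek $fh_b, 0, 0 or die "seek $file_b: $!\n";
        while (my $line = <$fh_b>) {
            chomp $line;
            delete $pending{$line};
            last unless %pending;    # nothing left to check in this chunk
        }

        # Whatever survived is in A but not in B; print in original order.
        # A line repeated within one chunk is printed only once.
        print {$out} "$_\n"
            for sort { $pending{$a} <=> $pending{$b} } keys %pending;
    }

    close $fh_a;
    close $fh_b;
    return;
}

# Run as: script.pl list_a list_b [chunk_size]
print_only_in_a_chunked(@ARGV[0, 1], $ARGV[2] || 100_000, \*STDOUT)
    if @ARGV >= 2;
```

The trade-off is time for memory: the second file is read once per chunk, so a larger chunk size means fewer passes.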

Below you'll find a script I made and used to walk two largish parameter value lists I needed to compare and report on. This is a case where I knew the params were sorted by a key. So I split the lines from f1 and f2 into their keys k1 and k2 and values v1 and v2. Then compare k1 and k2. If equal, compare values v1 and v2 and report. If k1 < k2 then read next f1; if k1 > k2 then read next f2.

The actual script below is probably not useful (unless you are comparing SAP benchmark r3.out files ;-), but the principle may become clearer reading it. (Then again, the added processing like 'known to be variable' substitutions and pretty-printing may confuse the concept beyond recognition :^)

Hope this helps some,
Hein.


#!/bin/perl
#
$f1  = $ARGV[0];
$f2  = $ARGV[1];
$ALL = $ARGV[2];
die "Must provide two R3.out files to compare" unless $f2;
open (F1, $f1) || die "Error open file 1: $f1";
open (F2, $f2) || die "Error open file 2: $f2";

# Find system ID and first parameter
while (<F1>) {
    $S1 = $1 if (/SAP System\s+(\w+)\s/);
    $I1 = $1 if (/^INSTANCE_NAME\s+\(!\) (\w+)/);
    last if (/^Param/);
}

while (<F2>) {
    $S2 = $1 if (/SAP System\s+(\w+)\s/);
    $I2 = $1 if (/^INSTANCE_NAME\s+\(!\) (\w+)/);
    last if (/^Param/);
}

$format = "%-30.30s %s %-20s %s %-20s\n";
print "\nColumn \"?\" legend: \"|\" = default, \"X\" = changed, \" \" = missing.\n\n";
printf $format, "Parameter", " ", "$S1 - $I1", " ", "$S2 - $I2";
printf $format, "------------------------------","?","--------------------",
    "?","--------------------";

# Walk both sorted parameter lists in step: the outer loop advances file 1,
# the inner loop advances file 2 until its key catches up.
while (<F1>) {
    $v1 = " ";
    if (/^(\S+).*( |\(!\)) (.{1,20})/) {
        $k1 = $1;
        $d1 = ($2 eq " ") ? "|" : "X";
        $v1 = $3;
        $v1 =~ s/$S1/{SID}/g;
        $v1 =~ s/$I1/{INST}/g;
    }
    while ($k2 lt $k1) {
        last unless ($_ = <F2>);
        $v2 = " ";
        if (/^(\S+).*( |\(!\)) (.{1,20})/) {
            $k2 = $1;
            $d2 = ($2 eq " ") ? "|" : "X";
            $v2 = $3;
            $v2 =~ s/$S2/{SID}/g;
            $v2 =~ s/$I2/{INST}/g;
        }
        if ($k2 lt $k1) {
            # Key only present in file 2
            printf $format, $k2, " ", " ", $d2, $v2;
        }
    }
    if ($k1 eq $k2) {
        printf $format, $k1, $d1, $v1, $d2, $v2 if
            ($ALL || ($v1 ne $v2 && ($d1.$d2 ne "||") ) );
    } else {
        # Key only present in file 1
        printf $format, $k1, $d1, $v1, " ", " ";
    }
}


Sample output
Column "?" legend: "|" = default, "X" = changed, " " = missing.

Parameter                        BE2 - D11              RAC - D02
------------------------------ ? -------------------- ? --------------------
DIR_ORAHOME                    | /oracle/{SID}        X /oracle/{SID}/901_64
ES/SHM_MAX_PRIV_SEGS           X 63                   | 2
ES/TABLE                       X SHM_SEGS             | UNIX_STD
SAPDBHOST                      X b009                 X sapsd
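Stripped of the SAP-specific parsing, the sorted two-file walk can be reduced to the original question (lines in list A that are not in list B). This is only a sketch under stated assumptions: both files are already sorted in the byte order that Perl's lt uses (for example with LC_ALL=C sort), the whole line is the key, and the sub name print_only_in_a_sorted is invented for this example. Memory use is constant, since only one line from each file is held at a time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: both inputs must already be sorted in byte order.
# Prints every line of A that has no equal line in B.
sub print_only_in_a_sorted {
    my ($file_a, $file_b, $out) = @_;

    open my $fh_a, '<', $file_a or die "$file_a: $!\n";
    open my $fh_b, '<', $file_b or die "$file_b: $!\n";

    # Current key from B; undef once B is exhausted.
    my $k2 = <$fh_b>;
    chomp $k2 if defined $k2;

    while (defined(my $k1 = <$fh_a>)) {
        chomp $k1;

        # k2 < k1: B is behind, so read the next line of B.
        while (defined $k2 && $k2 lt $k1) {
            $k2 = <$fh_b>;
            chomp $k2 if defined $k2;
        }

        # k1 == k2: present in both, skip. Otherwise k1 < k2 (or B is
        # exhausted), so k1 cannot appear later in B: report it.
        print {$out} "$k1\n" unless defined $k2 && $k2 eq $k1;
    }

    close $fh_a;
    close $fh_b;
    return;
}

# Run as: script.pl sorted_list_a sorted_list_b
print_only_in_a_sorted(@ARGV, \*STDOUT) if @ARGV == 2;
```

This is essentially what comm -23 does, with the same sorted-input requirement mentioned earlier in the thread.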