Using perl to compare large lists ...

A. Daniel King_1 — Tue, 07 Feb 2006 08:59:05 GMT

Hi, folks.

Against large files, I get "Out of memory!" using the code below. The basic idea is, I want the items in list A which are not in list B for very large lists. Are there options outside of breaking down the files into smaller chunks?

Thanks, all.

#!/usr/bin/perl -w

open (FILE0,$ARGV[0]) || die "Problem with $ARGV[0]\n";
open (FILE1,$ARGV[1]) || die "Problem with $ARGV[1]\n";

while ( defined($item0=<FILE0>) )
{
    seek FILE1, 0, 0;
    @x=grep { /$item0/ } <FILE1>; # Reads all of FILE1 into a list on every pass.
    if ( $#x == -1 ) { print $item0 } # For items in argv0, but not in argv1.
}

close (FILE1);
close (FILE0);

Re: Using perl to compare large lists ...

H.Merijn Brand (procura — Tue, 07 Feb 2006 09:10:26 GMT

Open the smallest file first, put it in a hash, and use that when traversing the large file

If the small file is still too large, use a tied hash

#!/usr/bin/perl

use strict;
use warnings;
use DB_File;

my $f1 = shift or die "usage: $0 file1 file ...\n";
open my $f, "<", $f1 or die "$f1: $!\n";
# Keep the keys in an on-disk hash instead of in memory
tie my %f1, "DB_File", "/tmp/keys$$" or die "/tmp/keys$$: $!\n";
while (<$f>) {
    $f1{$_}++;
}
close $f;
while (<>) { # The rest of the files
    if (exists $f1{$_}) {
        # This line is also in the first file
    }
    else {
        # This is not
    }
}
untie %f1;
unlink "/tmp/keys$$";
-->8---
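
If the smaller list does fit in memory, the same idea works with a plain hash and no tie. The following is only a minimal sketch under that assumption; the sub name print_only_in_a and its arguments are made up for illustration and do not come from the posts in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: print to $out every line of $file_a
# that does not occur in $file_b.
sub print_only_in_a {
    my ($file_a, $file_b, $out) = @_;

    # Load list B into a hash; only the keys matter.
    my %in_b;
    open my $fh_b, '<', $file_b or die "$file_b: $!\n";
    while (my $line = <$fh_b>) {
        chomp $line;
        $in_b{$line} = 1;
    }
    close $fh_b;

    # Stream list A one line at a time, so A is never held in memory.
    open my $fh_a, '<', $file_a or die "$file_a: $!\n";
    while (my $line = <$fh_a>) {
        chomp $line;
        print {$out} "$line\n" unless exists $in_b{$line};
    }
    close $fh_a;
}

# Usage: perl only_in_a.pl list_a list_b
print_only_in_a($ARGV[0], $ARGV[1], \*STDOUT) if @ARGV == 2;
```

Only list B is kept in memory here, so memory use grows with the size of B and not with the size of A.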

Enjoy, Have FUN! H.Merijn

Re: Using perl to compare large lists ...

James R. Ferguson — Tue, 07 Feb 2006 09:16:04 GMT

Hi Daniel:

Perhaps, instead of building a list whose elements are the full record, you could build a hash of the keys to the records you want to report. Then, walk the hash and print the items of interest using the keys to seek the full record.
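
One way to read this suggestion is sketched below: remember only each record's key and its byte offset, then seek back for the full record when reporting. The record layout (key in the first whitespace-delimited field) and the sub names key_of and report_missing are assumptions made for the example, not something stated in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed record layout: the key is the first whitespace-delimited field.
sub key_of {
    my ($line) = @_;
    my ($key) = split ' ', $line;
    return defined $key ? $key : '';
}

# Hypothetical helper: print the records of $file_a whose key
# does not appear in $file_b.
sub report_missing {
    my ($file_a, $file_b, $out) = @_;

    # Pass 1: remember where each record of A starts, indexed by key.
    # If a key repeats, the last occurrence wins.
    my %offset;
    open my $fh_a, '<', $file_a or die "$file_a: $!\n";
    my $pos = tell $fh_a;
    while (my $line = <$fh_a>) {
        $offset{ key_of($line) } = $pos;
        $pos = tell $fh_a;
    }

    # Pass 2: drop every key that also appears in B.
    open my $fh_b, '<', $file_b or die "$file_b: $!\n";
    while (my $line = <$fh_b>) {
        delete $offset{ key_of($line) };
    }
    close $fh_b;

    # Pass 3: seek back into A and print the surviving records in file order.
    for my $start (sort { $a <=> $b } values %offset) {
        seek $fh_a, $start, 0 or die "seek failed: $!\n";
        my $record = <$fh_a>;
        print {$out} $record;
    }
    close $fh_a;
}

# Usage: perl report_missing.pl list_a list_b
report_missing($ARGV[0], $ARGV[1], \*STDOUT) if @ARGV == 2;
```

This still keeps one hash entry per record of A, so it only helps when the keys are much shorter than the records.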

Regards!

...JRF...

Re: Using perl to compare large lists ...

Peter Godron — Tue, 07 Feb 2006 09:20:10 GMT

Hi,
not a perl reply, but did you have a look at "comm" ?

Re: Using perl to compare large lists ...

H.Merijn Brand (procura — Tue, 07 Feb 2006 09:25:31 GMT

warning: though comm (and join) are great for this, they need sorted input files!
That is often a limitation in real life problems

Enjoy, Have FUN! H.Merijn

Re: Using perl to compare large lists ...

Arunvijai_4 — Tue, 07 Feb 2006 11:35:23 GMT

Hello,

Check this link, it could be helpful

http://www.codecomments.com/archive234-2005-5-497414.html

-Arun

Re: Using perl to compare large lists ...

Arturo Galbiati — Wed, 08 Feb 2006 04:30:57 GMT

Hi,
why not use:
grep -vf small_file big_file

Unlike comm, grep does not need the files to be sorted.

HTH,
Art

Re: Using perl to compare large lists ...

A. Daniel King_1 — Mon, 13 Feb 2006 12:40:12 GMT

A good attempt ...

grep: not enough memory

Re: Using perl to compare large lists ...

H.Merijn Brand (procura — Mon, 13 Feb 2006 13:04:24 GMT

Did you try my first suggestion?

Enjoy, Have FUN! H.Merijn

Re: Using perl to compare large lists ...

Hein van den Heuvel — Mon, 13 Feb 2006 13:26:48 GMT

Just thinking aloud...

First, Did you try Merijn's first suggestion? I see no 'points' to indicate a usefulness for the answer.

Secondly, is this a once-in-a-lifetime-never-mind-the-processing-time cleanup, or a job to be scheduled frequently?

Third, Really large lists don't just happen.
They have some meaning, often order, often some key, some 'objectness' or record size.
If you describe that to us, then we may be able to help better.

For example, you may be able to build an array of key values and seek addresses such as not to store the whole 'records/blobs' and then later go back.

If they are sorted, you can do a classic merge-style compare: some from the left, some from the right. See below.

How much difference do you expect, and how much of it is still useful to report?

How about not breaking down the actual list in chunks, but just the processing?
Some algorithm where you read in N records (100,000?) from one file, then start reading the other file and deleting matches until fewer than M records (10,000?) are left.
Now read N-M more from the first file, and continue reading the second file from where you were, or restart the second file from the start, until you are again down below M records. Repeat until at the end of the first file.
When at the end, make one more sweep through the second file. Admittedly this is a lot of handwaving, but something along those lines, with the details heavily depending on what you know about the files.
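
A much simplified variant of this idea can be sketched as follows. It does not implement the N/M refill scheme described above: it simply takes the first file in fixed-size chunks and rescans the whole second file for every chunk. The sub name chunked_diff and the chunk size parameter are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: print the lines of $file_a that are not in $file_b,
# holding at most $chunk_size lines of A in memory at any time.
sub chunked_diff {
    my ($file_a, $file_b, $chunk_size, $out) = @_;
    open my $fh_a, '<', $file_a or die "$file_a: $!\n";
    open my $fh_b, '<', $file_b or die "$file_b: $!\n";

    until (eof $fh_a) {
        # Read the next chunk of A, remembering the original order.
        my (%pending, @order);
        while (@order < $chunk_size and defined(my $line = <$fh_a>)) {
            chomp $line;
            push @order, $line;
            $pending{$line} = 1;
        }

        # Sweep all of B and delete every match from the chunk.
        seek $fh_b, 0, 0 or die "seek failed: $!\n";
        while (my $line = <$fh_b>) {
            chomp $line;
            delete $pending{$line};
        }

        # Whatever is left in the chunk was not found in B.
        for my $line (@order) {
            print {$out} "$line\n" if exists $pending{$line};
        }
    }
    close $fh_a;
    close $fh_b;
}

# Usage: perl chunked_diff.pl list_a list_b [chunk_size]
chunked_diff($ARGV[0], $ARGV[1], $ARGV[2] || 100_000, \*STDOUT) if @ARGV >= 2;
```

The price for the bounded memory is one full pass over the second file per chunk, so a larger chunk size means fewer passes.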

Below you'll find a script I made and used to walk two largish parameter value lists I needed to compare and report on. This is a case where I knew the params were sorted by a key. So I split the lines from f1 and f2 into their keys k1 and k2 and values v1 and v2. Then compare k1 and k2. If equal, compare values v1 and v2 and report. If k1 < k2 then read next f1; if k1 > k2 then read next f2.

The actual script below is probably not useful (unless you are comparing SAP benchmark r3.out files ;-), but the principle may become clearer reading it. (Then again, the added processing like 'known to be variable' substitutions and pretty-printing may confuse the concept beyond recognition :^)

Hope this helps some,
Hein.

#!/bin/perl
#
$f1 = $ARGV[0];
$f2 = $ARGV[1];
$ALL = $ARGV[2];
die "Must provide two R3.out files to compare" unless $f2;
open (F1, $f1) || die "Error open file 1: $f1";
open (F2, $f2) || die "Error open file 2: $f2";

# Find system ID and first parameter
while (<F1>) {
    $S1 = $1 if (/SAP System\s+(\w+)\s/);
    $I1 = $1 if (/^INSTANCE_NAME\s+$!$ (\w+)/);
    last if (/^Param/);
}

while (<F2>) {
    $S2 = $1 if (/SAP System\s+(\w+)\s/);
    $I2 = $1 if (/^INSTANCE_NAME\s+$!$ (\w+)/);
    last if (/^Param/);
}

$format = "%-30.30s %s %-20s %s %-20s\n";
print "\nColumn \"?\" legend: \"|\" = default, \"X\" = changed, \" \" = missing.\n\n";
printf $format, "Parameter", " ", "$S1 - $I1", " ", "$S2 - $I2";
printf $format, "------------------------------","?","--------------------",
    "?","--------------------";

# Walk both sorted files in step: file 1 drives, file 2 catches up.
while (<F1>) {
    $v1 = " ";
    if (/^(\S+).*( |$!$) (.{1,20})/) {
        $k1 = $1;
        $d1 = ($2 eq " ") ? "|" : "X";
        $v1 = $3;
        $v1 =~ s/$S1/{SID}/g;
        $v1 =~ s/$I1/{INST}/g;
    }
    # Advance file 2 until its key is no longer behind the key from file 1.
    while ($k2 lt $k1) {
        last unless ($_ = <F2>);
        $v2 = " ";
        if (/^(\S+).*( |$!$) (.{1,20})/) {
            $k2 = $1;
            $d2 = ($2 eq " ") ? "|" : "X";
            $v2 = $3;
            $v2 =~ s/$S2/{SID}/g;
            $v2 =~ s/$I2/{INST}/g;
        }
        if ($k2 lt $k1) {
            # Key exists only in file 2
            printf $format, $k2, " ", " ", $d2, $v2;
        }
    }
    if ($k1 eq $k2) {
        # Key in both files: report if asked for all, or if the values differ
        printf $format, $k1, $d1, $v1, $d2, $v2 if
            ($ALL || ($v1 ne $v2 && ($d1.$d2 ne "||") ) );
    } else {
        # Key exists only in file 1
        printf $format, $k1, $d1, $v1, " ", " ";
    }
}

Sample output
Column "?" legend: "|" = default, "X" = changed, " " = missing.

Parameter                        BE2 - D11              RAC - D02
------------------------------ ? -------------------- ? --------------------
DIR_ORAHOME                    | /oracle/{SID}        X /oracle/{SID}/901_64
ES/SHM_MAX_PRIV_SEGS           X 63                   | 2
ES/TABLE                       X SHM_SEGS             | UNIX_STD
SAPDBHOST                      X b009                 X sapsd

Re: Using perl to compare large lists ...

A. Daniel King_1 — Mon, 13 Feb 2006 14:26:05 GMT

I've not yet installed the module which will give me "tie" ... and will award points as I evaluate the answers ...

Re: Using perl to compare large lists ...

H.Merijn Brand (procura — Mon, 13 Feb 2006 16:47:01 GMT

"tie" is no `module'. It's a standard perl builtin mechanism.

"DB_File" is a module, but it comes with the CORE perl distribution, so whatever build you use, should support it.

# perldoc -f tie

Enjoy, Have FUN! H.Merijn

Re: Using perl to compare large lists ...

Steven M Evans — Wed, 22 Feb 2006 10:25:34 GMT

I work with Daniel and I think I cooked up an answer. I'm posting it for 2 reasons:

1) To make sure my logic is correct.

2) For posterity. I dredged many an archive for something similar with no luck.

I didn't see the 'tie' suggestion before doing this, so I'm not sure how my solution compares. The lists have 14 million entries each, so I wanted to go through each file as few times as possible. My code goes through each file once, but the inputs must be sorted. It steps through each file one line at a time and does a string comparison. If the lines are the same, the line is put in the 'in_both' file. If the line from file 0 is less than the line from file 1, it must not be in file 1, so it's put in the 'in_0_only' file, and vice versa. Run time was 3-7 minutes.

---(Preliminary stuff)
# Churn, baby, churn.
#---------------

# Read first lines.
# (FILE0 and FILE1 are the input handles opened in the preliminary part;
# the handle names are assumed here.)
unless ( defined($record0 = <FILE0>) ) { die "ERROR: $ARGV[0] is empty.\n"; }
unless ( defined($record1 = <FILE1>) ) { die "ERROR: $ARGV[1] is empty.\n"; }

while ( $eof0 == 0 && $eof1 == 0 )
{
    $need_next0 = 0 ;
    $need_next1 = 0 ;

    if ( $record0 eq $record1 )
    {
        # Same line in both files: advance both.
        print INBOTH $record0 ;
        $need_next0 = 1 ;
        $need_next1 = 1 ;
    } else {
        if ( $record0 lt $record1 )
        {
            # File 1 is already past this line, so it is only in file 0.
            print INFILE0 $record0 ;
            $need_next0 = 1;
        } else {
            print INFILE1 $record1 ;
            $need_next1 = 1;
        }
    }

    if ( $need_next0 == 1 )
    { unless ( defined($record0 = <FILE0>) ) { $eof0 = 1; } }
    if ( $need_next1 == 1 )
    { unless ( defined($record1 = <FILE1>) ) { $eof1 = 1; } }
}

# Cleanup the rest of the file.
#---------------
until ( $eof0 == 1 )
{
    print INFILE0 $record0 ;
    unless ( defined($record0 = <FILE0>) ) { $eof0 = 1; }
}

until ( $eof1 == 1 )
{
    print INFILE1 $record1 ;
    unless ( defined($record1 = <FILE1>) ) { $eof1 = 1; }
}
---
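
Since the merge above relies on both inputs being sorted in the same order that Perl's lt uses, a quick check before running it may save some head-scratching. The sketch below is only an illustration; the sub name first_unsorted_line is made up. Note that sort(1) under a non-C locale can order lines differently from lt, so sorting with LC_ALL=C is usually the safer match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the number of the first line that sorts
# before its predecessor under Perl's string comparison, or 0 if the
# whole file is in order.
sub first_unsorted_line {
    my ($file) = @_;
    open my $fh, '<', $file or die "$file: $!\n";
    my $prev;
    while (my $line = <$fh>) {
        # $. is the line number of the line just read from $fh.
        return $. if defined $prev and $line lt $prev;
        $prev = $line;
    }
    return 0;
}

# Usage: perl check_sorted.pl file ...
for my $file (@ARGV) {
    my $bad = first_unsorted_line($file);
    print $bad ? "$file: out of order at line $bad\n" : "$file: sorted\n";
}
```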

Re: Using perl to compare large lists ...

A. Daniel King_1 — Mon, 22 May 2006 12:54:02 GMT

It seems I need a good book on algorithms.

Thanks, Steve, and everyone else who gave a shot!