Using perl to compare large lists ...

A. Daniel King_1 · ‎02-07-2006

Hi, folks.

Against large files, I get "Out of memory!" using the code below. The basic idea is, I want the items in list A which are not in list B for very large lists. Are there options outside of breaking down the files into smaller chunks?

Thanks, all.

#!/usr/bin/perl -w

open (FILE0,$ARGV[0]) || die "Problem with $ARGV[0]\n";
open (FILE1,$ARGV[1]) || die "Problem with $ARGV[1]\n";

while ( defined($item0=<FILE0>) )
{
    seek FILE1, 0, 0;
    @x=grep { /$item0/ } <FILE1>;
    if ( $#x != 0 ) { print $item0 } # For items in argv0, but not in argv1.
}

close (FILE1);
close (FILE0);

Command-Line Junkie

H.Merijn Brand (procura) · 02-07-2006

Open the smaller file first, put it in a hash, and use that hash for lookups when traversing the larger file.
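
A minimal sketch of that idea for the "in A but not in B" case, assuming the lines of list B fit in memory as hash keys. The sub name only_in_a is made up for illustration, and whole lines are compared exactly after chomp:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Print to $out every line of $fh_a that does not occur in $fh_b.
# Only the lines of B are held in memory; A is streamed line by line.
# (only_in_a is a hypothetical name, not something from the thread.)
sub only_in_a {
    my ($fh_a, $fh_b, $out) = @_;
    my %seen;
    while (my $line = <$fh_b>) {
        chomp $line;
        $seen{$line} = 1;
    }
    while (my $line = <$fh_a>) {
        chomp $line;
        print {$out} "$line\n" unless exists $seen{$line};
    }
    return;
}

# Usage: script listA listB
if (@ARGV == 2) {
    open my $fh_a, '<', $ARGV[0] or die "$ARGV[0]: $!\n";
    open my $fh_b, '<', $ARGV[1] or die "$ARGV[1]: $!\n";
    only_in_a($fh_a, $fh_b, \*STDOUT);
}
```

Each file is read exactly once, so the run time grows with the sum of the two file sizes rather than with their product, as it does in the grep-inside-a-loop version.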

If the small file is still too large, use a tied hash

#!/usr/bin/perl

use strict;
use warnings;
use DB_File;

my $f1 = shift or die "usage: $0 file1 file ...\n";
open my $f, "<", $f1 or die "$f1: $!\n";
# Keep the keys in an on-disk DB_File hash instead of in memory
tie my %f1, "DB_File", "/tmp/keys$$" or die "Cannot tie DB file: $!\n";
while (<$f>) {
    $f1{$_}++;
}
close $f;
while (<>) { # The rest of the files
    if (exists $f1{$_}) {
        # This line is also in the first file
    }
    else {
        # This is not
    }
}
untie %f1;
unlink "/tmp/keys$$";

Enjoy, Have FUN! H.Merijn

James R. Ferguson · ‎02-07-2006

Hi Daniel:

Perhaps, instead of building a list whose elements are the full record, you could build a hash of the keys to the records you want to report. Then, walk the hash and print the items of interest using the keys to seek the full record.
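
A rough sketch of that approach, assuming (this is not stated in the thread) that the key is the first whitespace-delimited field of each record. The sub names index_offsets and fetch_record are hypothetical. Only keys and byte offsets are kept in memory; the full records stay on disk until they are needed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an index of key => byte offset where that record starts.
# ASSUMPTION: the key is the first whitespace-delimited field.
sub index_offsets {
    my ($fh) = @_;
    my %offset;
    my $pos = tell $fh;
    while (my $line = <$fh>) {
        my ($key) = split ' ', $line;
        $offset{$key} = $pos if defined $key;
        $pos = tell $fh;
    }
    return \%offset;
}

# Seek back to a stored offset and return the full record found there.
sub fetch_record {
    my ($fh, $pos) = @_;
    seek $fh, $pos, 0 or die "seek: $!\n";
    return scalar <$fh>;
}

# Usage: script listA listB   (prints records of A whose key is not in B)
if (@ARGV == 2) {
    open my $fh_a, '<', $ARGV[0] or die "$ARGV[0]: $!\n";
    open my $fh_b, '<', $ARGV[1] or die "$ARGV[1]: $!\n";
    my $index = index_offsets($fh_a);
    while (my $line = <$fh_b>) {
        my ($key) = split ' ', $line;
        delete $index->{$key} if defined $key;
    }
    # Whatever is left is in A only; print it in file order.
    for my $key (sort { $index->{$a} <=> $index->{$b} } keys %$index) {
        print fetch_record($fh_a, $index->{$key});
    }
}
```

This only pays off when the keys are much shorter than the records. If the lines are nothing but keys, the index is no smaller than the data.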

Regards!

...JRF...

Peter Godron · ‎02-07-2006

Hi,
not a perl reply, but did you have a look at "comm" ?

H.Merijn Brand (procura) · 02-07-2006

Warning: though comm (and join) are great for this, they need sorted input files!
That is often a limitation in real-life problems.

Enjoy, Have FUN! H.Merijn

Arunvijai_4 · ‎02-07-2006

Hello,

Check this link, it could be helpful

http://www.codecomments.com/archive234-2005-5-497414.html

-Arun

"A ship in the harbor is safe, but that is not what ships are built for"

Arturo Galbiati · ‎02-07-2006

Hi,
why not use:
grep -vf small_file big_file

Of course, the files have to be sorted.

HTH,
Art

A. Daniel King_1 · ‎02-13-2006

A good attempt ...

grep: not enough memory

Command-Line Junkie

H.Merijn Brand (procura) · 02-13-2006

Did you try my first suggestion?

Enjoy, Have FUN! H.Merijn

Hein van den Heuvel · ‎02-13-2006

Just thinking aloud...

First, Did you try Merijn's first suggestion? I see no 'points' to indicate a usefulness for the answer.

Secondly, is this a once-in-a-lifetime-never-mind-the-processing-time cleanup, or a job to be scheduled frequently?

Third, really large lists don't just happen.
They have some meaning, often order, often some key, some 'objectness' or record size.
If you describe that to us, then we may be able to help better.

For example, you may be able to build an array of key values and seek addresses, so as not to store the whole 'records/blobs', and then go back for them later.

If they are sorted you can do a classic some from the left, some from the right compare. See below.

How many differences do you expect, and how many are still useful to report?

How about not breaking down the actual list in chunks, but just the processing?
Some algorithm where you read in N records (100,000?) from one file, then start reading the other file and deleting matches until fewer than M (10,000?) are left.
Now read N-M more from the first file, and continue reading the second file from where you were, or restart the second file from the beginning, until you are again down below M. Repeat until the end of the first file.
When at the end, make one more sweep through the second file. Admittedly this is a lot of handwaving, but something along those lines, the details heavily depending on what you know about the files.
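
A simplified sketch of processing in chunks, not the exact N/M refill scheme described above: hold at most a fixed number of lines from the first file in memory, rescan the second file from the start for each chunk, and print whatever survives. The sub name only_in_a_chunked and its chunk-size parameter are assumptions for illustration, and the second file must be seekable (a regular file, not a pipe):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# For each chunk of up to $chunk_size distinct lines from A, rescan B and
# delete the matches; what remains of the chunk occurs in A only.
# (only_in_a_chunked is a hypothetical name, not something from the thread.)
sub only_in_a_chunked {
    my ($fh_a, $fh_b, $out, $chunk_size) = @_;
    while (1) {
        my %chunk;
        my $n = 0;
        while ($n < $chunk_size) {
            my $line = <$fh_a>;
            last unless defined $line;
            chomp $line;
            $chunk{$line} = $n++ unless exists $chunk{$line};
        }
        last unless %chunk;                  # A is exhausted

        seek $fh_b, 0, 0 or die "seek: $!\n";
        while (my $line = <$fh_b>) {
            chomp $line;
            delete $chunk{$line};
            last unless %chunk;              # every line in the chunk matched
        }
        # Print the survivors in the order they were read from A.
        for my $line (sort { $chunk{$a} <=> $chunk{$b} } keys %chunk) {
            print {$out} "$line\n";
        }
    }
    return;
}

# Usage: script listA listB [chunk_size]
if (@ARGV >= 2) {
    open my $fh_a, '<', $ARGV[0] or die "$ARGV[0]: $!\n";
    open my $fh_b, '<', $ARGV[1] or die "$ARGV[1]: $!\n";
    only_in_a_chunked($fh_a, $fh_b, \*STDOUT, $ARGV[2] || 100_000);
}
```

Memory use is bounded by the chunk size, at the price of one pass over the second file per chunk, so the chunk should be as large as memory comfortably allows.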

Below you'll find a script I made and used to walk two largish parameter value lists I needed to compare and report on. This is a case where I knew the params were sorted by a key. So I split the lines from f1 and f2 into their keys k1 and k2 and values v1 and v2. Then compare k1 and k2. If equal, compare values v1 and v2 and report. If k1 < k2 then read next f1; if k1 > k2 then read next f2.
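
Stripped of the SAP-specific parts, the principle might look like the sketch below. It assumes "key value" lines separated by whitespace, no blank lines, and both files sorted in the same plain string order that Perl's lt/gt use (for example from sort run with LC_ALL=C). The names next_pair and compare_sorted and the report wording are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read the next "key value" line; returns (key, value), or an empty list at EOF.
sub next_pair {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    my ($k, $v) = split ' ', $line, 2;
    return ($k, defined $v ? $v : '');
}

# Walk two key-sorted files in step and report the differences.
sub compare_sorted {
    my ($fh1, $fh2, $out) = @_;
    my ($k1, $v1) = next_pair($fh1);
    my ($k2, $v2) = next_pair($fh2);
    while (defined $k1 and defined $k2) {
        if ($k1 lt $k2) {                    # key missing from file 2
            print {$out} "only in 1: $k1\n";
            ($k1, $v1) = next_pair($fh1);
        }
        elsif ($k1 gt $k2) {                 # key missing from file 1
            print {$out} "only in 2: $k2\n";
            ($k2, $v2) = next_pair($fh2);
        }
        else {                               # same key: compare the values
            print {$out} "differs: $k1 ($v1 vs $v2)\n" if $v1 ne $v2;
            ($k1, $v1) = next_pair($fh1);
            ($k2, $v2) = next_pair($fh2);
        }
    }
    # One file is exhausted; everything left in the other is unmatched.
    while (defined $k1) {
        print {$out} "only in 1: $k1\n";
        ($k1, $v1) = next_pair($fh1);
    }
    while (defined $k2) {
        print {$out} "only in 2: $k2\n";
        ($k2, $v2) = next_pair($fh2);
    }
    return;
}

# Usage: script sorted1 sorted2
if (@ARGV == 2) {
    open my $fh1, '<', $ARGV[0] or die "$ARGV[0]: $!\n";
    open my $fh2, '<', $ARGV[1] or die "$ARGV[1]: $!\n";
    compare_sorted($fh1, $fh2, \*STDOUT);
}
```

Only one line from each file is in memory at any time, which is why this scales to very large lists as long as they are sorted.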

The actual script below is probably not useful (unless you are comparing SAP benchmark r3.out files ;-), but the principle may become clearer reading it. (Then again, the added processing, like the 'known to be variable' substitutions and the pretty-printing, may confuse the concept beyond recognition :^)

Hope this helps some,
Hein.

#!/bin/perl
#
$f1 = $ARGV[0];
$f2 = $ARGV[1];
$ALL = $ARGV[2];
die "Must provide two R3.out files to compare" unless $f2;
open (F1, $f1) || die "Error open file 1: $f1";
open (F2, $f2) || die "Error open file 2: $f2";

# Find system ID and first parameter
while (<F1>) {
    $S1 = $1 if (/SAP System\s+(\w+)\s/);
    $I1 = $1 if (/^INSTANCE_NAME\s+$!$ (\w+)/);
    last if (/^Param/);
}

while (<F2>) {
    $S2 = $1 if (/SAP System\s+(\w+)\s/);
    $I2 = $1 if (/^INSTANCE_NAME\s+$!$ (\w+)/);
    last if (/^Param/);
}

$format = "%-30.30s %s %-20s %s %-20s\n";
print "\nColumn \"?\" legend: \"|\" = default, \"X\" = changed, \" \" = missing.\n\n";
printf $format, "Parameter", " ", "$S1 - $I1", " ", "$S2 - $I2";
printf $format, "------------------------------","?","--------------------",
    "?","--------------------";

# Walk F1; for each key, advance F2 until its key has caught up.
while (<F1>) {
    $v1 = " ";
    if (/^(\S+).*( |$!$) (.{1,20})/) {
        $k1 = $1;
        $d1 = ($2 eq " ") ? "|" : "X";
        $v1 = $3;
        $v1 =~ s/$S1/{SID}/g;
        $v1 =~ s/$I1/{INST}/g;
    }
    while ($k2 lt $k1) {
        last unless ($_ = <F2>);
        $v2 = " ";
        if (/^(\S+).*( |$!$) (.{1,20})/) {
            $k2 = $1;
            $d2 = ($2 eq " ") ? "|" : "X";
            $v2 = $3;
            $v2 =~ s/$S2/{SID}/g;
            $v2 =~ s/$I2/{INST}/g;
        }
        if ($k2 lt $k1) {
            # Key present in file 2 only
            printf $format, $k2, " ", " ", $d2, $v2;
        }
    }
    if ($k1 eq $k2) {
        printf $format, $k1, $d1, $v1, $d2, $v2 if
            ($ALL || ($v1 ne $v2 && ($d1.$d2 ne "||") ) );
    } else {
        # Key present in file 1 only
        printf $format, $k1, $d1, $v1, " ", " ";
    }
}

Sample output
Column "?" legend: "|" = default, "X" = changed, " " = missing.

Parameter                        BE2 - D11              RAC - D02
------------------------------ ? -------------------- ? --------------------
DIR_ORAHOME                    | /oracle/{SID}        X /oracle/{SID}/901_64
ES/SHM_MAX_PRIV_SEGS           X 63                   | 2
ES/TABLE                       X SHM_SEGS             | UNIX_STD
SAPDBHOST                      X b009                 X sapsd

A. Daniel King_1 · ‎02-13-2006

I've not yet installed the module which will give me "tie" ... and will award points as I evaluate the answers ...

Command-Line Junkie

H.Merijn Brand (procura) · 02-13-2006

"tie" is not a module. It's a standard Perl built-in mechanism.

"DB_File" is a module, but it comes with the CORE perl distribution, so whatever build you use should support it.

# perldoc -f tie

Enjoy, Have FUN! H.Merijn

Steven M Evans · ‎02-22-2006

I work with Daniel and I think I cooked up an answer. I'm posting it for 2 reasons:

1) To make sure my logic is correct.

2) For posterity. I dredged many an archive for something similar with no luck.

I didn't see the 'tie' suggestion before doing this, so I'm not sure how my solution compares. The lists have 14 million items each, so I wanted to go through each file as few times as possible. My code goes through each file once, but the inputs must be sorted. It steps through each file one line at a time and does an alphanumeric comparison. If the two lines are the same, the line is put in the 'in_both' file. If the line from file 0 is less than the line from file 1, it must not be in file 1, so it's put in the 'in_0_only' file, and vice versa. Run time was 3-7 minutes.

---(Preliminary stuff)
# Churn, baby, churn.
#---------------

# Read first lines.
# (FILE0 and FILE1 are assumed names for the two input handles
# opened in the preliminary part.)
unless ( defined($record0 = <FILE0>) ) { die "ERROR: $ARGV[0] is empty.\n"; }
unless ( defined($record1 = <FILE1>) ) { die "ERROR: $ARGV[1] is empty.\n"; }

while ( $eof0 == 0 && $eof1 == 0 )
{
    $need_next0 = 0 ;
    $need_next1 = 0 ;

    if ( $record0 eq $record1 )
    {
        print INBOTH $record0 ;
        $need_next0 = 1 ;
        $need_next1 = 1 ;
    } else {
        if ( $record0 lt $record1 )
        {
            print INFILE0 $record0 ;
            $need_next0 = 1;
        } else {
            print INFILE1 $record1 ;
            $need_next1 = 1;
        }
    }

    if ( $need_next0 == 1 )
    { unless ( defined($record0 = <FILE0>) ) { $eof0 = 1; } }
    if ( $need_next1 == 1 )
    { unless ( defined($record1 = <FILE1>) ) { $eof1 = 1; } }
}

# Cleanup the rest of the file.
#---------------
until ( $eof0 == 1 )
{
    print INFILE0 $record0 ;
    unless ( defined($record0 = <FILE0>) ) { $eof0 = 1; }
}

until ( $eof1 == 1 )
{
    print INFILE1 $record1 ;
    unless ( defined($record1 = <FILE1>) ) { $eof1 = 1; }
}
---

This is this.

A. Daniel King_1 · ‎05-22-2006

It seems I need a good book on algorithms.

Thanks, Steve, and everyone else who gave a shot!

Command-Line Junkie
