topic Re: awk parsing 2 files help in Operating System - HP-UX

awk parsing 2 files help

Florinda Adato — Mon, 02 Feb 2004 06:17:18 GMT

Hi,

I have two big files. I want to get the fields that match between my 1st and 2nd files, as well as those that do not match.

File1:

xxx 10 hello
yyy 20 hello
xxx 20 hello

File2:

xxx thanks 10
xxx please 20
zzz thanks 10

OUTPUT:

xxx 10 thanks hello
xxx 20 please hello
zzz 10 thanks hello
zzz 10 thanks
yyy 20 hello

Fields 1 and 2 of file1 should match fields 2 and 4 of file2.

thanks.

Re: awk parsing 2 files help

Florinda Adato — Mon, 02 Feb 2004 06:24:10 GMT

ooops, output should be:

xxx 10 thanks hello
xxx 20 please hello
zzz 10 thanks _____
yyy 20 ______ hello

Re: awk parsing 2 files help

Michael Schulte zur Sur — Mon, 02 Feb 2004 06:39:01 GMT

Hi,

Can you clarify this a bit more? I still don't know what you want.

Michael

Re: awk parsing 2 files help

Leif Halvarsson_2 — Mon, 02 Feb 2004 07:08:39 GMT

Hi,
Have a look at the "join" command instead; it matches fields from two files and prints selected fields from both of the files.

Re: awk parsing 2 files help

Graham Cameron_1 — Mon, 02 Feb 2004 08:08:06 GMT

Not sure what you're trying to do either, but I don't think awk is the tool to compare 2 large files.
If "join", as suggested, is no good, try the man pages for "comm", "uniq".

-- Graham

Re: awk parsing 2 files help

Hoefnix — Mon, 02 Feb 2004 08:30:41 GMT

The question looks almost the same as a previous one of yours; modifying the solution from that thread could do the trick, if I understand the question correctly.
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=372886

Regards,
Peter

Re: awk parsing 2 files help

Hein van den Heuvel — Mon, 02 Feb 2004 11:55:54 GMT

> I have two big files

Define big! For less than 10MB or so I would definitely just write a Perl (not awk!) script that remembers all lines and columns and prints them (optionally sorted) after everything is read. For an example see below.

For files larger than 1000MB you would need to pre-sort and do a classic merge join
(read one, read the other until it is larger than the one, read the one until it is larger than the other, and so on). That is readily done with awk (as long as the input is sorted, unlike your example!).
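
A minimal sketch of such a merge join in awk, assuming both inputs are already sorted in byte order on a single key in field 1. The file names a.sorted and b.sorted, the "#"-joined key and the "notmatched" placeholder are illustrative and do not come from the posts. It keeps every line of the first file and streams through the second, so neither file has to fit in memory:

```shell
# Hypothetical pre-sorted inputs: "key payload", key = field1#field2.
tmp=${TMPDIR:-/tmp}/mergejoin.$$
mkdir -p "$tmp"
cat > "$tmp/a.sorted" <<'EOF'
xxx#10 thanks
xxx#20 thanks
yyy#10 please
zzz#10 help
EOF
cat > "$tmp/b.sorted" <<'EOF'
xxx#10 hello
xxx#20 hello
zzz#10 ok
zzz#20 ok
EOF

# LC_ALL=C so awk compares keys in the same byte order the files were sorted in.
merge_out=$(LC_ALL=C awk -v other="$tmp/b.sorted" '
# Read the next line of the second file into okey/oval, or flag end of file.
function read_other(   line, f) {
    if ((getline line < other) > 0) { split(line, f, " "); okey = f[1] ""; oval = f[2] }
    else eof = 1
}
BEGIN { read_other() }
{
    key = $1 ""                                   # force string comparison
    while (!eof && okey < key) read_other()       # skip smaller keys in the second file
    print $1, $2, ((!eof && okey == key) ? oval : "notmatched")
}' "$tmp/a.sorted")
printf '%s\n' "$merge_out"
rm -rf "$tmp"
```

The sketch assumes each key appears at most once per file; duplicate keys would need extra bookkeeping.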

While you sort, or in addition to sorting, you could perhaps re-arrange the join fields such that the standard join tool can do the final work.

hth,
Hein.

>>>> Fields 1 and 2 of file1 should match fields 2 and 4 of file2.

You meant 1 and 2 matching 1 and 3 right?

open (FILE, "<File1");   # file name was lost in the posting; File1 is assumed
while (<FILE>) {
chomp;
($k1,$k2,$c) = split;
$x1{$k1." ".$k2} = "------";
$x2{$k1." ".$k2} = $c;
}
close (FILE);

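open (FILE, "<File2");   # file name was lost in the posting; File2 is assumed
while (<FILE>) {
chomp;
($k1,$c,$k2) = split;
$x1{$k1." ".$k2} = $c;
# keep the file1 column if this key was already seen there
$x2{$k1." ".$k2} = "------" unless (exists $x2{$k1." ".$k2});
}
close (FILE);
foreach $k (sort keys %x1) {
print "$k $x1{$k} $x2{$k}\n";
}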

xxx 10 thanks hello
xxx 20 please hello
yyy 20 ------ hello
zzz 10 thanks ------

without the sort in the foreach you'd get:

xxx 10 thanks hello
zzz 10 thanks ------
yyy 20 ------ hello
xxx 20 please hello

Re: awk parsing 2 files help

Elmar P. Kolkman — Tue, 03 Feb 2004 02:47:28 GMT

Ok, let's see if I understand. Both files are big, so reading into memory like we did with the AWK solution the previous time is not an option. So we need to find another solution.

Next, the order of the output. Should it be alphabetically ordered, or is the order unimportant? If it is unimportant, I could think of a nice solution, so please give more info on this.

And, to prevent procura to 'nag' about the solution I've in mind, empty lines can be ignored? Only lines with 3 fields are important/exist?

Re: awk parsing 2 files help

Elmar P. Kolkman — Wed, 04 Feb 2004 01:42:45 GMT

One more thing: is it possible you have multiple combinations of field 1 and field 2 or field 1 and field 3 in the files, for instance in File1:
xxx 10 hello
xxx 20 hello
xxx 10 bybye
yyy 10 oopsy

Or in File2:
xxx thanks 10
xxx please 10
xxx please 20

That way all solutions with putting 1 of the files in memory will fail, and a new solution should be written.
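
Whether such duplicate key combinations actually occur can be checked before picking a solution. A minimal sketch, using the sample rows above; the helper name dup_keys is made up for illustration, and the awk array holds every distinct key in memory, so for very large files something like sort piped into uniq -d on the extracted keys may be the safer route:

```shell
tmp=${TMPDIR:-/tmp}/dupkeys.$$
mkdir -p "$tmp"
cat > "$tmp/File1" <<'EOF'
xxx 10 hello
xxx 20 hello
xxx 10 bybye
yyy 10 oopsy
EOF
cat > "$tmp/File2" <<'EOF'
xxx thanks 10
xxx please 10
xxx please 20
EOF

# dup_keys FILE A B (hypothetical helper): print each key built from
# fields A and B once, at the moment it is seen for the second time.
dup_keys() {
    awk -v a="$2" -v b="$3" '{ key = $a " " $b; if (seen[key]++ == 1) print key }' "$1"
}

dups1=$(dup_keys "$tmp/File1" 1 2)   # key = fields 1 and 2
dups2=$(dup_keys "$tmp/File2" 1 3)   # key = fields 1 and 3
printf 'File1 duplicates: %s\nFile2 duplicates: %s\n' "$dups1" "$dups2"
rm -rf "$tmp"
```

Empty output for both files would mean the one-value-per-key approaches are safe to use.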

Re: awk parsing 2 files help

Florinda Adato — Wed, 04 Feb 2004 02:37:18 GMT

Hi Elmar,

Sorry for the late response... Let me clarify my question.

File1 is the main file, meaning every row from this file will be part of the output, for example:

File1:
xxx thanks 10
xxx thanks 20
yyy please 10
zzz help 10

File1 fields 1 and 3 have to be matched with File2 fields 1 and 2. Rows that match get an additional field that comes from File2. So if File2 contents are:

xxx 10 hello
xxx 20 hello
zzz 10 ok
zzz 20 ok

If the fields do not match, I have to put in a default field of "notmatched". The output will be:

10 xxx thanks hello
20 xxx thanks hello
10 yyy please notmatched
10 zzz help notmatched

I hope this time, I'm clear enough.

Thank you very much for the help. The first solution you gave me was really great and it made my script really fast. :-)
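
The clarified requirement can be sketched with a plain awk lookup, assuming File2 (the lookup side) fits in memory and each key occurs only once in it. The temporary directory and variable names are illustrative. Note that with the sample data above, the key zzz/10 does exist in File2, so this sketch prints "ok" for that row rather than "notmatched":

```shell
tmp=${TMPDIR:-/tmp}/lookup.$$
mkdir -p "$tmp"
cat > "$tmp/File1" <<'EOF'
xxx thanks 10
xxx thanks 20
yyy please 10
zzz help 10
EOF
cat > "$tmp/File2" <<'EOF'
xxx 10 hello
xxx 20 hello
zzz 10 ok
zzz 20 ok
EOF

# First pass (NR == FNR, i.e. File2): remember field 3 under the key (field 1, field 2).
# Second pass (File1): look up (field 1, field 3) and fall back to "notmatched".
lookup_out=$(awk '
NR == FNR { extra[$1, $2] = $3; next }
{
    found = (($1, $3) in extra)
    printf "%s %s %s %s\n", $3, $1, $2, (found ? extra[$1, $3] : "notmatched")
}' "$tmp/File2" "$tmp/File1")
printf '%s\n' "$lookup_out"
rm -rf "$tmp"
```

Only File2 is held in memory here; File1 is streamed, and its row order is preserved in the output.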

Re: awk parsing 2 files help

Michael Schulte zur Sur — Wed, 04 Feb 2004 04:07:26 GMT

Hi,

try the attachment.

Michael

Re: awk parsing 2 files help

Elmar P. Kolkman — Wed, 04 Feb 2004 05:18:08 GMT

Well, let's see if I can come up with a solution. But first a notice: your output order has changed...

Now for the solution. What I suggest is to combine the files again, but this time we do it a bit different. I have not tested this on large files, but the script would become:

( awk '{printf "1 %s %s %s\n",$1,$3,$2}' < File2 ; awk '{printf "2 %s %s %s\n",$1,$2,$3}' < File1 ) | sort -k 2,3 -k 1 | awk '$1=="1" { last1=$2;last2=$3;last3=$4 }
$1=="2" { if ($2==last1 && $3==last2)
{ printf "%s %s %s %s\n",$2,$3,$4,last3 }
else {printf "%s %s %s NOTMATCHED\n",$2,$3,$4}}'

Re: awk parsing 2 files help

Leif Halvarsson_2 — Wed, 04 Feb 2004 05:42:33 GMT

Hi,
The problem, as it is described, can be much simplified with some pre-processing of the data. By reordering the fields and merging the matching fields into one field in each file, you can do a simple join and then split the fields in the output. Try the following:

awk '{ printf "%s#%s %s\n", $1, $3, $2 }' xxx | sort >xxx1
awk '{ printf "%s#%s %s\n", $1, $2, $3 }' yyy | sort >yyy1
join -1 1 -2 1 -o 1.1,1.2,2.2 -a 1 xxx1 yyy1 | tr "#" " "

It is not a final solution but may give you some ideas.
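
One possible refinement of the commands above, sketched under the assumption that File1 and File2 have the layout from the clarified post: join's standard -e option fills output fields that are missing for unpaired lines, which gives the "notmatched" default directly. The temporary directory and the .keyed file names are illustrative, and LC_ALL=C is used so that sort and join agree on the ordering:

```shell
tmp=${TMPDIR:-/tmp}/joindemo.$$
mkdir -p "$tmp"
cat > "$tmp/File1" <<'EOF'
xxx thanks 10
xxx thanks 20
yyy please 10
zzz help 10
EOF
cat > "$tmp/File2" <<'EOF'
xxx 10 hello
xxx 20 hello
zzz 10 ok
zzz 20 ok
EOF

# Merge the two key fields into one "#"-joined field, then sort on it.
awk '{ printf "%s#%s %s\n", $1, $3, $2 }' "$tmp/File1" | LC_ALL=C sort > "$tmp/File1.keyed"
awk '{ printf "%s#%s %s\n", $1, $2, $3 }' "$tmp/File2" | LC_ALL=C sort > "$tmp/File2.keyed"

# -a 1 keeps unpaired File1 lines; -e supplies the text for their missing 2.2 field.
join_out=$(LC_ALL=C join -a 1 -e notmatched -o 1.1,1.2,2.2 \
    "$tmp/File1.keyed" "$tmp/File2.keyed" | tr '#' ' ')
printf '%s\n' "$join_out"
rm -rf "$tmp"
```

This assumes "#" never occurs in the data; any other character that cannot appear in the fields would work as the key separator.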

Re: awk parsing 2 files help

Hein van den Heuvel — Wed, 04 Feb 2004 10:07:40 GMT

Bah humbug.

This is a completely different requirement description from the initial:

> ooops, output should be:
>
> xxx 10 thanks hello
> xxx 20 please hello
> zzz 10 thanks _____
> yyy 20 ______ hello

That line 'zzz' could have only originated from file 2.

Now you tell us that file 1 is a driver, and the 'unmatched' can only appear in the last output column.

Much simpler! Boring even, and essentially answered in all prior replies.

Kindly ask the right question and study the replies!

Cheers,
Hein.

Re: awk parsing 2 files help

Elmar P. Kolkman — Thu, 05 Feb 2004 01:13:39 GMT

Hein has a point, though it's not that difficult to implement the original specs in the AWK solution. It is a matter of keeping track of whether the last line that came from File2 has been used for output.
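
A minimal sketch of that idea, assuming the layout from the first post (File1: key1 key2 text, File2: key1 text key2) and keys that are unique within each file. The "used" flag, the function name emit_unused and the "------" placeholder are illustrative choices, not taken from an actual posted script:

```shell
tmp=${TMPDIR:-/tmp}/outer.$$
mkdir -p "$tmp"
cat > "$tmp/File1" <<'EOF'
xxx 10 hello
yyy 20 hello
xxx 20 hello
EOF
cat > "$tmp/File2" <<'EOF'
xxx thanks 10
xxx please 20
zzz thanks 10
EOF

outer_out=$(
{
    awk '{ print 1, $1, $3, $2 }' "$tmp/File2"   # tag 1: key1 key2 text from File2
    awk '{ print 2, $1, $2, $3 }' "$tmp/File1"   # tag 2: key1 key2 text from File1
} | LC_ALL=C sort -k 2,3 -k 1,1 |
awk '
# Print the remembered File2 line if no File1 line consumed it.
function emit_unused() { if (have && !used) print k1, k2, t2, "------" }
$1 == 1 { emit_unused(); k1 = $2; k2 = $3; t2 = $4; have = 1; used = 0 }
$1 == 2 {
    if (have && $2 == k1 && $3 == k2) { print $2, $3, t2, $4; used = 1 }
    else print $2, $3, "------", $4
}
END { emit_unused() }'
)
printf '%s\n' "$outer_out"
rm -rf "$tmp"
```

Because of the sort, the output comes out ordered by key rather than in the order of either input file.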