Operating System - HP-UX

Florinda Adato
Occasional Advisor

awk parsing 2 files help

Hi,

I have two big files. What I want is to combine the lines whose key fields match in the 1st and 2nd files, and also to list the lines that did not match.

File1:

xxx 10 hello
yyy 20 hello
xxx 20 hello

File2:

xxx thanks 10
xxx please 20
zzz thanks 10

OUTPUT:

xxx 10 thanks hello
xxx 20 please hello
zzz 10 thanks hello
zzz 10 thanks
yyy 20 hello


Fields 1 and 2 of file1 should match fields 2 and 4 of file2.

thanks.


14 REPLIES
Florinda Adato
Occasional Advisor

Re: awk parsing 2 files help

ooops, output should be:

xxx 10 thanks hello
xxx 20 please hello
zzz 10 thanks _____
yyy 20 ______ hello
Michael Schulte zur Sur
Honored Contributor

Re: awk parsing 2 files help

Hi,

can you clarify this a bit more? I still don't know what you want.

Michael
Leif Halvarsson_2
Honored Contributor

Re: awk parsing 2 files help

Hi,
Have a look at the "join" command instead. It matches fields from two files and prints out selected fields from both of the files.
Graham Cameron_1
Honored Contributor

Re: awk parsing 2 files help

Not sure what you're trying to do either, but I don't think awk is the tool to compare 2 large files.
If "join", as suggested, is no good, try the man pages for "comm", "uniq".

-- Graham
Computers make it easier to do a lot of things, but most of the things they make it easier to do don't need to be done.
Hoefnix
Honored Contributor

Re: awk parsing 2 files help

The question looks almost the same as a previous one of yours. Modifying the solution from that thread could do the trick, if I understand the question correctly.
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=372886

Regards,
Peter
Hein van den Heuvel
Honored Contributor

Re: awk parsing 2 files help


> I have two big files

Define big! For less than 10MB or so I would definitely just write a Perl (not awk!) script that remembers all lines and columns and prints them (optionally sorted) after everything has been read. For an example see below.

For files larger than 1000MB you would need to pre-sort and do a classic merge join
(read one, read the other until it is larger than the one, read the one until it is larger than the other, and so on). That is readily done with awk (as long as the input is sorted, unlike your example!).

While you sort, or in addition to the sort, you could perhaps re-arrange the join fields such that the standard join tool can do the final work.

hth,
Hein.


>>>> Fields 1 and 2 of file1 should match fields 2 and 4 of file2.

You meant fields 1 and 2 of file1 matching fields 1 and 3 of file2, right?


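# The file names were lost from the original post; "File1" and "File2" are assumed.
open (FILE, "<File1") or die "File1: $!";
while (<FILE>) {
chomp;
($k1,$k2,$c) = split;
$x1{$k1." ".$k2} = "------";
$x2{$k1." ".$k2} = $c;
}
close (FILE);

open (FILE, "<File2") or die "File2: $!";
while (<FILE>) {
chomp;
($k1,$c,$k2) = split;
$x1{$k1." ".$k2} = $c;
$x2{$k1." ".$k2} = "------" unless (exists $x2{$k1." ".$k2});
}
close (FILE);
foreach $k (sort keys %x1) {
print "$k $x1{$k} $x2{$k}\n";
}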

xxx 10 thanks hello
xxx 20 please hello
yyy 20 ------ hello
zzz 10 thanks ------


without the sort in the foreach you'd get:

xxx 10 thanks hello
zzz 10 thanks ------
yyy 20 ------ hello
xxx 20 please hello
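
For comparison, since the thread title asks for awk: a rough awk equivalent of the Perl above, as a sketch only. The function name outer_join is made up for illustration. It assumes three whitespace-separated fields per line, unique key pairs within each file, a non-empty first file (the NR == FNR test relies on that), and that both files fit in memory.

```shell
# outer_join FILE1 FILE2   (hypothetical helper name)
# FILE1 lines: key1 key2 text     FILE2 lines: key1 text key2
outer_join() {
    awk '
        # First file: remember its third field under the key "key1 key2".
        NR == FNR { from1[$1 " " $2] = $3; seen[$1 " " $2] = 1; next }
        # Second file: the key is fields 1 and 3, the payload is field 2.
        { from2[$1 " " $3] = $2; seen[$1 " " $3] = 1 }
        END {
            for (k in seen) {
                a = (k in from2) ? from2[k] : "------"
                b = (k in from1) ? from1[k] : "------"
                print k, a, b
            }
        }
    ' "$1" "$2" | sort
}
```

Called as outer_join File1 File2 on the sample data, this should give the same four lines as the sorted Perl output. The for (k in seen) loop has no defined order in awk, which is why the result is piped through sort.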


Elmar P. Kolkman
Honored Contributor

Re: awk parsing 2 files help

Ok, let's see if I understand. Both files are big, so reading into memory like we did with the AWK solution the previous time is not an option. So we need to find another solution.

Next, the order of the output. Should it be alphabetically ordered, or is the order unimportant? If the order is unimportant, I can think of a nice solution, so please give more info on this.

And, to keep procura from 'nagging' about the solution I have in mind: can empty lines be ignored? Do only lines with 3 fields matter/exist?
Every problem has at least one solution. Only some solutions are harder to find.
Elmar P. Kolkman
Honored Contributor

Re: awk parsing 2 files help

One more thing: is it possible you have multiple combinations of field 1 and field 2 or field 1 and field 3 in the files, for instance in File1:
xxx 10 hello
xxx 20 hello
xxx 10 bybye
yyy 10 oopsy

Or in File2:
xxx thanks 10
xxx please 10
xxx please 20

In that case, any solution that loads one of the files into memory keyed on those fields will fail, and a new solution has to be written.
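
A quick way to check whether this applies to your files, as a sketch: list every key pair that occurs on more than one line. The function name dup_keys and its argument order are made up for illustration. It lets sort do the heavy lifting, so nothing has to fit in awk's memory.

```shell
# dup_keys A B FILE   (hypothetical helper name)
# Prints each "fieldA fieldB" pair that appears on more than one line of FILE.
dup_keys() {
    awk -v a="$1" -v b="$2" '{ print $a, $b }' "$3" | sort | uniq -d
}
```

For the File1 example above, dup_keys 1 2 File1 reports "xxx 10". For the File2 example, where the key is fields 1 and 3, dup_keys 1 3 File2 reports "xxx 10" as well. No output means the key pairs are unique.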

Every problem has at least one solution. Only some solutions are harder to find.
Florinda Adato
Occasional Advisor

Re: awk parsing 2 files help

Hi Elmar,

Sorry for the late response... Let me clarify my question.

File1 is the main file, meaning every row from this file will be part of the output, for example:

File1:
xxx thanks 10
xxx thanks 20
yyy please 10
zzz help 10

File1 fields 1 and 3 have to be matched with File2 fields 1 and 2. The rows that match get an additional field, which comes from File2. So if the File2 contents are:

xxx 10 hello
xxx 20 hello
zzz 10 ok
zzz 20 ok

Then, if the fields do not match, I have to put in a default field of "notmatched". The output will be:

10 xxx thanks hello
20 xxx thanks hello
10 yyy please notmatched
10 zzz help ok


I hope this time, I'm clear enough.

Thank you very much for the help. The first solution you gave me was really great and it made my script really fast. :-)

Michael Schulte zur Sur
Honored Contributor

Re: awk parsing 2 files help

Hi,

try the attachment.

Michael
Elmar P. Kolkman
Honored Contributor

Re: awk parsing 2 files help

Well, let's see if I can come up with a solution. But first a note: your output order has changed...

Now for the solution. What I suggest is to combine the files again, but this time we do it a bit differently. I have not tested this on large files, but the script would become:

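# File2 (lookup) lines get tag 1, File1 (main) lines get tag 2, key fields first,
# so after the sort each lookup line comes just before the main lines with the same key.
( awk '{printf "1 %s %s %s\n",$1,$2,$3}' < File2 ; awk '{printf "2 %s %s %s\n",$1,$3,$2}' < File1 ) | sort -k 2,3 -k 1 | awk '$1=="1" { last1=$2;last2=$3;last3=$4 }
$1=="2" { if ($2==last1 && $3==last2)
{ printf "%s %s %s %s\n",$2,$3,$4,last3 }
else {printf "%s %s %s NOTMATCHED\n",$2,$3,$4}}'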

Every problem has at least one solution. Only some solutions are harder to find.
Leif Halvarsson_2
Honored Contributor

Re: awk parsing 2 files help

Hi,
The problem, as it is described, can be much simplified with some pre-processing of the data. By reordering the fields and merging the matching fields into one field in each file, you can do a simple join and then split the fields in the output. Try the following:

awk '{ printf "%s#%s %s\n", $1, $3, $2 }' xxx | sort >xxx1
awk '{ printf "%s#%s %s\n", $1, $2, $3 }' yyy | sort >yyy1
join -1 1 -2 1 -o 1.1,1.2,2.2 -a 1 xxx1 yyy1 | tr "#" " "



It is not a final solution but may give you some ideas.
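
One possible way to take this further, as a sketch rather than a finished script: join's -e option fills the empty field on unpaired lines, and a final awk can split the merged key and put the columns in the requested order. The function name left_join and the temporary file names are made up for illustration. It assumes "#" never occurs in the data and that key pairs are unique in the lookup file.

```shell
# left_join MAIN LOOKUP   (hypothetical helper name)
# MAIN lines:   key1 text key2     (File1 in the clarified question)
# LOOKUP lines: key1 key2 extra    (File2 in the clarified question)
left_join() {
    main="${TMPDIR:-/tmp}/left_join.$$.main"
    lookup="${TMPDIR:-/tmp}/left_join.$$.lookup"
    # Glue the two key fields into one join field, then sort on that field.
    awk '{ printf "%s#%s %s\n", $1, $3, $2 }' "$1" | LC_ALL=C sort -k 1,1 > "$main"
    awk '{ printf "%s#%s %s\n", $1, $2, $3 }' "$2" | LC_ALL=C sort -k 1,1 > "$lookup"
    # -a 1 keeps unpaired MAIN lines; -e fills their empty 2.2 field.
    LC_ALL=C join -a 1 -e notmatched -o 1.1,1.2,2.2 "$main" "$lookup" |
        awk '{ split($1, k, "#"); print k[2], k[1], $2, $3 }'
    rm -f "$main" "$lookup"
}
```

With the sample files from the clarified question, left_join File1 File2 prints the rows as "key2 key1 text extra". Note that the "zzz help 10" row does pair with "zzz 10 ok" in File2 under the stated rule, so it comes out as "10 zzz help ok". LC_ALL=C is set so that sort and join agree on the ordering.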
Hein van den Heuvel
Honored Contributor

Re: awk parsing 2 files help


Bah humbug.

This is a completely different requirement description from the initial:

> ooops, output should be:
>
> xxx 10 thanks hello
> xxx 20 please hello
> zzz 10 thanks _____
> yyy 20 ______ hello

That line 'zzz' could have only originated from file 2.

Now you tell us that file 1 is a driver, and the 'unmatched' can only appear in the last output column.

Much simpler! Boring even, and essentially answered in all prior replies.

Kindly ask the right question and study the replies!

Cheers,
Hein.


Elmar P. Kolkman
Honored Contributor

Re: awk parsing 2 files help

Hein has a point, though it's not that difficult to implement the original specs in the AWK solution. It's a matter of keeping track of whether the last line that came from File2 has been used for output.
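
A sketch of what that bookkeeping could look like, using the file layout from the original question. This is untested on large files. The names full_merge and emit_pending are made up for illustration, and it assumes key pairs are unique within File2.

```shell
# full_merge FILE1 FILE2   (hypothetical helper name)
# FILE1 lines: key1 key2 text     FILE2 lines: key1 text key2
full_merge() {
    ( awk '{ printf "1 %s %s %s\n", $1, $3, $2 }' "$2"
      awk '{ printf "2 %s %s %s\n", $1, $2, $3 }' "$1" ) |
    LC_ALL=C sort -k 2,3 -k 1,1 |
    awk '
        # Print the remembered FILE2 line if no FILE1 line has used it.
        function emit_pending() {
            if (have && !used) print last1, last2, last3, "------"
        }
        $1 == "1" { emit_pending(); last1 = $2; last2 = $3; last3 = $4; have = 1; used = 0 }
        $1 == "2" {
            if (have && $2 == last1 && $3 == last2) {
                print $2, $3, last3, $4
                used = 1
            } else {
                print $2, $3, "------", $4
            }
        }
        END { emit_pending() }
    '
}
```

The used flag is the "has it been output" marker: a File2 line is only printed on its own when the next File2 line (or the end of input) arrives and no File1 line has claimed it in between. On the sample data from the first post, full_merge File1 File2 should give the same four lines as Hein's sorted Perl output.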
Every problem has at least one solution. Only some solutions are harder to find.