Re: awk parsing 2 files help

Florinda Adato · ‎02-01-2004

Hi,

I have two big files. I want to get the fields that match between my 1st and 2nd files, and also those that did not match.

File1:

xxx 10 hello
yyy 20 hello
xxx 20 hello

File2:

xxx thanks 10
xxx please 20
zzz thanks 10

OUTPUT:

xxx 10 thanks hello
xxx 20 please hello
zzz 10 thanks hello
zzz 10 thanks
yyy 20 hello

Fields 1 and 2 of file1 should match fields 2 and 4 of file2.

thanks.

Florinda Adato · ‎02-01-2004

ooops, output should be:

xxx 10 thanks hello
xxx 20 please hello
zzz 10 thanks _____
yyy 20 ______ hello

Michael Schulte zur Sur · ‎02-01-2004

Hi,

Can you clarify this a bit more? I still don't know what you want.

Michael

Leif Halvarsson_2 · ‎02-01-2004

Hi,
Have a look at the "join" command instead. It matches fields from two files and prints selected fields from both of them.

Graham Cameron_1 · ‎02-02-2004

Not sure what you're trying to do either, but I don't think awk is the tool to compare 2 large files.
If "join", as suggested, is no good, try the man pages for "comm", "uniq".

-- Graham

Computers make it easier to do a lot of things, but most of the things they make it easier to do don't need to be done.

Hoefnix · ‎02-02-2004

The question looks almost the same as a previous one of yours. Modifying the solution from that thread could do the trick, if I understand the question correctly.
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=372886

Regards,
Peter

Hein van den Heuvel · ‎02-02-2004

> I have two big files

Define big! For less than 10MB or so I would definitely just write a Perl (not awk!) script that remembers all lines and columns and prints them (optionally sorted) after everything is read. For an example see below.

For files larger than 1000MB you would need to pre-sort and do a classic merge join
(read one, read the other until larger than one, read one until larger than the other, and so on). That is readily done with awk (as long as the input is sorted, unlike your example!).

While you sort, or in addition to the sort, you could perhaps re-arrange the join fields such that the standard join tool can do the final work.
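A minimal sketch of such a merge join in awk, for illustration only. It assumes the sample data from the first post, that each key pair occurs at most once per file, and that both files are first rewritten as "key1 key2 value" and sorted bytewise (LC_ALL=C) so that awk's string comparison agrees with the sort order. The scratch directory, the f1.sorted/f2.sorted names, and the read1/read2 helpers are made up for the example.

```shell
work=$(mktemp -d)   # scratch directory for the sample data
printf 'xxx 10 hello\nyyy 20 hello\nxxx 20 hello\n' > "$work/File1"
printf 'xxx thanks 10\nxxx please 20\nzzz thanks 10\n' > "$work/File2"

# Key fields first in both files, then a bytewise sort on the key.
LC_ALL=C sort -k1,1 -k2,2 "$work/File1" > "$work/f1.sorted"
awk '{ print $1, $3, $2 }' "$work/File2" | LC_ALL=C sort -k1,1 -k2,2 > "$work/f2.sorted"

result=$(LC_ALL=C awk -v f1="$work/f1.sorted" -v f2="$work/f2.sorted" '
function read1(   line, p) {   # next record of f1: sets more1, k1, v1
    more1 = ((getline line < f1) > 0)
    if (more1) { split(line, p, " "); k1 = p[1] " " p[2]; v1 = p[3] }
}
function read2(   line, p) {   # next record of f2: sets more2, k2, v2
    more2 = ((getline line < f2) > 0)
    if (more2) { split(line, p, " "); k2 = p[1] " " p[2]; v2 = p[3] }
}
BEGIN {
    read1(); read2()
    while (more1 || more2) {
        if (!more2 || (more1 && k1 < k2)) {   # key only in file 1
            print k1, "------", v1; read1()
        } else if (!more1 || k2 < k1) {       # key only in file 2
            print k2, v2, "------"; read2()
        } else {                              # same key in both files
            print k1, v2, v1; read1(); read2()
        }
    }
}')
printf '%s\n' "$result"
```

Only one record per file is held in memory at a time, which is the point of pre-sorting. Keys are built by concatenation so that they compare as strings, the same way sort orders them.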

hth,
Hein.

>>>> Fields 1 and 2 of file1 should match fields 2 and 4 of file2.

You meant 1 and 2 matching 1 and 3 right?

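# The file names were lost from the post; "File1" and "File2" are assumed here.
open (FILE, "<File1") or die "File1: $!";
while (<FILE>) {
    chomp;
    ($k1,$k2,$c) = split;                 # File1 layout: key1 key2 value
    $x1{$k1." ".$k2} = "------";          # placeholder until File2 supplies a value
    $x2{$k1." ".$k2} = $c;
}
close (FILE);

open (FILE, "<File2") or die "File2: $!";
while (<FILE>) {
    chomp;
    ($k1,$c,$k2) = split;                 # File2 layout: key1 value key2
    $x1{$k1." ".$k2} = $c;
    $x2{$k1." ".$k2} = "------" unless ($x2{$k1." ".$k2});
}
close (FILE);

foreach $k (sort keys %x1) {
    print "$k $x1{$k} $x2{$k}\n";
}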

xxx 10 thanks hello
xxx 20 please hello
yyy 20 ------ hello
zzz 10 thanks ------

without the sort in the foreach you'd get:

xxx 10 thanks hello
zzz 10 thanks ------
yyy 20 ------ hello
xxx 20 please hello

Elmar P. Kolkman · ‎02-02-2004

Ok, let's see if I understand. Both files are big, so reading into memory like we did with the AWK solution the previous time is not an option. So we need to find another solution.

Next, the order of the output. Should it be alphabetically ordered, or is the order unimportant? If it is unimportant, I can think of a nice solution, so please give more info on this.

And, to prevent procura to 'nag' about the solution I've in mind, empty lines can be ignored? Only lines with 3 fields are important/exist?

Every problem has at least one solution. Only some solutions are harder to find.

Elmar P. Kolkman · ‎02-03-2004

One more thing: is it possible you have multiple combinations of field 1 and field 2 or field 1 and field 3 in the files, for instance in File1:
xxx 10 hello
xxx 20 hello
xxx 10 bybye
yyy 10 oopsy

Or in File2:
xxx thanks 10
xxx please 10
xxx please 20

In that case every solution that keeps one of the files in memory, indexed by key, will fail, and a new solution will have to be written.
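Whether such repeated keys actually occur can be checked up front. A small sketch, assuming the two sample layouts just shown (key in fields 1 and 2 for File1, fields 1 and 3 for File2); the scratch directory and variable names are only for illustration:

```shell
work=$(mktemp -d)   # scratch directory for the sample data
printf 'xxx 10 hello\nxxx 20 hello\nxxx 10 bybye\nyyy 10 oopsy\n' > "$work/File1"
printf 'xxx thanks 10\nxxx please 10\nxxx please 20\n' > "$work/File2"

# Print each key pair that appears more than once in a file.
dups1=$(awk '{ print $1, $2 }' "$work/File1" | LC_ALL=C sort | uniq -d)
dups2=$(awk '{ print $1, $3 }' "$work/File2" | LC_ALL=C sort | uniq -d)
printf 'File1 duplicates: %s\nFile2 duplicates: %s\n' "$dups1" "$dups2"
```

If both lists come back empty, the one-value-per-key approaches in this thread are safe to use.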

Every problem has at least one solution. Only some solutions are harder to find.

Florinda Adato · ‎02-03-2004

Hi Elmar,

Sorry for the late response... Let me clarify my question.

File1 is the main file, meaning every row from this file will be part of the output, for example:

File1:
xxx thanks 10
xxx thanks 20
yyy please 10
zzz help 10

File1 fields 1 and 3 have to be matched with File2 fields 1 and 2. Rows that match get an additional field which comes from File2. So if File2 contents are:

xxx 10 hello
xxx 20 hello
zzz 10 ok
zzz 20 ok

Then, if the fields do not match, I have to put in a default field of "notmatched". The output will be:

10 xxx thanks hello
20 xxx thanks hello
10 yyy please notmatched
10 zzz help notmatched

I hope this time, I'm clear enough.

Thank you very much for the help. The first solution you gave me was really great and it made my script really fast. :-)
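The matching rule described above can be sketched in a single awk pass, assuming File2 is small enough to hold in memory and has at most one row per key pair. The scratch directory and the array name "extra" are made up for the example.

```shell
work=$(mktemp -d)   # scratch directory for the sample data
printf 'xxx thanks 10\nxxx thanks 20\nyyy please 10\nzzz help 10\n' > "$work/File1"
printf 'xxx 10 hello\nxxx 20 hello\nzzz 10 ok\nzzz 20 ok\n' > "$work/File2"

result=$(awk '
    # First file on the command line (File2): remember field 3 under the key (field 1, field 2).
    NR == FNR { extra[$1 SUBSEP $2] = $3; next }
    # Second file (File1): look up the key (field 1, field 3), default to "notmatched".
    {
        key = $1 SUBSEP $3
        print $3, $1, $2, ((key in extra) ? extra[key] : "notmatched")
    }' "$work/File2" "$work/File1")
printf '%s\n' "$result"
```

Note that with the sample data exactly as posted, the "zzz help 10" row does find a partner in File2 ("zzz 10 ok"), so this sketch prints "ok" in that row rather than "notmatched".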

Michael Schulte zur Sur · ‎02-03-2004

Hi,

try the attachment.

Michael

Elmar P. Kolkman · ‎02-03-2004

Well, let's see if I can come up with a solution. But first a note: your output order has changed...

Now for the solution. What I suggest is to combine the files again, but this time we do it a bit differently. I have not tested this on large files, but the script would become:

( awk '{printf "1 %s %s %s\n",$1,$3,$2}' < File2 ; awk '{printf "2 %s %s %s\n",$1,$2,$3}' < File1 ) | sort -k 2,3 -k 1 | awk '$1=="1" { last1=$2;last2=$3;last3=$4 }
$1=="2" { if ($2==last1 && $3==last2)
{ printf "%s %s %s %s\n",$2,$3,$4,last3 }
else {printf "%s %s %s NOTMATCHED\n",$2,$3,$4}}'

Every problem has at least one solution. Only some solutions are harder to find.

Leif Halvarsson_2 · ‎02-03-2004

Hi,
The problem, as it is described, can be much simplified with some pre-processing of the data. By reordering the matching fields and merging them into one field in each file, you can do a simple join and then split the fields in the output. Try the following:

awk '{ printf "%s#%s %s\n", $1, $3, $2 }' xxx | sort >xxx1
awk '{ printf "%s#%s %s\n", $1, $2, $3 }' yyy | sort >yyy1
join -1 1 -2 1 -o 1.1,1.2,2.2 -a 1 xxx1 yyy1 | tr "#" " "

It is not a final solution but may give you some ideas.
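One way this idea might be finished, as a sketch only: join's -e option fills the missing File2 field for unpaired rows. It assumes the clarified layouts (File1 "xxx thanks 10", File2 "xxx 10 hello"), that "#" never appears in the data, and that sort and join both run under LC_ALL=C so they agree on ordering. The .key file names are made up for the example.

```shell
work=$(mktemp -d)   # scratch directory for the sample data
printf 'xxx thanks 10\nxxx thanks 20\nyyy please 10\nzzz help 10\n' > "$work/File1"
printf 'xxx 10 hello\nxxx 20 hello\nzzz 10 ok\nzzz 20 ok\n' > "$work/File2"

# Merge the two key fields into one "a#b" field and sort on it.
awk '{ print $1 "#" $3, $2 }' "$work/File1" | LC_ALL=C sort -k1,1 > "$work/File1.key"
awk '{ print $1 "#" $2, $3 }' "$work/File2" | LC_ALL=C sort -k1,1 > "$work/File2.key"

# -a 1 keeps unpaired File1 rows; -e supplies the default for the empty 2.2 field.
result=$(LC_ALL=C join -a 1 -e notmatched -o 1.1,1.2,2.2 \
             "$work/File1.key" "$work/File2.key" | tr '#' ' ')
printf '%s\n' "$result"
```

The output comes out in key order rather than in File1's original order, and the field order (key first) differs from the requested one; a final awk '{ print $2, $1, $3, $4 }' would rearrange it.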

Hein van den Heuvel · ‎02-04-2004

Bah humbug.

This is a completely different requirement description from the initial:

> ooops, output should be:
>
> xxx 10 thanks hello
> xxx 20 please hello
> zzz 10 thanks _____
> yyy 20 ______ hello

That line 'zzz' could have only originated from file 2.

Now you tell us that file 1 is a driver, and the 'unmatched' can only appear in the last output column.

Much simpler! Boring even, and essentially answered in all prior replies.

Kindly ask the right question and study the replies!

Cheers,
Hein.

Elmar P. Kolkman · ‎02-04-2004

Hein has a point, though it's not that difficult to implement the original specs in the AWK solution. It is a matter of keeping track of whether the last line that came from File2 has been used for output.
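A rough sketch of that bookkeeping, assuming the layouts of the very first post (File1 "xxx 10 hello", File2 "xxx thanks 10"), at most one row per key pair in each file, and "------" as the filler. The names pending and emit_pending are made up for the example.

```shell
work=$(mktemp -d)   # scratch directory for the sample data
printf 'xxx 10 hello\nyyy 20 hello\nxxx 20 hello\n' > "$work/File1"
printf 'xxx thanks 10\nxxx please 20\nzzz thanks 10\n' > "$work/File2"

result=$( ( awk '{ print "1", $1, $3, $2 }' "$work/File2"
            awk '{ print "2", $1, $2, $3 }' "$work/File1" ) |
  LC_ALL=C sort -k2,2 -k3,3 -k1,1 |
  awk '
    # Print the remembered File2 line if no File1 line has used it yet.
    function emit_pending() { if (pending) print key, val, "------"; pending = 0 }
    $1 == "1" { emit_pending(); key = $2 " " $3; val = $4; pending = 1; next }
    {
        if (pending && ($2 " " $3) == key) { print key, val, $4; pending = 0 }
        else { emit_pending(); print $2, $3, "------", $4 }
    }
    END { emit_pending() }' )
printf '%s\n' "$result"
```

The pending flag is what "has been used for output" amounts to: a File2 line is held back until either a File1 line with the same key arrives or a different key shows that none will.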

Every problem has at least one solution. Only some solutions are harder to find.
