Operating System - Linux
1748150 Members
3674 Online
108758 Solutions
New Discussion юеВ

Re: merge files and match columns

 
SOLVED
Go to solution
Mark H Smith
Advisor

merge files and match columns

I have two tab separated data sources I want to merge together. Both files contain matching first columns, however in one file there are extra items throughout that throw the alignment off when sorted.

#:::::::::::::::::::::::::::::::::::::::::::::::
#:::: File1 (en -> cn)
#:::::::::::::::::::::::::::::::::::::::::::::::
Zone 14 щНЦхУДчЕЩ 14
Zone 13 щНЦхУДчЕЩ 13
Zone 12 щНЦхУДчЕЩ 12
Zone 11 щНЦхУДчЕЩ 11
Zone 10 щНЦхУДчЕЩ 10
ZWZimbabwe ZWхи▓хДПхПНчФпхЖочЕв
ZV Audio ZV щЧКцК╜юХ╢
Zoom Video Vol ч╝ВтХВцЦБчСЩхЧЫюХ╢щЧКцК╜хЩ║
Zoom Video Mute ч╝ВтХВцЦБчСЩхЧЫюХ╢щЧИцмУчЕ╢
Zoom, Up and In ч╝ВтХВцЦБщФЫх▒╜цВЬц╢УхйВцЛ░щПАцГзхБЙ
Zoom to: щлКхЙзуБЪхзгцЦ╛ч╖е:
Zoom to full screen ч╝ВтХВцЦБщС╖хЖ▓хППчБЮ
Zoom to 50% ч╝ВтХВцЦБщС╖?50%
Zoom to 100% ч╝ВтХВцЦБщС╖?100%
Zoom Timeline Out ч╝ВтХБчЪмщПГхаХцг┐ч╗╛
Zoom Timeline In щПАцГзхБЙщПГхаХцг┐ч╗╛?


#:::::::::::::::::::::::::::::::::::::::::::::::
#:::: FILE 2 (en -> jp)
#:::::::::::::::::::::::::::::::::::::::::::::::
Zone 13 щЧЛци║чЕЩ 13
Zone 12 щЧЛци║чЕЩ 12
Zone 11 щЧЛци║чЕЩ 11
Zone 10 щЧЛци║чЕЩ 10
Zone 9 щЧЛци║чЕЩ 9
Zone 8 щЧЛци║чЕЩ 8
Zone 7 щЧЛци║чЕЩ 7
Z-Order щЦ▓х╢Гф╝ГщЧЛ
Z-Order: щЦ▓х╢Гф╝ГщЧЛ?
Zoom Video Vol щКИцБТхЕЧщКЙ?щКЙцТ▒хДСщКИюБИчЕ╢щЦ▓
Zoom Video Mute щКИцБТхЕЧщКЙ?щКЙцТ▒хДСщКИ?щКЙчЖ┤хД▒щКЙчм║хДУ
Zoom to selection. щНКх╢Зх╖╝щКИцОЧф╝ХщО╢чВкф╗ЪщКИхмитВм
Zoom To: щО╕хЫзчХ╛щНКх╢Зх╖╝:
Zoom to Range ч╗бхЛлц┤╕щКЗцРЮхБдщКЙчм║хДм
Zoom to Page Width щКИюЕЬхДпщКЙчВ╜хДЫщКИх╣┐ф╝ДщкЮхСлф╝АчРЫуДзхБв
Zoom Tool щКИцБТхЕЧщКЙ?щКЙхЛйхЕЧщКЙ
Zoom to selection. щНКх╢Зх╖╝щКИцОЧф╝ХщО╢чВкф╗ЪщКИхмитВм
Zoom to чРЫуДзхБвч╗бхЛлц┤╕
Zoom To: щО╕хЫзчХ╛щНКх╢Зх╖╝:
Zoom To: щКИцБТхЕЧщКЙчК▓тВмх╢Зх╖╝:
Zoom to Range ч╗бхЛлц┤╕щКЗцРЮхБдщКЙчм║хДм
Zoom to Page Width щКИюЕЬхДпщКЙчВ╜хДЫщКИх╣┐ф╝ДщкЮхСлф╝АчРЫуДзхБв
Zoom Tool щКИцБТхЕЧщКЙ?щКЙхЛйхЕЧщКЙ
Zoom to Full Page щКЙц░ехЕЧщКИчи┐хППц╡гцТ▒хА░чРЫуДзхБв
Zoom to 50% щКИцБТхЕЧщКЙ?(50%)
Zoom to 100% щКИцБТхЕЧщКЙ?(100%)
Zoom Timeline Out щКИшгдхБЖщКЙчКохД╡щКИхВЦхЕВщКЗюЖ╛х║ЬчБПх┐ЪуАГч╗А?
Zoom Timeline In щКИшгдхБЖщКЙчКохД╡щКИхВЦхЕВщКЗюЖ╜хлЫц╛╢шзеуДз ?


So I want to end up with an english sentence in column 1, and its translated values (in Japanese & Chinese) in columns 2 & 3 - all on one line. Ideally I'd end up with all three languages on one line, but as you can see, to begin with the files are different and I can't always expect to have data in every language.

#:::::::::::::::::::::::::::::::::::::::::::::::
#:::: NewFile (en -> cn -> jp)
#:::::::::::::::::::::::::::::::::::::::::::::::
Zone 14 щНЦхУДчЕЩ 14
Zone 13 щНЦхУДчЕЩ 13 щЧЛци║чЕЩ 13
Zone 12 щНЦхУДчЕЩ 12 щЧЛци║чЕЩ 12
Zone 11 щНЦхУДчЕЩ 11 щЧЛци║чЕЩ 11
Zone 10 щНЦхУДчЕЩ 10 щЧЛци║чЕЩ 10
Zone 9 щЧЛци║чЕЩ 9
Zone 8 щЧЛци║чЕЩ 8
Zone 7 щЧЛци║чЕЩ 7
ZWZimbabwe ZWхи▓хДПхПНчФпхЖочЕв
ZV Audio ZV щЧКцК╜юХ╢
Zoom Video Vol ч╝ВтХВцЦБчСЩхЧЫюХ╢щЧКцК╜хЩ║ щКИцБТхЕЧщКЙ?щКЙцТ▒хДСщКИюБИчЕ╢щЦ▓
Zoom Video Mute ч╝ВтХВцЦБчСЩхЧЫюХ╢щЧИцмУчЕ╢ щКИцБТхЕЧщКЙ?щКЙцТ▒хДСщКИ?щКЙчЖ┤хД▒щКЙчм║хДУ


Rather than manually aligning the content in an editor, here is my idea: run a script that

Compares line by line:
1. Reads the first column of File1
2. Looks for a match in File2; grabs the matching line, sending it to NewFile.



9 REPLIES 9
Muthukumar_5
Honored Contributor

Re: merge files and match columns

First column of File1 will sometimes unique one as,
Zone. We can get upto and look for a match in File2.

while read line; do

match=$(echo $line | sed -e "s/.*//")
grep $match >

done <

hth.
Easy to suggest when don't know about the problem!
Hein van den Heuvel
Honored Contributor

Re: merge files and match columns


I had a little trouble with your data as it seems to be unicode no?
Here is one approach in PERL with normal text.
How did you wond the part sorted?
Provide input files as arguments:

perl x.p file1 file2 ...

---- x.p ----
while ($file = shift @ARGV) {
open (FILE, "<$file") or die "Failed to open $file";
while () {
chop;
if (/\s+(<\w+>)/) {
$english{$`}++;
$foreign{$`.$1} = "$1$'";
}
}
}
foreach (sort keys %english) {
printf ("%s\t%s\t%s\n", $_, $foreign{$_.""},$foreign{$_.""

});
}


Here is a waekish attempt on doing unicode with normal libraries. It adds dynamically finding the languages.

while ($file = shift @ARGV) {
open (FILE, "<$file") or die "Failed to open $file";
while () {
chop;
chop;
if (/\s.(<.\w.\w.>.)/) {
$english{$`}++;
$languages{$1}++;
$foreign{$`.$1} = "$1$'";
}
}
}
foreach (sort keys %languages){
$l[$i++]=$_;
}

foreach (sort keys %english) {
printf ("%-50s\t%30s\t%30s\n", $_, $foreign{$_.$l[0]},$foreign{$_.$l[1]});
}

fwiw,
Hein.
Mark H Smith
Advisor

Re: merge files and match columns

Could you explain what the script does, Muthukumar? It's not working for me..


Hein, could you also explain what the script does? It also did not work for me..
Leif Halvarsson_2
Honored Contributor

Re: merge files and match columns

Hi
Have a look at the "join" command. It is very useful for matching lines in two files when there is uneven number of lines or a "one to many relation".
Muthukumar_5
Honored Contributor

Re: merge files and match columns

Try this,

#!/bin/ksh
file1="file1";
file2="file2";
newfile="newfile";

> $newfile

while read line;
do

col=$(echo $line | sed 's/.*//')
loop=0
grep "$col" $file2 | while read nline;
do
echo $line $nline >> ${newfile}
loop=1
done
[[ $loop -eq 0 ]] && echo $line >> ${newfile}

done < file1

# end #

## Check ###

# cat > file2
Zone 13 K:Y 13
Zone 12 K:Y 12
Zone 11 K:Y 11
Zone 10 K:Y 10
Zone 9 K:Y 9
Zone 8 K:Y 8
Zone 7 K:Y 7
# cat file1
Zone 14 VDY 14
Zone 13 VDY 13
Zone 12 VDY 12
Zone 11 VDY 11
Zone 10 VDY 10
# cat scr.sh
file1="file1";
file2="file2";
newfile="newfile";

> $newfile

while read line;
do

col=$(echo $line | sed 's/.*//')
loop=0
grep "$col" $file2 | while read nline;
do
echo $line $nline
loop=1
done
[[ $loop -eq 0 ]] && echo $line

done < file1
# sh scr.sh
Zone 14 VDY 14
Zone 13 VDY 13 Zone 13 K:Y 13
Zone 12 VDY 12 Zone 12 K:Y 12
Zone 11 VDY 11 Zone 11 K:Y 11
Zone 10 VDY 10 Zone 10 K:Y 10

hth.
Easy to suggest when don't know about the problem!
john korterman
Honored Contributor

Re: merge files and match columns

Hi again,

you can try this script, using your input files as $1 and $2:

#!/usr/bin/sh
while read line1
do
F1=$(echo "$line1" | awk -F"" '{print $1}')
F11=$(echo "$line1" | awk -F"" '{print $2}')
while read line2
do
F2=$(echo "$line2" | awk -F"" '{print $1}')
F22=$(echo "$line2" | awk -F"" '{print $2}')
if [ "$F1" = "$F2" ]
then
echo "${F1}${F11}${F22}"
else
continue
fi
done < $2
done < $1


However, it takes years to execute....

regards,
John K.
it would be nice if you always got a second chance
Arturo Galbiati
Esteemed Contributor
Solution

Re: merge files and match columns

Hi Mark,
you can use the join command to solve your problem.
The files ghave to have a filed separator "|" (you can quickly chnage tabs to this by tr or vi) and the second file has to contain all the key. both files have to be sorted.
i.e.:
cat file1
Zone 10|V├Г Y 10
Zone 11|V├Г Y 11
Zone 12|V├Г Y 12
Zone 13|V├Г Y 13

cat file2
Zone 10|├Г :Y 10
Zone 11|├Г :Y 11
Zone 12|├Г :Y 12
Zone 13|├Г :Y 13
Zone 8|├Г :Y 8
Zone 9|├Г :Y 9

join -a2 -t"|" -j1 1 -j2 1 -o 2.1,1.2,2.2 file1 file2
Zone 10|V├Г Y 10|├Г :Y 10
Zone 11|V├Г Y 11|├Г :Y 11
Zone 12|V├Г Y 12|├Г :Y 12
Zone 13|V├Г Y 13|├Г :Y 13
Zone 8||├Г :Y 8
Zone 9||├Г :Y 9

join files by first field keeping records in file2 which haven't key in file1 writing in output file1.field1,file1.field2,file2.filed2

type man join for furhet info

Hope this solve your problem
Art
Hein van den Heuvel
Honored Contributor

Re: merge files and match columns

Mark>> Hein, could you also explain what the script does? It also did not work for me

Sure... I often do, but it was late. I also failed to add to check out 'join' as i intended. That may do the job, but with less control/options.

I suspected problems with unicode chars. I could not nicely 'paste' the data from your topic into a Unix window. Windows accepted it, but suggested to store as unicode. The partially pasted sample does work with my first script on a Unix box. Try that? For better help, and any future questions, be sure to ATTACH the real data (in a txt document?) as the forum munges tabs and spaces.

Anyways.... my first script:

1) loop over input arguments (allows for more than 2 lanugage files some day)
2) open current input file
3) loop through current input file
4) chop to drop newline for future 'append'.
5) if you see whitespace followed by "<", a word, and "> then it looks like a useful line.
6) take everything 'left' of the matched string a key for a list of english words.
7) create a list of translated words using the englisgh key, with the language appended to the key, storign the value for that combo.
8) when all the loops are done, look through all the english words (from whatever input file) and make a print line with the english, the looked-up value for the translated values (if any)!

You may wnat to replace the \s+ with \t for just a tab.

The second example dealt with the files stored as unicode in a crummy way by just treating each 16 bit unicode char as a single char plus '.' = any-char.

It also 'counts' each foreign language usage, with the main intent to just register that language.

After the input loops it then make and inventory of the languages seen and uses those to select the foreing language entries.
Currently hardcoded as just 2, but easily expanded to a loop over more language if that is ever needed.

hth,
Hein.
Mark H Smith
Advisor

Re: merge files and match columns

Many thanks for everyone's replies, answers, suggestions and explanations.

My file was encoded in UTF8 and then pasted from Linux into IE on a Windows box (in order to post here on ITRC). If you're a Chinese or Japanese reader you'll notice the characters don't appear quite right, but you get the general idea I think.

My file is several million lines long and after some testing on this snippet, I think I'm going to try the "join" solution -- very fast.