Re: merge files and match columns

Mark H Smith · ‎08-30-2005

I have two tab separated data sources I want to merge together. Both files contain matching first columns, however in one file there are extra items throughout that throw the alignment off when sorted.

#:::::::::::::::::::::::::::::::::::::::::::::::
#:::: File1 (en -> cn)
#:::::::::::::::::::::::::::::::::::::::::::::::
Zone 14 鍖哄煙 14
Zone 13 鍖哄煙 13
Zone 12 鍖哄煙 12
Zone 11 鍖哄煙 11
Zone 10 鍖哄煙 10
ZWZimbabwe ZW娲儏反甯冮煢
ZV Audio ZV 闊抽
Zoom Video Vol 缂╂斁瑙嗛闊抽噺
Zoom Video Mute 缂╂斁瑙嗛闈欓煶
Zoom, Up and In 缂╂斁锛屽悜涓婂拰鏀惧偉
Zoom to: 髊剧ず姣斾緥:
Zoom to full screen 缂╂斁鑷冲叏灞
Zoom to 50% 缂╂斁鑷?50%
Zoom to 100% 缂╂斁鑷?100%
Zoom Timeline Out 缂╁皬鏃堕棿绾
Zoom Timeline In 鏀惧偉鏃堕棿绾?

#:::::::::::::::::::::::::::::::::::::::::::::::
#:::: FILE 2 (en -> jp)
#:::::::::::::::::::::::::::::::::::::::::::::::
Zone 13 闋樺煙 13
Zone 12 闋樺煙 12
Zone 11 闋樺煙 11
Zone 10 闋樺煙 10
Zone 9 闋樺煙 9
Zone 8 闋樺煙 8
Zone 7 闋樺煙 7
Z-Order 閲嶃伃闋
Z-Order: 閲嶃伃闋?
Zoom Video Vol 銈恒兗銉?銉撱儑銈煶閲
Zoom Video Mute 銈恒兗銉?銉撱儑銈?銉熴儱銉笺儓
Zoom to selection. 鍊嶇巼銈掗伕鎶炪仚銈嬨€
Zoom To: 鎸囧畾鍊嶇巼:
Zoom to Range 绡勫洸銇搞偤銉笺儬
Zoom to Page Width 銈儯銉炽儛銈广伄骞呫伀琛ㄧ偢
Zoom Tool 銈恒兗銉?銉勩兗銉
Zoom to selection. 鍊嶇巼銈掗伕鎶炪仚銈嬨€
Zoom to 琛ㄧ偢绡勫洸
Zoom To: 鎸囧畾鍊嶇巼:
Zoom To: 銈恒兗銉犲€嶇巼:
Zoom to Range 绡勫洸銇搞偤銉笺儬
Zoom to Page Width 銈儯銉炽儛銈广伄骞呫伀琛ㄧ偢
Zoom Tool 銈恒兗銉?銉勩兗銉
Zoom to Full Page 銉氥兗銈稿叏浣撱倰琛ㄧ偢
Zoom to 50% 銈恒兗銉?(50%)
Zoom to 100% 銈恒兗銉?(100%)
Zoom Timeline Out 銈裤偆銉犮儵銈傖兂銇府灏忚〃绀?
Zoom Timeline In 銈裤偆銉犮儵銈傖兂銇嫛澶觥ㄧ ?

So I want to end up with an english sentence in column 1, and its translated values (in Japanese & Chinese) in columns 2 & 3 - all on one line. Ideally I'd end up with all three languages on one line, but as you can see, to begin with the files are different and I can't always expect to have data in every language.

#:::::::::::::::::::::::::::::::::::::::::::::::
#:::: NewFile (en -> cn -> jp)
#:::::::::::::::::::::::::::::::::::::::::::::::
Zone 14 鍖哄煙 14
Zone 13 鍖哄煙 13 闋樺煙 13
Zone 12 鍖哄煙 12 闋樺煙 12
Zone 11 鍖哄煙 11 闋樺煙 11
Zone 10 鍖哄煙 10 闋樺煙 10
Zone 9 闋樺煙 9
Zone 8 闋樺煙 8
Zone 7 闋樺煙 7
ZWZimbabwe ZW娲儏反甯冮煢
ZV Audio ZV 闊抽
Zoom Video Vol 缂╂斁瑙嗛闊抽噺銈恒兗銉?銉撱儑銈煶閲
Zoom Video Mute 缂╂斁瑙嗛闈欓煶銈恒兗銉?銉撱儑銈?銉熴儱銉笺儓

Rather than manually aligning the content in an editor, here is my idea: run a script that

Compares line by line:
1. Reads the first column of File1
2. Looks for a match in File2; grabs the matching line, sending it to NewFile.

Muthukumar_5 · ‎08-30-2005

First column of File1 will sometimes unique one as,
Zone. We can get upto and look for a match in File2.

while read line; do

match=$(echo $line | sed -e "s/.*//")
grep $match >

done <

hth.

Easy to suggest when don't know about the problem!

Hein van den Heuvel · ‎08-30-2005

I had a little trouble with your data as it seems to be unicode no?
Here is one approach in PERL with normal text.
How did you wond the part sorted?
Provide input files as arguments:

perl x.p file1 file2 ...

---- x.p ----
while ($file = shift @ARGV) {
open (FILE, "<$file") or die "Failed to open $file";
while () {
chop;
if (/\s+(<\w+>)/) {
$english{$`}++;
$foreign{$`.$1} = "$1$'";
}
}
}
foreach (sort keys %english) {
printf ("%s\t%s\t%s\n", $_, $foreign{$_.""},$foreign{$_.""

});
}

Here is a waekish attempt on doing unicode with normal libraries. It adds dynamically finding the languages.

while ($file = shift @ARGV) {
open (FILE, "<$file") or die "Failed to open $file";
while () {
chop;
chop;
if (/\s.(<.\w.\w.>.)/) {
$english{$`}++;
$languages{$1}++;
$foreign{$`.$1} = "$1$'";
}
}
}
foreach (sort keys %languages){
$l[$i++]=$_;
}

foreach (sort keys %english) {
printf ("%-50s\t%30s\t%30s\n", $_, $foreign{$_.$l[0]},$foreign{$_.$l[1]});
}

fwiw,
Hein.

Mark H Smith · ‎08-30-2005

Could you explain what the script does, Muthukumar? It's not working for me..

Hein, could you also explain what the script does? It also did not work for me..

Leif Halvarsson_2 · ‎08-30-2005

Hi
Have a look at the "join" command. It is very useful for matching lines in two files when there is uneven number of lines or a "one to many relation".

Muthukumar_5 · ‎08-30-2005

Try this,

#!/bin/ksh
file1="file1";
file2="file2";
newfile="newfile";

> $newfile

while read line;
do

col=$(echo $line | sed 's/.*//')
loop=0
grep "$col" $file2 | while read nline;
do
echo $line $nline >> ${newfile}
loop=1
done
[[ $loop -eq 0 ]] && echo $line >> ${newfile}

done < file1

# end #

## Check ###

# cat > file2
Zone 13 K:Y 13
Zone 12 K:Y 12
Zone 11 K:Y 11
Zone 10 K:Y 10
Zone 9 K:Y 9
Zone 8 K:Y 8
Zone 7 K:Y 7
# cat file1
Zone 14 VDY 14
Zone 13 VDY 13
Zone 12 VDY 12
Zone 11 VDY 11
Zone 10 VDY 10
# cat scr.sh
file1="file1";
file2="file2";
newfile="newfile";

> $newfile

while read line;
do

col=$(echo $line | sed 's/.*//')
loop=0
grep "$col" $file2 | while read nline;
do
echo $line $nline
loop=1
done
[[ $loop -eq 0 ]] && echo $line

done < file1
# sh scr.sh
Zone 14 VDY 14
Zone 13 VDY 13 Zone 13 K:Y 13
Zone 12 VDY 12 Zone 12 K:Y 12
Zone 11 VDY 11 Zone 11 K:Y 11
Zone 10 VDY 10 Zone 10 K:Y 10

hth.

Easy to suggest when don't know about the problem!

john korterman · ‎08-30-2005

Hi again,

you can try this script, using your input files as $1 and $2:

#!/usr/bin/sh
while read line1
do
F1=$(echo "$line1" | awk -F"" '{print $1}')
F11=$(echo "$line1" | awk -F"" '{print $2}')
while read line2
do
F2=$(echo "$line2" | awk -F"" '{print $1}')
F22=$(echo "$line2" | awk -F"" '{print $2}')
if [ "$F1" = "$F2" ]
then
echo "${F1}${F11}${F22}"
else
continue
fi
done < $2
done < $1

However, it takes years to execute....

regards,
John K.

it would be nice if you always got a second chance

Arturo Galbiati · ‎08-31-2005

Hi Mark,
you can use the join command to solve your problem.
The files ghave to have a filed separator "|" (you can quickly chnage tabs to this by tr or vi) and the second file has to contain all the key. both files have to be sorted.
i.e.:
cat file1
Zone 10|VÃ Y 10
Zone 11|VÃ Y 11
Zone 12|VÃ Y 12
Zone 13|VÃ Y 13

cat file2
Zone 10|Ã :Y 10
Zone 11|Ã :Y 11
Zone 12|Ã :Y 12
Zone 13|Ã :Y 13
Zone 8|Ã :Y 8
Zone 9|Ã :Y 9

join -a2 -t"|" -j1 1 -j2 1 -o 2.1,1.2,2.2 file1 file2
Zone 10|VÃ Y 10|Ã :Y 10
Zone 11|VÃ Y 11|Ã :Y 11
Zone 12|VÃ Y 12|Ã :Y 12
Zone 13|VÃ Y 13|Ã :Y 13
Zone 8||Ã :Y 8
Zone 9||Ã :Y 9

join files by first field keeping records in file2 which haven't key in file1 writing in output file1.field1,file1.field2,file2.filed2

type man join for furhet info

Hope this solve your problem
Art

Hein van den Heuvel · ‎08-31-2005

Mark>> Hein, could you also explain what the script does? It also did not work for me

Sure... I often do, but it was late. I also failed to add to check out 'join' as i intended. That may do the job, but with less control/options.

I suspected problems with unicode chars. I could not nicely 'paste' the data from your topic into a Unix window. Windows accepted it, but suggested to store as unicode. The partially pasted sample does work with my first script on a Unix box. Try that? For better help, and any future questions, be sure to ATTACH the real data (in a txt document?) as the forum munges tabs and spaces.

Anyways.... my first script:

1) loop over input arguments (allows for more than 2 lanugage files some day)
2) open current input file
3) loop through current input file
4) chop to drop newline for future 'append'.
5) if you see whitespace followed by "<", a word, and "> then it looks like a useful line.
6) take everything 'left' of the matched string a key for a list of english words.
7) create a list of translated words using the englisgh key, with the language appended to the key, storign the value for that combo.
8) when all the loops are done, look through all the english words (from whatever input file) and make a print line with the english, the looked-up value for the translated values (if any)!

You may wnat to replace the \s+ with \t for just a tab.

The second example dealt with the files stored as unicode in a crummy way by just treating each 16 bit unicode char as a single char plus '.' = any-char.

It also 'counts' each foreign language usage, with the main intent to just register that language.

After the input loops it then make and inventory of the languages seen and uses those to select the foreing language entries.
Currently hardcoded as just 2, but easily expanded to a loop over more language if that is ever needed.

hth,
Hein.

Mark H Smith · ‎08-31-2005

Many thanks for everyone's replies, answers, suggestions and explanations.

My file was encoded in UTF8 and then pasted from Linux into IE on a Windows box (in order to post here on ITRC). If you're a Chinese or Japanese reader you'll notice the characters don't appear quite right, but you get the general idea I think.

My file is several million lines long and after some testing on this snippet, I think I'm going to try the "join" solution -- very fast.

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Discussions

Forums

Discussions

Forums

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Re: merge files and match columns

merge files and match columns