Re: grep for asian characters in UTF8 file

Mark H Smith · ‎08-27-2005

I have a UTF-8 text file containing four languages: English, Chinese, Japanese, and Korean.

Each line has two tab-separated columns: an English sentence in the first column and its Chinese, Japanese, or Korean translation in the second.

However, some lines contain only English and are of no use to me, so I want to locate and discard the English-only lines that contain no valid Asian data.

cat -A lets me see certain escape codes in this file, such as tabs (^I) and carriage returns (^M$). The -A flag also shows that each Asian sentence begins with an uppercase M; I'm assuming that represents some code signalling a switch to Asian text.

I guess what I am really looking for is a way to grep for Asian characters.
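
A note on the uppercase M: with -v (which -A implies), cat prints any byte with the high bit set in M- notation, so the M is most likely how cat renders the UTF-8 bytes themselves rather than a code stored in the file. A minimal sketch, assuming a cat that supports -v; show_bytes is a hypothetical helper name, and the three bytes e3 81 ab are used only as a sample UTF-8 character:

```shell
# Hypothetical helper: render stdin the way cat -v does.
show_bytes() {
    cat -v
}

# e3 81 ab written as octal escapes so the script itself stays pure ASCII.
printf '\343\201\253\n' | show_bytes
# prints: M-cM-^AM-+
```

Plain ASCII text passes through show_bytes unchanged, which is why only the translated column shows the M.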

D Block 2 · ‎08-27-2005

Mark, how about sending us some example output? Also, have you tried using:

cat -v

or:

strings

You might also want to verify your shell environment variables:

$ set

just to check your settings for UTF-8.

Golf is a Good Walk Spoiled, Mark Twain.

Mark H Smith · ‎08-27-2005

I have attached a snippet of my file.

cat -v gives the same result. (I'm just not sure how to interpret the escape codes, if any.)

Running strings (Debian binutils) on it gives me only the English: no Asian characters, just empty space.

Output of locale:

LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8

snippet of my file: (see attachment if this appears garbled)
--------------------------
LS-3108 LS-3108

LS-3114 LS-3114
LS-3108 LS-3108

LS-3114 LS-3114
LS-3108 LS-3108

LS-3114 LS-3114
It must be Easter!
It must be Easter!

It's a 60's shindig...
It's a 60's shindig...

It's a baby bash!

It's a baby bash!
It's a birthday fete! èª ç ä¼ ã ®ã æ¡ å
It's a birthday fete! ì ì ¼ í í °
It's a party! å ã ©ã ä¼ ã ®ã ã ¼ã ã £ã ¼
It's a party! í í °!
It's the 4th of July.

It's the 4th of July.
It's the beginning of post-adolescent neuroses!

It's the beginning of post-adolescent neuroses!
It's time for a shower. çµ å© å¼ äº æ¬¡ä¼ ã ®ã æ¡ å

It's time for a shower. ì²ì²©ì ¥ 3
It's a picnic... ã ã ¯ã ã ã ¯ã «çµ¶å¥½ã ®å£ç¯ ã §ã ...
It's a picnic...
I love my phone! æ ç ±æ ç ç µè¯ !
I love my phone! í ´ë í °ì ´ ë ë¬´ ì¢ ì ì !
Running æ£å ¨è¿ è¡
And I've been head over heels ever since. And I've been head over heels ever since.

And I've gone gaga over you.

file and linker output. í ì ¼ì ë§ ë ¤ê³ ì ì µë ë ¤.

And may you hop with happiness.

And Mom's never forgiven you for it!
Recipient Postal Code æ ¶ä¿¡äººé ®æ ¿ç¼ ç

Recipient qualifier æ ¶ä»¶äººé å® ç¬¦

Recipient queue name length æ ¶ä»¶äººé å å ç§°é ¿åº¦

Recipient Queue æ ¶ä»¶äººé å

Recipient routing address æ ¶ä»¶äººè·¯ç ±å °å

{Recipient's Home Telephone #} {æ ¶ä»¶äººå®¶åºç µè¯ å ·ç }

{Recipient's Name} {æ ¶ä»¶äººå§ å }

{Recipient's Office Location} {æ ¶ä»¶äººå å ¬å °ç ¹}

{Recipient's Office Telephone #} {æ ¶ä»¶äººå å ¬å®¤ç µè¯ å ·ç }
--------------------------

Vibhor Kumar Agarwal · ‎08-28-2005

Will this work:

grep -v "^[a-z,A-Z,tabs,spaces]*$"

Vibhor Kumar Agarwal

Mark H Smith · ‎08-28-2005

Afraid not. I still get all those unwanted ASCII-only lines in the output.

Vibhor Kumar Agarwal · ‎08-28-2005

I left out the words "something like this".

I don't know the exact syntax, but I think you get the meaning.

Just grep -v all lines which contain only a-z, A-Z, spaces, tabs, and numbers.
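
The idea above can be written out in working form. A minimal sketch, assuming a grep that supports POSIX character classes and that forcing the C locale makes it compare bytes rather than UTF-8 characters; drop_ascii_only is a hypothetical helper name, and "ASCII-only" here means printable ASCII plus whitespace (which covers the tabs and carriage returns in the file):

```shell
# Hypothetical helper: pass through only lines that contain at least one
# byte outside printable ASCII and whitespace.
drop_ascii_only() {
    # In the C locale, [:print:] covers 0x20-0x7e and [:space:] covers
    # tab, CR and similar, so every UTF-8 byte (0x80 and up) falls outside.
    LC_ALL=C grep -v '^[[:print:][:space:]]*$'
}

# Two sample lines: one English-only, one whose second column holds the
# bytes e3 81 ab (written as octal escapes).
printf 'Its a picnic\tIts a picnic\r\nIts a party\t\343\201\253\r\n' | drop_ascii_only
```

Only the second sample line should come out. Lines holding other control bytes would also be kept, so this is an approximation of "contains Asian text".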

Vibhor Kumar Agarwal

Mark H Smith · ‎08-28-2005

I want to filter out all ASCII-only lines. To accomplish this, don't I want to grep for the Asian characters rather than those between A-Z?

Vibhor Kumar Agarwal · ‎08-28-2005

Have a look into this:

I think GNU grep has an option to look for ASCII values of characters.

Vibhor Kumar Agarwal

Karthick K S · ‎08-28-2005

Hi Mark,

This is not related to your question, but I need some information regarding Korean language setup.

1. If I use the iconv command, it converts English to Korean, but if I try through the keyboard (I changed to Korean fonts in Windows), I get no Korean input from the keyboard (only English is accepted), so please help me with this.

john korterman · ‎08-28-2005

Hi Mark,

Perhaps you could instead look for a certain line structure, e.g. lines ending with a certain code not followed by any text.
For instance, based on your attachment, start by discarding lines ending in either "" or "" followed by only a single space,
and afterwards throw away the rest, among them the lines ending in "LS-31" followed by two digits.
If that is an idea, try the single-line example below as a starting point:

# grep -vE ".*[|] $" ./infile| grep -v "LS-31[0-9][0-9]"

regards,
John K.

it would be nice if you always got a second chance

Mark H Smith · ‎08-28-2005

John,

This will help me to trim down the file (maybe by 1/3). But there are also lines with an English sentence in both columns, such as this one (hidden in the middle of my example):

And I've been head over heels ever since. And I've been head over heels ever since.

For whatever reason, lines like that didn't get translated to the Asian language, so I want to get rid of them.

Mark H Smith · ‎08-29-2005

Karthick,

I couldn't completely understand your English, but hopefully you can understand mine. (Happy to answer your question, even though we're getting a little off-topic here.)

Unfortunately there is no freeware utility (that I know of) other than iconv for converting codepages on Windows. The standard Korean Unix codepage is "euckr", and the standard Korean Windows codepage is "949". To verify which codepage you're running under, open a CMD window and type 'chcp'.

But I don't use those old codepages; personally I prefer UTF-8 for everything (locale, input, files, filesystems), since it allows me to mix different languages together.

I think your question was about configuring Windows to input Korean.

You won't need to convert any files with iconv in order to do this:
http://webhard.daewoo.com/jsp/hangul.html

john korterman · ‎08-29-2005

Hi again,

hmmmmm, you have to know the structure and byte representation of the various languages used in your source.....

This is a guess based on a binary transfer of your attachment from which the translated line "it's a picnic" was xd'd to this result:

0000000 3c53 6567 204c 3d45 4e5f 5553 3e49 7427
0000010 7320 6120 7069 636e 6963 2e2e 2e09 3c53
0000020 6567 204c 3d4a 415f 4a50 3ee3 8394 e382
0000030 afe3 838b e383 83e3 82af e381 abe7 b5b6
0000040 e5a5 bde3 81ae e5ad a3e7 af80 e381 a7e3
0000050 8199 2e2e 2e0d 0a00

The English word "picnic" is
7069 636e 6963
in the second line.

My guess is that Japanese in your attachment uses three bytes per character, of which the first byte functions as a sort of classifier, e.g. there seems to be a systematic sequence of two bytes after "e3":
e3 83 94
e3 82 af
e3 83 8b
e3 83 83
e3 82 af
e3 81 ab

followed by these bytes:

e7 b5 b6
e5 a5 bd
e3 81 ae
e5 ad a3
e7 af 80

of which the last part forms an almost rhyme scheme-like structure (haiku?).

The point is of course that you have to know what to look for. You can xd each line and grep for significant, classifying bytes that are used for the languages only, but you have to consult your source in order to find out what that could be.
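
Since the file is UTF-8, the regularity in those triplets is what the encoding prescribes: the top bits of each byte say whether it is plain ASCII, the first byte of a multi-byte character, or a continuation byte. A minimal sketch of that classification, assuming an od that supports -An -v -tx1; utf8_roles is a hypothetical helper name, and it labels bytes by range only, without validating the sequences:

```shell
# Hypothetical helper: label every byte of stdin by its UTF-8 role,
# judged from the first hex digit alone (no validation of sequences).
utf8_roles() {
    od -An -v -tx1 | tr -s ' ' '\n' | while read -r byte
    do
        case $byte in
            '')      ;;                           # blank token from od's padding
            [0-7]?)  echo "$byte ascii" ;;        # 00-7f
            [89ab]?) echo "$byte continuation" ;; # 80-bf
            [cd]?)   echo "$byte lead-of-2" ;;    # c0-df
            e?)      echo "$byte lead-of-3" ;;    # e0-ef
            f?)      echo "$byte lead-of-4" ;;    # f0-ff
        esac
    done
}

# "a" followed by e3 81 ab, one of the triplets listed above.
printf 'a\343\201\253' | utf8_roles
# prints:
# 61 ascii
# e3 lead-of-3
# 81 continuation
# ab continuation
```

Every byte of a multi-byte character is 0x80 or higher, so testing for any such byte is enough to tell a translated line from an ASCII-only one.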

regards,
John K.

it would be nice if you always got a second chance

Mark H Smith · ‎08-29-2005

Well, I clicked and downloaded my own attachment to try what you did. I then whittled it down to a few raw Asian words (3 for each language) and saved them as files. Since my system doesn't come with xd, I used od -h to produce the following:

#Japanese
èª ç ä¼ ã ®ã æ¡ å
0000000 aae8 e795 9f94 bce4 e39a ae81 81e3 e694
0000020 88a1 86e5 0a85
0000026

å ã ©ã ä¼ ã ®ã ã ¼ã ã £ã ¼
0000000 ade5 e390 a981 82e3 e482 9abc 81e3 e3ae
0000020 9183 83e3 e3bc 8683 82e3 e3a3 bc83 000a
0000037

çµ å© å¼ äº æ¬¡ä¼ ã ®ã æ¡ å
0000000 b5e7 e590 9aa9 bce5 e48f 8cba ace6 e4a1
0000020 9abc 81e3 e3ae 9481 a1e6 e588 8586 000a
0000037

#Korean
ì ì ¼ í í °
0000000 83ec ec9d bc9d ed20 8c8c 8bed 0ab0
0000016

í ´ë í °ì ´ ë ë¬´ ì¢ ì ì
0000000 9ced ebb4 808c 8fed ecb0 b49d eb20 8884
0000020 aceb 20b4 a2ec ec8b 8495 9aec 0a94
0000036

ì²ì²©ì ¥
0000000 b2ec ecad a9b2 9eec 0aa5
0000012

#Chinese
æ ç ±æ ç ç µè¯
0000000 88e6 e791 b188 88e6 e791 849a 94e7 e8b5
0000020 9daf 000a
0000023

æ ¶ä¿¡äººé ®æ ¿ç¼ ç
0000000 94e6 e4b6 a1bf bae4 e9ba ae82 94e6 e7bf
0000020 96bc a0e7 0a81
0000026

æ ¶ä»¶äººå å ¬å®¤ç µè¯ å ·ç
0000000 94e6 e4b6 b6bb bae4 e5ba 9e8a 85e5 e5ac
0000020 a4ae 94e7 e8b5 9daf 8fe5 e7b7 81a0 000a
0000037

Although I'm not really sure what to look for.
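
One thing to check in the dumps above: od -h prints 16-bit words in the machine's byte order, so on a little-endian PC each pair of bytes appears swapped (aae8 stands for the bytes e8 aa). A minimal sketch for dumping one byte at a time instead, assuming an od that supports -An -v -tx1; hexbytes is a hypothetical helper name:

```shell
# Hypothetical helper: print stdin as space-separated hex bytes in file order.
hexbytes() {
    # The command substitution is deliberately unquoted so that echo
    # collapses od's padding and line breaks into single spaces.
    echo $(od -An -v -tx1)
}

# e8 aa 95 are the first three bytes of the first Japanese sample above.
printf '\350\252\225' | hexbytes
# prints: e8 aa 95
```

Read this way, the samples show the same kind of three-byte groups as the xd output earlier in the thread.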

john korterman · ‎08-29-2005

Hi again,

this is a typographical arrangement of od -x of the Korean version of "I love my phone!" on my system:

ed 9c b4
eb 8c 80
ed 8f b0
ec 9d b4
20
eb 84 88
eb ac b4
20
ec a2 8b
ec 95 84
ec 9a 94

21 20 0d
0a 00

My guess is that the characters are represented by byte triplets starting with either "eb", "ec", or "ed"; "20" is probably space and "21" the exclamation mark - it might very well be wrong, but you have to ask your supplier.

Since you did not get something similar, I suspect yours was not performed as a binary download.

regards,
John K.

it would be nice if you always got a second chance

Mark H Smith · ‎08-29-2005

Okay.
Still looking for a solution on this one..

Karthick K S · ‎08-29-2005

Dear Mark,

Thanks for your reply. I will explain more clearly.

1. I have Korean and Japanese databases on HP-UX 11.11 and give input from the keyboard. (I don't have a Korean keyboard; I am changing the font in the terminal from Windows.) When I use Japanese it works, but when I try Korean it does not. Please help me out with this, or you can give me your Yahoo ID.

Mark H Smith · ‎08-29-2005

Karthick,
On Yahoo I am known as eekarum88; give me a call if you want. Did you look at the site I linked? It explains pretty well how to set up Korean on a Windows box.

My original question still stands:

1. There are two columns in my file, usually with ASCII in the left column and CJK in the right column.
2. But sometimes there is ASCII English or nothing but a tag in the right column. In this case, I want to delete the entire line.
3. I'm only interested in keeping lines with a valid English sentence and its CJK translation.

I had thought grep could do this, but maybe not. Would perl be the answer?
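
grep can express the three requirements above. A minimal sketch, assuming the two columns are separated by a single tab and that the C locale makes grep match bytes; keep_translated is a hypothetical helper name, and it treats any byte outside printable ASCII and whitespace in the right column as CJK text, which is an approximation:

```shell
# Hypothetical helper: keep lines whose right-hand column (everything after
# the first tab) contains at least one byte outside printable ASCII.
keep_translated() {
    tab=$(printf '\t')
    # first column (no tabs), the tab, then any text ending in a non-ASCII byte
    LC_ALL=C grep "^[^${tab}]*${tab}.*[^[:print:][:space:]]"
}

# Three sample lines: translated, English in both columns, and no second column.
printf 'Running\t\343\201\253\r\nhead over heels\thead over heels\r\nbaby bash\r\n' | keep_translated
```

Only the first sample line should survive. Redirecting the output to a new file leaves the original untouched for comparison.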

john korterman · ‎08-29-2005

Hi,

try this script, using your input file as $1:

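#!/usr/bin/sh
# Usage: script infile
# Sorts each line of infile into ascii_yt or strange_yt.
while IFS= read -r line
do
    # Remove "=" and all whitespace from the line before comparing.
    LINE_SAVE="$(echo "$line" | tr -d "=" | tr -d '[:space:]')"
    # strings prints only the printable runs it finds in its input.
    STRINGS_LINE="$(echo "$LINE_SAVE" | strings)"
    # If nothing was lost, the line held printable characters only.
    if [ "$LINE_SAVE" = "$STRINGS_LINE" ]
    then
        echo "$line contains ascii only"
        echo "$line" >>ascii_yt
    else
        echo "$line contains strange chars"
        echo "$line" >>strange_yt
    fi
done <"$1"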

and check the generated ascii_yt and strange_yt files.....

regards,
John K.

it would be nice if you always got a second chance

Mark H Smith · ‎08-30-2005

Thanks for the script John, this works.
