
grep for asian characters in UTF8 file

 
Mark H Smith
Advisor

I have a UTF-8 text file containing four languages: English, Chinese, Japanese, and Korean.

Each line has two tab-separated columns: an English sentence in the first column and its Chinese, Japanese, or Korean translation in the second.

However, some lines contain only English and are of no use to me, so I want to locate and discard the English-only lines that contain no valid Asian data.

cat -A lets me see certain escape codes in this file, such as tabs (^I) and carriage returns (^M$). The -A flag also shows that each Asian sentence begins with an uppercase M, which I assume is some code signalling a switch to Asian text.
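
For what it's worth, the M in that output is probably not a language-switch code. Assuming GNU coreutils cat, the -v option (which -A implies) prints every byte with the high bit set as "M-" followed by the character for the low seven bits, and every byte of a multi-byte UTF-8 character has that bit set. A minimal sketch, using the three UTF-8 bytes of one CJK character picked purely for illustration (show_bytes is a made-up helper name):

```shell
# U+4E2D is encoded in UTF-8 as the bytes E4 B8 AD (octal 344 270 255).
# cat -v renders each byte >= 0x80 as "M-" plus the byte with its high bit cleared.
show_bytes() {
    LC_ALL=C cat -v
}

# One ASCII word, a tab, then the three-byte character.
printf 'abc\t\344\270\255\n' | show_bytes
```

Each of the three bytes comes out as its own "M-" group, so any run of Asian text shows up as a string of "M-" sequences starting with an uppercase M.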

I guess what I am really looking for is a way to grep for Asian characters.
D Block 2
Respected Contributor

Re: grep for asian characters in UTF8 file

Mark, how about sending us some example output? Also, have you tried using:

cat -v

or:

strings

You might also want to check your shell environment variables:

$ set

just to verify your settings for UTF-8.

Golf is a Good Walk Spoiled, Mark Twain.
Mark H Smith
Advisor

Re: grep for asian characters in UTF8 file

I have attached a snippet of my file.

cat -v gives the same result (I'm just not sure how to interpret the escape codes, if any).

Running strings (Debian binutils) on it gives me only the English: no Asian characters, just empty space.

Output of locale:

LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8


snippet of my file: (see attachment if this appears garbled)
--------------------------
LS-3108 LS-3108

LS-3114 LS-3114
LS-3108 LS-3108

LS-3114 LS-3114
LS-3108 LS-3108

LS-3114 LS-3114
It must be Easter!
It must be Easter!

It's a 60's shindig...
It's a 60's shindig...

It's a baby bash!

It's a baby bash!
It's a birthday fete! ├и┬к ├з ├д┬╝ ├г ┬о├г ├ж┬б ├е
It's a birthday fete! ├м ├м ┬╝ ├н ├н ┬░
It's a party! ├е┬н ├г ┬й├г ├д┬╝ ├г ┬о├г ├г ┬╝├г ├г ┬г├г ┬╝
It's a party! ├н ├н ┬░!
It's the 4th of July.

It's the 4th of July.
It's the beginning of post-adolescent neuroses!

It's the beginning of post-adolescent neuroses!
It's time for a shower. ├з┬╡ ├е┬й ├е┬╝ ├д┬║ ├ж┬м┬б├д┬╝ ├г ┬о├г ├ж┬б ├е

It's time for a shower. ├м┬▓┬н├м┬▓┬й├м ┬е 3
It's a picnic... ├г ├г ┬п├г ├г ├г ┬п├г ┬л├з┬╡┬╢├е┬е┬╜├г ┬о├е┬н┬г├з┬п ├г ┬з├г ...
It's a picnic...
I love my phone! ├ж ├з ┬▒├ж ├з ├з ┬╡├и┬п !
I love my phone! ├н ┬┤├л ├н ┬░├м ┬┤ ├л ├л┬м┬┤ ├м┬в ├м ├м !
Running ├ж┬н┬г├е ┬и├и┬┐ ├и┬б
And I've been head over heels ever since. And I've been head over heels ever since.

And I've gone gaga over you.

file and linker output. ├н ├м ┬╝├м ├л┬з ├л ┬д├к┬│  ├м ├м ┬╡├л ├л ┬д.

And may you hop with happiness.

And Mom's never forgiven you for it!
Recipient Postal Code ├ж ┬╢├д┬┐┬б├д┬║┬║├й ┬о├ж ┬┐├з┬╝ ├з 

Recipient qualifier ├ж ┬╢├д┬╗┬╢├д┬║┬║├й ├е┬о ├з┬м┬ж

Recipient queue name length ├ж ┬╢├д┬╗┬╢├д┬║┬║├й ├е ├е ├з┬з┬░├й ┬┐├е┬║┬ж

Recipient Queue ├ж ┬╢├д┬╗┬╢├д┬║┬║├й ├е

Recipient routing address ├ж ┬╢├д┬╗┬╢├д┬║┬║├и┬╖┬п├з ┬▒├е ┬░├е

{Recipient's Home Telephone #} {├ж ┬╢├д┬╗┬╢├д┬║┬║├е┬о┬╢├е┬║┬н├з ┬╡├и┬п ├е ┬╖├з  }

{Recipient's Name} {├ж ┬╢├д┬╗┬╢├д┬║┬║├е┬з ├е }

{Recipient's Office Location} {├ж ┬╢├д┬╗┬╢├д┬║┬║├е ├е ┬м├е ┬░├з ┬╣}

{Recipient's Office Telephone #} {├ж ┬╢├д┬╗┬╢├д┬║┬║├е ├е ┬м├е┬о┬д├з ┬╡├и┬п ├е ┬╖├з  }
--------------------------
Vibhor Kumar Agarwal
Esteemed Contributor

Re: grep for asian characters in UTF8 file

Will this work?

grep -v "^[a-z,A-Z,tabs,spaces]*$"
Vibhor Kumar Agarwal
Mark H Smith
Advisor

Re: grep for asian characters in UTF8 file

Afraid not. I still get all those unwanted ASCII-only lines in the output.
Vibhor Kumar Agarwal
Esteemed Contributor

Re: grep for asian characters in UTF8 file

I left out the words "something like this".

I don't know the exact syntax, but I think you get the meaning.

Just grep -v all lines that contain only a-z, A-Z, spaces, tabs, and numbers.
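
One way this idea might be spelled out, as a minimal sketch assuming GNU grep: under LC_ALL=C every byte is treated as its own character and the [:print:] and [:space:] classes cover only ASCII, so a line made up solely of those classes is ASCII-only and -v drops it. The function name and the sample lines are made up for illustration.

```shell
# Drop lines that contain only ASCII printable characters and whitespace
# (tabs and the trailing carriage return count as whitespace).
# -a keeps grep from treating the high-bit bytes as binary data.
drop_ascii_only() {
    LC_ALL=C grep -av '^[[:print:][:space:]]*$' "$@"
}

sample=$(mktemp)
# Line 1 is English in both columns; line 2 has a tab, then Korean as UTF-8 bytes.
printf 'It is a party!\tIt is a party!\r\n' >  "$sample"
printf 'It is a party!\t\355\214\214\355\213\260!\r\n' >> "$sample"

# Only the second line should survive.
drop_ascii_only "$sample"
rm -f "$sample"
```

Punctuation and digits are covered by [:print:], so they do not need to be listed separately.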
Vibhor Kumar Agarwal
Mark H Smith
Advisor

Re: grep for asian characters in UTF8 file

I want to filter out all ASCII-only lines. To do that, don't I want to grep for the Asian characters rather than those between A and Z?
Vibhor Kumar Agarwal
Esteemed Contributor

Re: grep for asian characters in UTF8 file

Have a look into this:

I think GNU grep has an option to look for the ASCII values of characters.
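
The option meant here may be GNU grep's -P (Perl-compatible regular expressions), which accepts \xHH escapes for character values; that is a guess, since the post does not name it. A minimal sketch, assuming a GNU grep built with PCRE support, that keeps lines containing at least one byte outside the ASCII range 0x00-0x7F (keep_non_ascii is a hypothetical helper name):

```shell
# Keep only lines that contain at least one non-ASCII byte.
# LC_ALL=C makes -P match byte by byte; -a avoids "binary file" handling.
keep_non_ascii() {
    LC_ALL=C grep -aP '[^\x00-\x7F]' "$@"
}

# The first line is plain ASCII; the second carries one CJK character
# (U+4E2D, UTF-8 bytes E4 B8 AD) after the tab.
printf 'It is a picnic...\nI love my phone!\t\344\270\255\n' | keep_non_ascii
```

This matches any non-ASCII byte, not only Chinese, Japanese, or Korean text, so accented Latin letters or typographic quotes in the English column would also count as a hit.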
Vibhor Kumar Agarwal
Karthick K S
Frequent Advisor

Re: grep for asian characters in UTF8 file

Hi Mark,

This is not related to your question, but I need some information about Korean language setup.

1. If I use the iconv command, it converts English to Korean, but when I try through the keyboard (I changed to Korean fonts in Windows), I get no Korean input from the keyboard (only English is accepted). Please help me with this.

john korterman
Honored Contributor

Re: grep for asian characters in UTF8 file

Hi Mark,


perhaps you could instead look for a certain line structure, e.g. lines ending with a certain code not followed by any text.
For instance, based on your attachment, start by discarding lines ending in either "" or "" followed by only a single space,
and afterwards throw away the rest, among other lines those ending in "LS-31" followed by two digits.
If that sounds workable, try the single-line example below as a starting point:

# grep -vE ".*[|] $" ./infile| grep -v "LS-31[0-9][0-9]"
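
A different structural test, not the pipeline above, could look at the columns themselves. The sketch below assumes the file really is tab-separated with CRLF line endings, and that an English-only line either has no second column or repeats the first column in the second; keep_translated is a made-up name, and a translation that happens to be identical to its source would be dropped as well.

```shell
# Keep lines whose second tab-separated column is non-empty
# and differs from the first column.
keep_translated() {
    LC_ALL=C awk -F '\t' '
        { line = $0; sub(/\r$/, "") }         # save the raw line, strip the CR
        $2 != "" && $2 != $1 { print line }   # print the untouched original
    ' "$@"
}

# Duplicate columns, a single column, and a real translation (UTF-8 bytes).
printf 'LS-3108\tLS-3108\r\nIt must be Easter!\r\nRunning\t\344\270\255\r\n' |
    keep_translated
```

Only the last sample line is printed, carriage return included, so the output keeps the file's original line endings.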


regards,
John K.
it would be nice if you always got a second chance