grep for asian characters in UTF8 file
08-27-2005 03:13 AM
Each line has 2 columns: an English sentence in the first column and its Chinese, Japanese, or Korean translation in the second column (tab separated).
However, some lines contain only English and are of no use to me. So I want to locate and discard the English-only lines containing no valid Asian data.
cat -A allows me to see certain escape codes in this file, such as tabs (^I) and carriage returns (^M$), and the -A flag also shows me that each Asian sentence begins with an uppercase M. I'm assuming that represents some code signalling a switch to Asian text.
I guess what I am really looking for is a way to grep for Asian characters.
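A note on the uppercase M: cat -A and cat -v print every byte whose high bit is set as "M-" followed by the character for the low seven bits, so the M marks each individual byte of a multi-byte UTF-8 character rather than a switch code. The sketch below shows this for a single Korean syllable; it assumes a cat that supports -v (GNU coreutils does), and show_high_bytes is a made-up helper name.

```shell
#!/bin/sh
# show_high_bytes: render stdin the way cat -v does, so bytes >= 0x80 show up
# in "M-" notation instead of as raw UTF-8.
show_high_bytes() {
    cat -v
}

# The three bytes ed 9c b4 (one Hangul syllable), written as octal escapes.
printf '\355\234\264' | show_high_bytes   # prints M-mM-^\M-4
echo
```

Three "M-" groups for one character means three high-bit bytes, which is what a UTF-8 encoded CJK character usually looks like.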
08-27-2005 04:27 PM
cat -v
or have you also tried using:
strings
Also, you might want to verify your Shell Environment variables:
$ set
Just to verify your settings for UTF.
08-27-2005 06:58 PM
cat -v gives the same result (I'm just not sure how to interpret the escape codes, if any).
Running strings (Debian binutils) on it only gives me the English: no Asian characters, just empty space.
Output of locale:
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8
snippet of my file: (see attachment if this appears garbled)
--------------------------
--------------------------
08-28-2005 04:56 PM
grep -v '^[a-zA-Z[:space:]]*$'
08-28-2005 05:18 PM
08-28-2005 09:06 PM
I don't know the exact syntax, but I think you get the meaning.
Just use grep -v to drop all lines that contain only a-z, A-Z, spaces, tabs, and numbers.
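Spelled out as a runnable command, that idea might look like the sketch below. It assumes a POSIX grep run in the C locale, so that [:print:] and [:space:] cover only ASCII; drop_ascii_only is a made-up helper name. Every line that consists solely of printable ASCII and whitespace (including the trailing carriage return) is dropped, so whatever remains contains at least one other byte.

```shell
#!/bin/sh
# drop_ascii_only: read lines on stdin and print only those that contain at
# least one byte outside printable ASCII and whitespace.
# LC_ALL=C makes grep work on single bytes instead of multi-byte characters.
drop_ascii_only() {
    LC_ALL=C grep -v '^[[:print:][:space:]]*$'
}
```

Usage would be something like `drop_ascii_only < input.txt > kept.txt`, where both file names are placeholders. Using the character classes also covers digits and punctuation, which a hand-written a-z, A-Z list would miss.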
08-28-2005 09:35 PM
08-28-2005 10:09 PM
I think GNU grep has an option to look for characters by their ASCII values.
08-28-2005 10:22 PM
This is not related to your question, but I need some information regarding Korean language setup.
1. If I use the iconv command, it converts English to Korean, but when I try through the keyboard (I changed to Korean fonts in Windows), I get no Korean input from the keyboard (only English is accepted), so please help me with this.
08-28-2005 11:18 PM
perhaps you could instead look for a certain line structure, e.g. lines ending with a certain code not followed by any text.
For instance, based on your attachment, start by discarding lines ending in either "
- and afterwards throw away the rest, among other lines, those ending in "LS31" followed by two digits.
If that is an idea, try the below single-line example as a starting point:
# grep -vE ".*[
regards,
John K.
08-28-2005 11:51 PM
This will help me to trim down the file (maybe by 1/3). But there are also lines with an English sentence in both columns, such as the one above (hidden in the middle) in my example:
For whatever reason, lines like that didn't translate to the Asian language, so I want to get rid of them.
08-29-2005 12:51 AM
I couldn't completely understand your English, but hopefully you can understand mine. (Happy to answer your question, even though we're getting a little off-topic here.)
Unfortunately there is no freeware utility (that I know of) other than iconv for converting codepages on Windows. The standard Korean Unix codepage is "euckr", and the standard Korean Windows codepage is "949". To verify what codepage you're running under, open a CMD window and type 'chcp'.
But I don't use those old codepages. Personally I prefer utf8 for everything (locale, input, files, filesystems) since it allows me to mix different languages together.
I think your question was about configuring Windows to input Korean.
You won't need to convert any files with iconv in order to do this:
http://webhard.daewoo.com/jsp/hangul.html
08-29-2005 01:59 AM
hmmmmm, you have to know the structure and byte representation of the various languages used in your source.....
This is a guess based on a binary transfer of your attachment from which the translated line "it's a picnic" was xd'd to this result:
0000000 3c53 6567 204c 3d45 4e5f 5553 3e49 7427
0000010 7320 6120 7069 636e 6963 2e2e 2e09 3c53
0000020 6567 204c 3d4a 415f 4a50 3ee3 8394 e382
0000030 afe3 838b e383 83e3 82af e381 abe7 b5b6
0000040 e5a5 bde3 81ae e5ad a3e7 af80 e381 a7e3
0000050 8199 2e2e 2e0d 0a00
The English word "picnic" is
7069 636e 6963
in the second line.
My guess is that Japanese in your attachment uses three bytes per character, of which the first byte functions as a sort of classifier, e.g. there seems to be a systematic sequence of two bytes after "e3":
e3 83 94
e3 82 af
e3 83 8b
e3 83 83
e3 82 af
e3 81 ab
followed by these bytes:
e7 b5 b6
e5 a5 bd
e3 81 ae
e5 ad a3
e7 af 80
of which the last part forms an almost rhyme scheme-like structure (haiku?).
The point is of course that you have to know what to look for. You can xd each line and grep for significant, classifying bytes that are used for the languages only, but you have to consult your source in order to find out what that could be.
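For what it is worth, the triplets in the dump match the UTF-8 encoding, in which the first byte of a character tells how many bytes follow: a lead byte e0 to ef starts a three-byte sequence, and the two continuation bytes always lie in 80 to bf. Lead bytes e3 through ed cover U+3000 to U+D7FF, which includes kana, the CJK ideographs and Hangul syllables (plus a few other scripts). A sketch that greps for exactly those classifying bytes, assuming a grep that treats input as single bytes under LC_ALL=C; keep_cjk_lines and the variable names are made up for illustration:

```shell
#!/bin/sh
# printf takes octal escapes, so these variables hold bracket expressions
# built from raw bytes: lead bytes 0xE3..0xED and continuation bytes 0x80..0xBF.
CJK_LEAD=$(printf '[\343-\355]')
UTF8_CONT=$(printf '[\200-\277]')

# keep_cjk_lines: print the lines of stdin that contain at least one
# three-byte UTF-8 sequence whose lead byte is in the range e3..ed.
keep_cjk_lines() {
    LC_ALL=C grep "${CJK_LEAD}${UTF8_CONT}${UTF8_CONT}"
}
```

Because the range is keyed to the lead byte, accented Latin text (two-byte sequences starting with c2 to df) does not count as a match, which a plain "any non-ASCII byte" test would not distinguish.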
regards,
John K.
08-29-2005 02:51 AM
#Japanese
誕生会のご案内
0000000 aae8 e795 9f94 bce4 e39a ae81 81e3 e694
0000020 88a1 86e5 0a85
0000026
子ども会のパーティー
0000000 ade5 e390 a981 82e3 e482 9abc 81e3 e3ae
0000020 9183 83e3 e3bc 8683 82e3 e3a3 bc83 000a
0000037
結婚式二次会のご案内
0000000 b5e7 e590 9aa9 bce5 e48f 8cba ace6 e4a1
0000020 9abc 81e3 e3ae 9481 a1e6 e588 8586 000a
0000037
#Korean
생일 파티
0000000 83ec ec9d bc9d ed20 8c8c 8bed 0ab0
0000016
휴대폰이 너무 좋아요
0000000 9ced ebb4 808c 8fed ecb0 b49d eb20 8884
0000020 aceb 20b4 a2ec ec8b 8495 9aec 0a94
0000036
청첩장
0000000 b2ec ecad a9b2 9eec 0aa5
0000012
#Chinese
我爱我的电话
0000000 88e6 e791 b188 88e6 e791 849a 94e7 e8b5
0000020 9daf 000a
0000023
收信人邮政编码
0000000 94e6 e4b6 a1bf bae4 e9ba ae82 94e6 e7bf
0000020 96bc a0e7 0a81
0000026
收件人办公室电话号码
0000000 94e6 e4b6 b6bb bae4 e5ba 9e8a 85e5 e5ac
0000020 a4ae 94e7 e8b5 9daf 8fe5 e7b7 81a0 000a
0000037
Although I'm not really sure what to look for.
08-29-2005 05:07 AM
This is a typographical arrangement of the od -x output for the Korean version of "I love my phone!" on my system:
ed 9c b4
eb 8c 80
ed 8f b0
ec 9d b4
20
eb 84 88
eb ac b4
20
ec a2 8b
ec 95 84
ec 9a 94
21 20 0d
0a 00
My guess is that the characters are represented by byte triplets starting with either "eb", "ec", or "ed"; "20" is probably space and "21" the exclamation mark - it might very well be wrong, but you have to ask your supplier.
Since you did not get something similar, I suspect yours was not performed as a binary download.
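Another possible explanation for the different-looking dumps, one that does not involve the download: od -x prints two-byte words in the byte order of the machine, so on a little-endian PC each pair comes out swapped (the bytes e8 aa appear as aae8), while a big-endian HP-UX box shows them in file order. Dumping one byte at a time sidesteps the difference. A sketch, assuming a POSIX od; hex_bytes is a made-up helper name:

```shell
#!/bin/sh
# hex_bytes: dump stdin as single hex bytes in file order, on one line.
#   od -An   suppresses the offset column
#   od -tx1  prints one byte per group, so endianness no longer matters
# tr and sed only join od's output lines and trim the outer spaces.
hex_bytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# One Hangul syllable, a space and an exclamation mark.
printf '\355\234\264 !' | hex_bytes   # prints ed 9c b4 20 21
echo
```

Run on the same line of the file, this should give the same output on both systems, which makes the dumps directly comparable.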
regards,
John K.
08-29-2005 02:10 PM
Still looking for a solution on this one..
08-29-2005 05:31 PM
Thanks for your reply. I will explain more clearly.
1. I have Korean and Japanese databases on HP-UX 11.11, and I enter input from the keyboard (I don't have a Korean keyboard; I change the font in the terminal from Windows). When I use Japanese it works, but when I try Korean it does not work. Please help me out with this, or you can give me your Yahoo ID.
08-29-2005 08:55 PM
On Yahoo I am known as eekarum88, give me a call if you want. Did you look at the site I linked? It explains pretty well how to set up Korean on a Windows box.
My original question still stands:
1. There are two columns in my file, usually with ASCII in the left column and CJK in the right column.
2. But sometimes there is ASCII English or nothing but a
3. I'm only interested in keeping lines with a valid English sentence and its CJK translation.
I had thought grep could do this, but maybe not. Would Perl be the answer?
08-29-2005 10:02 PM
Solution
Try this script, using your input file as $1:
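#!/usr/bin/sh
# Sort the lines of the input file ($1) into ascii_yt (ASCII only) and
# strange_yt (anything else).
while IFS= read -r line
do
    # Strip "=" and all whitespace (including a trailing CR) before comparing.
    LINE_SAVE="$(printf '%s\n' "$line" | tr -d '=' | tr -d '[:space:]')"
    # strings(1) passes only runs of at least 4 printable characters, so its
    # output differs from the input when other bytes (or very short lines) occur.
    STRINGS_LINE="$(printf '%s\n' "$LINE_SAVE" | strings)"
    if [ "$LINE_SAVE" = "$STRINGS_LINE" ]
    then
        printf '%s contains ascii only\n' "$line"
        printf '%s\n' "$line" >> ascii_yt
    else
        printf '%s contains strange chars\n' "$line"
        printf '%s\n' "$line" >> strange_yt
    fi
done < "$1"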
and check the generated ascii_yt and strange_yt files.....
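If strings is not at hand, or the column layout should be part of the test, a single grep can express the original requirement more directly. This is only a sketch under assumptions: the columns are separated by one tab, the English column is plain printable ASCII, and any byte with the high bit set in the rest of the line counts as translated text. keep_translated is a made-up helper name.

```shell
#!/bin/sh
# keep_translated: print the lines of stdin whose first column is non-empty
# printable ASCII and whose remainder contains at least one non-ASCII byte.
# printf expands \t to a literal tab and \200, \377 to the raw bytes 0x80 and
# 0xFF; LC_ALL=C makes grep compare single bytes.
keep_translated() {
    LC_ALL=C grep "$(printf '^[ -~][ -~]*\t.*[\200-\377]')"
}
```

Called as `keep_translated < input.txt > translated.txt` (placeholder names), this would drop lines with English in both columns as well as lines whose second column is empty, in one pass and without the temporary ascii_yt and strange_yt files.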
regards,
John K.
08-30-2005 09:11 AM