Operating System - HP-UX

REMOVING UNICODE CHARACTERS from file

 
MBacc
Occasional Contributor

Does anyone know how to remove unicode characters from a file in unix?
Steven E. Protter
Exalted Contributor

Re: REMOVING UNICODE CHARACTERS from file

Shalom,

It would help to know which characters specifically, and how they got there. Samba? An FTP transfer? Email as an attachment? If so, how was the file transmitted?

dos2unix

See the man page, it might help.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Matti_Kurkela
Honored Contributor

Re: REMOVING UNICODE CHARACTERS from file

Your problem can be re-phrased as "remove everything that is not an ASCII control character or an ASCII printable character".

When a problem is presented in this way, it's easy to find a solution using the standard "tr" command.

Example: file.utf8 contains UTF-8 encoded Unicode characters, and file.txt will be the stripped version.

export LC_ALL=C
tr -dc '[:cntrl:][:print:]' < file.utf8 > file.txt
unset LC_ALL

Setting the environment variable LC_ALL to C for the duration of this command is important: it explicitly switches off the Unicode support and tells tr that only ASCII characters are considered to be "printable".

This command can also be run as a one-liner:

LC_ALL=C tr -dc '[:cntrl:][:print:]' < file.utf8 > file.txt
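
To see what the one-liner does, here is a small sketch you can try on a throwaway file. The file names demo.utf8 and demo.txt and the sample text are made up for illustration; it assumes a POSIX shell with printf, tr and wc available.

```shell
# Hypothetical sample: "café au lait", where é is the two UTF-8 bytes 0xC3 0xA9.
printf 'caf\303\251 au lait\n' > demo.utf8

# Count the non-ASCII bytes before stripping (delete bytes 0-127, count the rest).
LC_ALL=C tr -d '\000-\177' < demo.utf8 | wc -c

# Strip everything that is not an ASCII control or printable character.
LC_ALL=C tr -dc '[:cntrl:][:print:]' < demo.utf8 > demo.txt

# Show the result and confirm that no non-ASCII bytes remain.
cat demo.txt
LC_ALL=C tr -d '\000-\177' < demo.txt | wc -c
```

Note that the multi-byte character is deleted outright, not replaced with a look-alike: "café" comes out as "caf". The newline survives because it is an ASCII control character.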

MK