Operating System - HP-UX

Strange characters in text file

 
Carl Houseman
Super Advisor

Strange characters in text file

Just this week started getting complaints about printouts generated by Oracle. Nothing should have changed in Oracle-ville so I'm trying to diagnose this from an HP-UX (11.11) perspective for now.

This is the weird part. The file contains several lines where a dash appears with a space on either side. In just ONE of those occurrences in the entire file, the dash appears, still with spaces on both sides, as:

ΓÇô

If I put it through more, this what-should-be ONE dash character appears as:

M-bM-^@M-^S

Can anyone interpret that? And here's the kicker. If I ftp it to another HP-UX system, the resulting file has the same defect. But if I use ftp to copy it to a Windows PC (pulling the file from HP-UX), in either ascii or binary mode, the dash appears as a dash when viewed in notepad.

As they say, WTF ?
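
The `more` output above is "meta" notation for bytes with the high bit set: `M-` means "add 0x80", and `^X` is the usual control-character notation. So `M-b` is 0x80 + 0x62 = 0xE2, `M-^@` is 0x80 + 0x00 = 0x80, and `M-^S` is 0x80 + 0x13 = 0x93. A minimal sketch that reproduces the notation, assuming the three bytes are E2 80 93 and that `cat` supports the common (non-POSIX) `-v` option:

```shell
# The three bytes E2 80 93, written as octal escapes for portability.
# LC_ALL=C keeps cat from treating them as one multibyte character.
shown=$(printf '\342\200\223' | LC_ALL=C cat -v)
echo "$shown"   # prints M-bM-^@M-^S
```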
14 REPLIES
Carl Houseman
Super Advisor

Re: Strange characters in text file

OK, I used hexedit and found the 3 characters where a dash should be are:

E2 80 93

which is UTF8 for dash.
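
For reference, the code point can be worked out by hand. A three-byte UTF-8 sequence has the layout 1110xxxx 10xxxxxx 10xxxxxx, so the payload bits are the low 4 bits of the first byte and the low 6 bits of each following byte. A small sketch of that arithmetic (the variable names are only for illustration):

```shell
# Bytes as reported by hexedit.
b1=0xE2
b2=0x80
b3=0x93

# Strip the UTF-8 marker bits and join the payload bits.
cp=$(( (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F) ))

printf 'U+%04X\n' "$cp"   # prints U+2013
```

U+2013 is EN DASH in Unicode.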

New problem, if I try to convert the file using:

iconv -f utf8 -t iso81 filename>newfile

the E28093 and other UTF8 sequences are changed to 1A (^Z).

Is this the best iconv can do?
Dennis Handly
Acclaimed Contributor

Re: Strange characters in text file

>found the 3 characters where a dash should be are: E2 80 93
>which is UTF8 for dash.

I don't see that. In /usr/lib/nls/loc/charmaps/utf8.cm I see
\xe2\x80\x95 #2015
\xc2\xad #00AD

>the E28093 and other UTF8 sequences are changed to 1A (^Z).

This must be the "galley character".
Andrew C Fieldsend
Respected Contributor

Re: Strange characters in text file

This looks suspiciously like issues with data entered into Oracle from a Windows client - I've seen something similar before.

Some Windows applications (MS Word in particular) will automatically change a hyphen to a long dash (or em-dash) for typographical reasons, and then save that in UTF-8. If that then goes into an Oracle database which is running in UTF-8, you could get similar problems to this. (I've actually seen this happen.)

The presence of ctrl-Z hints at Windows/DOS as well, since ctrl-Z was the EOF marker in DOS, and still crops up occasionally for historical reasons.
Dennis Handly
Acclaimed Contributor

Re: Strange characters in text file

>Andrew: The presence of ctrl-Z hints at Windows/DOS as well

No, in this case it comes from iconv(1) as Carl said. Probably because ^Z is SUB.
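
Since 0x1A is the ASCII SUB control character, counting those bytes in the converted file gives a rough idea of how many characters iconv could not map. A minimal sketch using only standard tools; `count_sub` is a hypothetical helper name, and it assumes the original text contained no ^Z of its own:

```shell
# count_sub FILE: print the number of 0x1A (SUB, ^Z) bytes in FILE.
count_sub() {
    # Delete every byte except octal 032 (0x1A), then count what is left.
    LC_ALL=C tr -cd '\032' < "$1" | wc -c | tr -d ' '
}
```

For example, `count_sub newfile` after the iconv run shown earlier in the thread.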
Andrew C Fieldsend
Respected Contributor

Re: Strange characters in text file

Sorry Dennis, I misread the ^Z part.
Carl Houseman
Super Advisor

Re: Strange characters in text file

Dennis, thanks for the tip about utf8.cm. These are the UTF8 chars in the file which iconv doesn't handle and which aren't in utf8.cm:

(en dash, e28093)
http://www.fileformat.info/info/unicode/char/2013/index.htm

(right double quote, e2809d)
http://www.fileformat.info/info/unicode/char/201d/index.htm

Andrew, I can believe the part about Windows client - probably copy/pasted from word or some such thing.

So now I guess I'm looking for a patch to make iconv handle these and possibly other missing characters. Only problem is figuring out which one. Any hints welcome.

BTW I found this:
http://www.docs.hp.com/en/5991-1194/5991-1194.pdf

which clearly talks about EN DASH at e28093, but makes no mention of e2809d. It also talks about patches but doesn't identify them in any meaningful way.
Carl Houseman
Super Advisor

Re: Strange characters in text file

Dennis, thanks for the tip about utf8.cm. These are the UTF8 chars in the file which iconv doesn't handle and which aren't listed in utf8.cm:

(en dash, e28093)
http://www.fileformat.info/info/unicode/char/2013/index.htm

(right double quote, e2809d)
http://www.fileformat.info/info/unicode/char/201d/index.htm

Andrew, I can believe the part about Windows client - probably copy/pasted from word or some such thing.

So now I guess I'm looking for a patch to make iconv handle these and possibly other missing utf8 characters. Only problem is figuring out which patch(es) address the problem. Any hints welcome. Searching on "utf8" in the Patch Database found matches but none specifically talk about adding missing utf8 characters. And searching on "utf8.cm" didn't produce anything.

BTW I found this:
http://www.docs.hp.com/en/5991-1194/5991-1194.pdf

which clearly talks about EN DASH at e28093, but makes no mention of e2809d. It also talks about patches but doesn't identify them in any meaningful way.
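
To find out which other non-ASCII sequences are hiding in a file, one option is to list every distinct run of bytes >= 0x80 in hex. This is only a sketch: `list_high_bytes` is a hypothetical helper, it relies on `tr` extending the short replacement string (the classic `tr -cs` idiom), and two adjacent multibyte characters will show up as one longer run.

```shell
# list_high_bytes FILE: print each distinct run of non-ASCII bytes in hex.
list_high_bytes() {
    # Turn every run of ASCII bytes into a single newline, so each
    # remaining line is one run of bytes >= 0x80.
    LC_ALL=C tr -cs '\200-\377' '\n' < "$1" |
    LC_ALL=C sort -u |
    while IFS= read -r seq; do
        # Skip the empty line produced when the file starts with ASCII.
        if [ -n "$seq" ]; then
            printf '%s' "$seq" | od -An -tx1
        fi
    done
}
```

Each output line can then be checked against utf8.cm or a Unicode chart.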
Marco A.
Esteemed Contributor

Re: Strange characters in text file

Hello,


It looks like the file still carries Windows formatting (as usual), so it needs to be converted to something Unix can read. Have you tried the dos2ux command?

dos2ux is useful when transferring files between different OSs, etc.

Syntax:

# dos2ux weirdfile > ux_formatedfile

After that, check ux_formatedfile; it may now be in a format that Unix recognizes.

Try it and let us know.

Regards,

Marco
Just unplug and plug in again ....
Carl Houseman
Super Advisor

Re: Strange characters in text file

Marco, if you read my last followup, you'll see that file transfers are not the source of the problem.
Marco A.
Esteemed Contributor

Re: Strange characters in text file

Hmm, but have you tried the command? I've had the same issue many times with files from other applications.
All those ^Z, ^M, etc. are Windows "chars". When you transfer the file to a Windows OS, the system recognizes what it has to recognize and hides those markers throughout the document (spaces, line breaks, tabs, etc.).

Dos2ux did help me with that, but if that's not the issue, let me review more.

Regards,

Marco
Just unplug and plug in again ....
Carl Houseman
Super Advisor

Re: Strange characters in text file

Marco, once again, please read the prior posts carefully. Another poster already made the incorrect assumption that "^Z comes from Windows" and was corrected about that. You're on the wrong track here.

What I need now are the patch numbers that add these missing utf8 characters to HP-UX 11.11.
Dennis Handly
Acclaimed Contributor

Re: Strange characters in text file

>So now I guess I'm looking for a patch to make iconv handle these and possibly other missing characters. Only problem is figuring out which one.

(Thanks for the links.)

Well, I'm not sure what good it would do. iconv(1) will only work if both charmaps are changed. Of course utf8.cm should have them for completeness. You should contact the Response Center and file an enhancement.
(It hasn't changed for 11.31 either.)

>which clearly talks about EN DASH at e28093, but makes no mention of e2809d.

Yes.

>So now I guess I'm looking for a patch to make iconv handle these and possibly other missing utf8 characters. Only problem is figuring out which patch(es) address the problem.

I found PHCO_29903
11.11 iconv cumulative patch
But it may not help you.

>Searching on "utf8" in the Patch Database found matches but none specifically talk about adding missing utf8 characters. And searching on "utf8.cm" didn't produce anything.

Right. Though they could fix the shared libs but not the charmaps?

You do know that you can create your own maps?
See genxlt(1), dmpxlt(1) and iconv(3C)
$ dmpxlt /usr/lib/nls/iconv/tables/ucs2=iso81
shows:
#What: A.10.02 $ucs2 =) iso81
#Galley: 0X1a

Your iconv(1) command seems to open:
/usr/lib/nls/iconv/hpux32/tables.1/ucs2=iso81

I don't see any but the 8 bit identity translations, even in the raw binary file.
Carl Houseman
Super Advisor

Re: Strange characters in text file

Thanks Dennis - I really appreciate the info about 11.31 not listing these characters. That pretty much confirms that HP-UX isn't up to date on utf8 characters, and there probably isn't a patch available.

Meanwhile I'm going to wait and see if the workaround (don't paste from Word) becomes too much for users to bear before going the next mile for a solution.
Hein van den Heuvel
Honored Contributor

Re: Strange characters in text file

>> Meanwhile I'm going to wait and see if the workaround (don't paste from Word)

Instead of, or in addition to, iconv, how about a workaround in the form of a quick sed or perl filter for the file?

Untested!! Maybe something like:

perl -pe 's/\xe2\x80\x93/-/g; s/\xe2\x80\x9d/"/g' old > new

fwiw,
Hein.
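
A sed variant of the same idea might look like the sketch below. It targets the two sequences reported earlier in the thread as unhandled by iconv (E2 80 93, en dash, and E2 80 9D, right double quote). The function and variable names are hypothetical, and any other Word substitutions in the data would need their own rules.

```shell
# Build the raw byte sequences with printf, since not every sed
# understands \x escapes.
en_dash=$(printf '\342\200\223')   # E2 80 93
rdquote=$(printf '\342\200\235')   # E2 80 9D

# fix_punct: filter stdin to stdout, replacing the two sequences with
# plain ASCII. LC_ALL=C makes sed work byte by byte.
fix_punct() {
    LC_ALL=C sed -e "s/$en_dash/-/g" -e "s/$rdquote/\"/g"
}
```

Usage would be `fix_punct < old > new`, run before (or instead of) the iconv step.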