Re: iconv () ISO8859-5 -> UCS-2 byte order = bug or feature ?

 
SOLVED
Ruslan Laishev
Occasional Advisor

iconv () ISO8859-5 -> UCS-2 byte order = bug or feature ?

Hi All!

 

#include  <stdio.h>
#include  <iconv.h>
#include  <errno.h>

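int  main()
{
int     status;
iconv_t  cd;
/* "Тест СМС" as ISO 8859-5 bytes; hex escapes keep the literal independent of the source file encoding */
char    sts[] = "\xC2\xD5\xE1\xE2 \xC1\xBC\xC1", *inbuf = sts;
char    buf [ 128 ],*outbuf = buf;
size_t  inbytesleft = sizeof(sts)-1,
        outbytesleft = sizeof(buf);

//      if ( (iconv_t)-1 == (cd = iconv_open("UTF-8","ISO8859-5")) )
        /* iconv_open() returns (iconv_t)-1 on failure, not a null pointer */
        if ( (iconv_t)-1 == (cd = iconv_open("UCS-2","ISO8859-5")) )
        {
                perror("iconv_open");
                return 1;
        }

        /* iconv() returns (size_t)-1 on failure */
        if ( (size_t)-1 == iconv (cd,&inbuf,&inbytesleft,&outbuf,&outbytesleft) )
                perror("iconv");

        /* Dump the converted bytes in memory order */
        for (size_t i = 0; i < (sizeof(buf) - outbytesleft); i++)
                printf("%x ",(unsigned char)buf[i]);
        printf("\n");

        if ( 0 > (status = iconv_close(cd)))
                perror("iconv_close");

        return 0;
}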

 

$ cc iconv
$ link iconv

$ r iconv
22 4 35 4 41 4 42 4 20 0 21 4 1c 4 21 4

 

I expected:

4 22 4 35 ...

 

What do you think?

 

 

8 REPLIES
H.Becker
Honored Contributor

Feature, aka little endian: UCS-2 is a sequence of 16-bit code units, and here each unit is stored in the host's byte order, low byte first.

 

Print the 16-bit codes and see if that comes closer to what you expect:

for (int i = 0; i < (sizeof(buf) - outbytesleft); i+=2)
        printf("U+%04X ",*(unsigned short*)(buf+i));
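
The same check can be written without the pointer cast, which also avoids any alignment concerns. This is only a minimal sketch; the helper names `ucs2_unit_le`, `ucs2_unit_be` and `print_ucs2_le` are made up for illustration and are not part of any library:

```c
#include <stdio.h>
#include <stddef.h>

/* Read one 16-bit code unit stored low byte first (little endian). */
unsigned ucs2_unit_le(const unsigned char *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

/* Read one 16-bit code unit stored high byte first (big endian). */
unsigned ucs2_unit_be(const unsigned char *p)
{
    return ((unsigned)p[0] << 8) | (unsigned)p[1];
}

/* Print every code unit of a little-endian UCS-2 buffer as U+XXXX. */
void print_ucs2_le(const unsigned char *buf, size_t nbytes)
{
    for (size_t i = 0; i + 1 < nbytes; i += 2)
        printf("U+%04X ", ucs2_unit_le(buf + i));
    printf("\n");
}
```

The first two bytes from the run above, 22 04, decode to U+0422 when read low byte first and to 0x2204 when read high byte first, so the byte dump and the expected code point agree once the byte order is taken into account.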

 

Ruslan Laishev
Occasional Advisor

I need to send the encoded block over the network, so I don't want to play with printf(). :-)

Hoff
Honored Contributor

Check the encoding of the characters stored in your input buffer.

 

Files on VMS are typically encoded in DEC MCS or ISO Latin-1 (ISO 8859-1), and AFAIK not in ISO 8859-5.

 

I don't have the ISO 8859-5 encoding available on the OpenVMS system I'm testing with, though the Unix box (below) has it.  The VMS test (below) is with ISO-8859-1 encoding.

 

Here's the C code I'm using, with the display corrupted irrespective of the code insertion box used.

 

#include  <stdio.h>
#include  <stdlib.h>
#include  <string.h>
#include  <iconv.h>
#include  <errno.h>

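/* Print each byte of a buffer as two hex digits, in memory order */
void PrintString( const char *TheString, size_t TheStringLen ) {
size_t i;
        for (i = 0; i < TheStringLen; i++)
                printf("%2.2x ",(unsigned char)TheString[i]);
        printf("\n");
}

int  main()
{
#define OUTBUFLEN 128
int     status;
iconv_t  cd;
/* The literal did not survive posting; these are the bytes the Unix runs print for it */
char    *inbuf = "\xc3\x92\xc3\xa5\xc3\xb1\xc3\xb2 \xc3\x91\xc3\x8c\xc3\x91";
char    *inbufp = inbuf;
char    outbuf[OUTBUFLEN];
char    *outbufp = outbuf;
/* The "- 1" leaves the last input byte unconverted, so the output is one character short */
size_t  inbytesleft = strlen( inbuf ) - 1;
size_t  outbytesleft = sizeof(outbuf);

        PrintString( inbuf, strlen( inbuf ) );

        // if ( (iconv_t)-1 == (cd = iconv_open("UTF-8","ISO8859-1")) )
        /* iconv_open() returns (iconv_t)-1 on failure, not a null pointer */
        if ( (iconv_t)-1 == (cd = iconv_open("UCS-2","ISO8859-5")) ) {
                perror("iconv_open");
                exit( EXIT_FAILURE );
        }

        /* iconv() returns (size_t)-1 on failure */
        if ( (size_t)-1 == iconv (cd,&inbufp,&inbytesleft,&outbufp,&outbytesleft) )
                perror("iconv");

        PrintString( outbuf, OUTBUFLEN - outbytesleft );

        if ( (status = iconv_close(cd)))
                perror("iconv_close");

        exit( EXIT_SUCCESS);
}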

 

$! OpenVMS
$ cc x
$ link x
$ run x
43 43 25 43 31 43 32 20 43 43 0c 43 
43 00 43 00 25 00 43 00 31 00 43 00 32 00 20 00 43 00 43 00 0c 00
 
$# Unix, ISO8859-5
$ cc -arch x86_64 -lc -liconv  x.c
$ ./a.out
c3 92 c3 a5 c3 b1 c3 b2 20 c3 91 c3 8c c3 91 
04 23 00 92 04 23 04 05 04 23 04 11 04 23 04 12 00 20 04 23 00 91 04 23 00 8c 04 23 

$# Unix, ISO8859-1
$ ./a.out
c3 92 c3 a5 c3 b1 c3 b2 20 c3 91 c3 8c c3 91 
00 c3 00 92 00 c3 00 a5 00 c3 00 b1 00 c3 00 b2 00 20 00 c3 00 91 00 c3 00 8c 00 c3

 

Yes, it's my old friend "Your post has been changed because invalid HTML was found in the message body. The invalid HTML has been removed. Please review the message and submit the message when you are satisfied."   Which means I can't select the Preview here before posting, as this software is unable to accomplish the removal of whatever HTML instantiated itself.  

 

It would NOT surprise me to learn it was the results of the Insert Code dialog box at fault, as that got stripped out.

 

Oh, and it's not just Preview that's broken.  That spellcheck stuff is broken; you can't use the X to close the dialog box, but it does cause the contents of the box to go blank.

 

Oh, and it's not just Preview and Spellcheck that are broken.  Yes, once I got to the Preview, the code inclusion mechanism is broken, too.  And two consecutive code insertion boxes get coalesced into one.

 

This forum software is just hilariously bad.

 

Having spent far too much time trying to work around the many limits of this forum software, I have also attached the C code and the example output; hopefully the ASCII file attachment won't get corrupted here.

H.Becker
Honored Contributor

> I need to send the encoded block over the network, so I don't want to play with printf(). :-)

Aha, that's what you "expected": network byte order, which is big endian. Sounds like you need htons() (host to network short), one of the functions that "convert multi-byte integer types from host byte order to network byte order".
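
For what it's worth, here is a rough sketch of that htons() pass over a buffer of UCS-2 code units before it goes on the wire. The function name `ucs2_host_to_network` is invented for this example, and the `<arpa/inet.h>` header is an assumption based on POSIX; the header that declares htons() may be named differently on a given OpenVMS TCP/IP setup.

```c
#include <arpa/inet.h>  /* htons(); POSIX location, assumed here */
#include <stdint.h>
#include <string.h>

/* Convert a buffer of host-order 16-bit code units to network (big-endian)
   order in place. A trailing odd byte, if any, is left untouched. */
void ucs2_host_to_network(unsigned char *buf, size_t nbytes)
{
    for (size_t i = 0; i + 1 < nbytes; i += 2) {
        uint16_t unit;
        memcpy(&unit, buf + i, sizeof unit);   /* read one unit in host order */
        unit = htons(unit);                    /* no-op on big-endian hosts */
        memcpy(buf + i, &unit, sizeof unit);   /* store it in network order */
    }
}
```

Called on the iconv() output with `sizeof(buf) - outbytesleft` as the length, this should turn the 22 04 35 04 ... dump into 04 22 04 35 ... on a little-endian host and leave the buffer unchanged on a big-endian one.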

Ruslan Laishev
Occasional Advisor

Hi, Hoff!

 

So on Unix and VMS we see different byte orders. Is this the result of a difference in endianness or a difference in the C run-times?

 

Thanks!

 

Ruslan Laishev
Occasional Advisor


@H.Becker wrote:

> > I need to send the encoded block over the network, so I don't want to play with printf(). :-)
>
> Aha, that's what you "expected": network byte order, which is big endian. Sounds like you need htons() (host to network short), one of the functions that "convert multi-byte integer types from host byte order to network byte order".


I already call htons() before sending the block, but I want to understand the nature of the phenomenon. :-)

 

Thanks!

Hoff
Honored Contributor

htons() and ntohs() would work, though the External Data Representation (XDR) routines are built for this byte-swizzling stuff.
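
Another route, sketched here under a clearly stated assumption, is to ask iconv() for an explicit byte order so that no swapping is needed afterwards. The encoding name "UCS-2BE" is accepted by GNU libiconv and glibc; whether the OpenVMS C RTL recognizes it is not established anywhere in this thread, so the sketch simply reports failure when iconv_open() rejects the name. The helper name `cyrillic_to_ucs2be` is hypothetical.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert ISO 8859-5 bytes to big-endian UCS-2.
   Returns the number of output bytes written, or -1 on any failure. */
long cyrillic_to_ucs2be(const char *in, size_t inlen, char *out, size_t outlen)
{
    /* "UCS-2BE" is an assumption: not every iconv implementation knows it */
    iconv_t cd = iconv_open("UCS-2BE", "ISO8859-5");
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;      /* iconv() does not modify the input bytes */
    char *outp = out;
    size_t inleft = inlen, outleft = outlen;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    return (long)(outlen - outleft);
}
```

Where the name is supported, the output is already in network order regardless of the host, which makes the htons() pass unnecessary.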

Hoff
Honored Contributor
Solution

> So on Unix and VMS we see different byte orders. Is this the result of a difference in endianness or a difference in the C run-times?

 

The C standard doesn't specify endianness.  Unix-compliant platforms can be big-endian, little-endian, or bi-endian.  VMS itself and its applications are little-endian.  In one case, however, VMS runs as a little-endian guest on a platform and host operating system that are big-endian.   Here is some reading.   For the inevitable "fun" that can arise with Unicode and friends, also go read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).  And also some VMS-related reading here and here.
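
Since the language itself leaves byte order to the platform, a program that cares can probe it at run time. A minimal sketch, with the function name `host_is_little_endian` chosen only for this example:

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 when the host stores the low-order byte first (little endian),
   0 otherwise. */
int host_is_little_endian(void)
{
    uint16_t probe = 0x0422;          /* U+0422, the first character in the thread */
    unsigned char bytes[2];
    memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x22;
}
```

On a little-endian host such as VMS this returns 1, which matches the 22 04 ordering in the original dump; on a big-endian host it returns 0.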

 

edit: highlighted the quote with >, as the forum software here ate the italics. 

edit2: the forum software graciously added a few control characters onto one of the URLs.