Brian Duddy
Occasional Advisor

unicode character set

I want to process Unicode format files, specifically 16-bit Unicode, on an Alpha running VMS V7.2-1, but the current character set does not support this.
Is there a patch that would upgrade the character set on VMS V7.2-1, or is there another workaround?

Thanks

Brian
Bojan Nemec
Honored Contributor

Re: unicode character set

Brian,

What do you mean by "process unicode format files"?

If you mean file names in Unicode, you need an ODS-5 disk. The ODS-5 disk structure is supported from VMS 7.2. Please see http://h71000.www7.hp.com/doc/72final/6536/6536pro.html

If you mean file contents, then C and C++ (maybe others) have some support for so-called wide characters.

Can you post a more specific question?

Bojan
Antoniov.
Honored Contributor

Re: unicode character set

Brian,
Unicode support has been available since V7.2, but there is no utility or program that makes it work for you.
I guess you have to write a small C/C++ program that reads the Unicode file, passes each string to a Unicode conversion function and then writes it out as ISO Latin-1.
As for file names, as Bojan posted, you need the new ODS-5 file system, but I guess you still will not see any Unicode symbols.
The main trouble is that VMS is character-cell based, so you need DECwindows for full Unicode support.

Antonio Vigliotti
Antonio Maria Vigliotti
Craig A Berry
Honored Contributor

Re: unicode character set

As the other posters have said, you really need to say more about what you mean by "process". I've said pretty much everything I know about this subject here:

http://groups.google.com/groups?selm=6881069ab32a6ead6489d52afc32764a%40news.teranews.com&output=gplain

I don't know if character conversion per se is what you're interested in, but you can find out what conversions you already have by looking here:

$ directory sys$i18n_iconv:
Brian Duddy
Occasional Advisor

Re: unicode character set

Sorry, I didn't want to get into too much detail. I will be receiving text files with names and addresses via FTP. I have a test file now, sent by email, which I can look at in TextPad on a PC; as soon as I FTP the file to VMS, one of the chars becomes a backwards ?
These Unicode/wide chars will be in the file body, not in the name.
What I need to do is very simple: read this file on the Alpha using VAX BASIC and store it on file. At some point I will have to retrieve this info, again with VAX BASIC, and output it to a text/CSV file to print letters or send back to the client. Obviously, with the chars disappearing as soon as I FTP to the Alpha, this is a problem.

thanks

brian
Antoniov.
Honored Contributor

Re: unicode character set

Brian,
if your text files were written with Notepad, they are not in Unicode format!
Unicode is based on a 16-bit character set and is used mainly by Java applications.
Text files from a PC use an 8-bit character set, divided into two pages: the first page (codes 0 to 127) is standard and is called ASCII; the second page (codes 128 to 255) is the national page. The commonly used pages on PCs are PC437 and PC850, while on VMS the common page is ISO Latin-1.
Before converting you need to know which country is set on the PC.
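
To illustrate the code-page point, here is a small sketch in Python (not a language anyone in this thread is using; it is only a convenient way to look at byte values). It decodes one and the same byte under the three pages named above. The byte 0xEF is an arbitrary example picked for this sketch.

```python
# One byte above 127 means different things under different 8-bit code pages.
raw = bytes([0xEF])  # example byte, chosen only for illustration
pages = ("cp437", "cp850", "latin-1")

for page in pages:
    print(page, "->", raw.decode(page))

# Bytes 0..127 decode identically under all three pages (plain ASCII).
ascii_raw = b"Brian"
decoded = {page: ascii_raw.decode(page) for page in pages}
print(decoded)
```

This is why a conversion needs to know which page the sending PC actually used: the byte alone does not say.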

HTH
Antonio Vigliotti
Antonio Maria Vigliotti
Brian Duddy
Occasional Advisor

Re: unicode character set

Antonio, thanks for your help.

In the small test file I have, VMS cannot handle the following chars: wï
where the w is a Welsh w with circumflex (not shown here) and the second is an i with two dots above. Due to new European legislation, our client, and therefore I, must be able to process any European char, and possibly Japanese or Chinese chars.

Let me correct a previous error on my part: the files are in UTF-8 format. They are produced by a British client's system that is apparently UTF-8 compliant, and even when I open the file in TextPad I still do not see these two chars as described. They appear as
à µà ¯ and when I then transfer the file to VMS it doesn't like the last char, but I have only now noticed that these should have been wï as described above.

hope this makes sense
Brian Duddy
Occasional Advisor

Re: unicode character set

Thanks for your help so far.

Due to new European legislation, our client, and therefore I, must be able to process any European char, and possibly Japanese or Chinese chars.

The small test file is supposed to have the following chars: wï
where the w is a Welsh w with circumflex (not shown here) and the second is an i with two dots above.

Let me correct a previous error on my part: the files are in UTF-8 format. They are produced by my British client's system, which is apparently UTF-8 compliant. Also, even when I open the email attachment on my PC I still do not see these two chars as they were described to me. They appear as
à µà ¯ and when I then transfer the file to VMS it doesn't like the last char. I have only just realised that these should have been wï as described above. I have tried to open the file in Word, TextPad and Internet Explorer but cannot see these chars as described.

Therefore my problem has got worse: I cannot look at these chars on my PC, and I know that VMS will not be able to handle them either.

hope this makes sense
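
A possible explanation for the four odd characters, sketched in Python purely for illustration: if the two characters are stored as UTF-8 and a viewer reads the bytes with a one-byte Windows code page, each character shows up as two. The choice of cp1252 as the viewer's code page is an assumption; the thread does not say what the PC programs were set to.

```python
# The two characters from the post: w-circumflex (U+0175) and i-diaeresis (U+00EF).
text = "\u0175\u00ef"

utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))   # the four bytes that actually travel in the file

# A viewer that assumes a one-byte code page shows four characters instead of two.
misread = utf8_bytes.decode("cp1252")
print(misread)

# Decoding the same bytes as UTF-8 recovers the original two characters.
recovered = utf8_bytes.decode("utf-8")
print(recovered == text)
```

If that is what happened, the file itself is intact and only the display is wrong.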
Antoniov.
Honored Contributor

Re: unicode character set

Brian,
representing character sets is complex work :-(
The big trouble is the display device. On an old VT you can display only some character sets (usually accented, Greek and Cyrillic letters and some others), and you cannot display them all together.
So if you need to view all characters (including Japanese and Chinese symbols) you MUST use a graphical workstation running DECwindows.
You met the same problem in this thread when you tried to display the symbol "the w is a welsh w character with circumflex".

Beyond that, you MUST use a Unicode (16-bit) representation instead of the classic 8-bit one.

The alternative is to change the character set of the display device for each specific requirement, but you will go crazy converting from one char set to another :-O

Antonio Vigliotti
Antonio Maria Vigliotti
Brian Duddy
Occasional Advisor

Re: unicode character set

We use PCs running SmarTerm terminal emulation software on Windows NT or XP.

You mention using C or C++ to translate the chars into ISO Latin-1. This sounds interesting, as some sort of conversion seems the only way forward. Are there any tools out there, or do you have any links on the subject?

thanks

Brian
Bojan Nemec
Honored Contributor
Solution

Re: unicode character set

Brian,

To convert between codesets there is a VMS command, ICONV CONVERT. You must install the VMSI18N kit, which is on the Layered Products CD or on the Operating System CD. This kit supplies a lot of converters. You can see them with $ DIR SYS$I18N_ICONV . The names are in the form from_to.ICONV (in your case from will probably be UTF-8). Now you can convert files with:

$ ICONV CONVERT /FROMCODE=from /TOCODE=to -
infile outfile

Please do $ HELP ICONV for more.
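
One thing to watch when choosing the target codeset, sketched here in Python rather than with the VMS utility: ISO Latin-1 has no w with circumflex, so a UTF-8 to Latin-1 conversion must either fail or substitute for that character. The helper name `convert_bytes` and the codec names are Python's, not ICONV codeset names; how ICONV CONVERT itself reports unmappable characters is not covered here.

```python
def convert_bytes(data: bytes, fromcode: str, tocode: str, errors: str = "strict") -> bytes:
    """Re-encode data from one codeset to another (hypothetical helper)."""
    return data.decode(fromcode).encode(tocode, errors=errors)

source = "\u0175\u00ef".encode("utf-8")   # w-circumflex, i-diaeresis as UTF-8

# ISO Latin-1 has i-diaeresis but no w-circumflex, so a strict conversion fails.
try:
    convert_bytes(source, "utf-8", "iso8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# With substitution, the unmappable character becomes "?".
lossy = convert_bytes(source, "utf-8", "iso8859-1", errors="replace")

# ISO 8859-14 (Latin-8, Celtic) covers both characters, one byte each.
celtic = convert_bytes(source, "utf-8", "iso8859-14")
print(latin1_ok, lossy, celtic)
```

So for Welsh names an 8-bit target other than Latin-1 may be needed, and for Japanese or Chinese no single-byte target will do.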

Bojan
Craig A Berry
Honored Contributor

Re: unicode character set

Antoniov, Unicode does not necessarily mean 16 bit. It includes 8-bit, 16-bit, and even 32-bit representations. As Brian has now clearly stated, he's got UTF-8, which is a varying-width 8-bit representation. Basically you get a stream of bytes, but some characters are longer than one byte. As long as you don't have to think about characters and can just think about bytes, you can transmit and store UTF-8 without giving much thought to the fact that it's Unicode. That's not a coincidence; UTF-8 was designed to have the least impact on non-Unicode-aware systems.
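
A quick way to see the varying width, as a Python sketch for illustration only. The sample characters are picked for this example: the two from Brian's file, a plain ASCII letter and one CJK ideograph.

```python
samples = {
    "w": "plain ASCII letter",
    "\u00ef": "i with diaeresis",
    "\u0175": "w with circumflex",
    "\u65e5": "a CJK ideograph",
}

for ch, label in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {label}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

# Number of UTF-8 bytes per sample character, in the order listed above.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)
```

ASCII stays one byte per character, which is why ASCII-only tools pass such data through untouched as long as they leave the other bytes alone.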

Other impact depends entirely on what you want to do with it once you've got it. If all you need to do is read data from a file, store it in a database, and then later fetch it out and send it along, you may not need to do anything. You will, however, need to examine your DEC BASIC programs to make sure they don't do anything that would corrupt a UTF-8 character. For example, if you use EDIT$ to remove non-printable ASCII characters, you could easily delete bytes that are valid parts of multi-byte characters.
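
The kind of damage meant here can be sketched in Python (this is not DEC BASIC and does not model EDIT$ exactly; both filter functions and both sample names are made up for illustration). Dropping every byte outside printable ASCII silently loses a character, and dropping only part of a byte range can leave invalid UTF-8 behind.

```python
def strip_non_printable_ascii(data: bytes) -> bytes:
    """Keep only bytes 0x20..0x7E, a clean-up that is harmless for pure ASCII."""
    return bytes(b for b in data if 0x20 <= b <= 0x7E)

def strip_c1_range(data: bytes) -> bytes:
    """Drop bytes 0x80..0x9F, which are control codes in 8-bit character sets."""
    return bytes(b for b in data if not 0x80 <= b <= 0x9F)

# Case 1: both bytes of w-circumflex vanish, so the name is silently shortened.
name = "Ll\u0175yd".encode("utf-8")
cleaned = strip_non_printable_ascii(name)
print(cleaned)

# Case 2: only the second byte of the first character is dropped,
# leaving a lone lead byte that is no longer valid UTF-8.
city = "\u0141\u00f3d\u017a".encode("utf-8")
damaged = strip_c1_range(city)
try:
    damaged.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)
```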

If you need to display and/or edit the data, that can be much trickier. The easiest display method would be to stick it in an HTML file with the appropriate encoding indication in the header and open it with Mozilla, assuming you have the necessary Unicode fonts installed. If you need to edit the data in a VT-based application, you may need to convert back and forth between one of the various national character sets.
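
The HTML approach could look roughly like this, as a Python sketch under assumptions: `wrap_as_html` is a made-up helper, and the exact markup is one common way to declare the encoding, not something prescribed in this thread.

```python
import html

def wrap_as_html(utf8_data: bytes) -> bytes:
    """Wrap UTF-8 text in a minimal HTML page that declares its encoding."""
    text = utf8_data.decode("utf-8")
    page = (
        "<!DOCTYPE html>\n<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "</head><body><pre>"
        + html.escape(text)          # protect &, < and > in the data
        + "</pre></body></html>\n"
    )
    return page.encode("utf-8")

sample = "\u0175\u00ef & more".encode("utf-8")   # illustrative data
page_bytes = wrap_as_html(sample)
print(page_bytes.decode("utf-8"))
```

The data bytes are written unchanged apart from HTML escaping; the header only tells the browser how to read them.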

You may want to read chapter 10 of the CRTL manual, which has quite a bit about internationalization of applications and is not just limited to C usage:

http://h71000.www7.hp.com/doc/732FINAL/5763/5763pro_018.html#i18n_chap

Brian Duddy
Occasional Advisor

Re: unicode character set

Guys

Thanks for the help on this. The VMSI18N kit seems to be exactly what I want, and the link provided was excellent.

Cheers

brian