1752798 Members
5621 Online
108789 Solutions
New Discussion юеВ

UTF8 & AL32UTF8

 
ericfjchen
Regular Advisor

UTF8 & AL32UTF8

I am using Oracle DB. Can you explain what the difference is between UTF8 and AL32UTF8?
4 REPLIES 4
Florian Heigl (new acc)
Honored Contributor

Re: UTF8 & AL32UTF8

this has a really nice explanation:

http://www.mail-archive.com/perl-unicode@perl.org/msg02239.html

unfortunately... I can't tell wether it's correct. :))
yesterday I stood at the edge. Today I'm one step ahead.
Indira Aramandla
Honored Contributor

Re: UTF8 & AL32UTF8

Hi Eric,

The difference between UTF8 and AL32UTF8 are:

UTF8 stores Unicode characters with code points > U+FFFF as two surrogate characters, three bytes each

AL32UTF8 stores Unicode characters with code points > U+FFFF as one four-byte character

UTF8 will not be updated anymore when new Unicode versions are released, only AL32UTF8 and AL16UTF16 will.

Due to compatibility problems with pre-9i versions use UTF8 if you have Oracle8i clients connecting to the database. Use AL32UTF8 in pure Oracle9i environment.

UTF-8 encoding is variable-width. In UTF-8, each character can be represented by either one, two, or three bytes.

UTF8 is a varying width 1-3 bytes per character Unicode encoding. It is supported for both database and national character sets. It is a binary superset of US7ASCII. UTF8 corresponds to Unicode CESU-8 encoding.

AL32UTF8 is a varying width 1-4 bytes per character. It is supported for CHAR, VARCHAR2, LONG and CLOB only (database character set). It is a binary superset of UTF8 (in 9.2 only) and US7ASCII. AL32UTF8 corresponds to Unicode UTF-8 encoding.


IA

Never give up, Keep Trying
Indira Aramandla
Honored Contributor

Re: UTF8 & AL32UTF8

There is more.

This is what Metalink says. In Note: 237593.1

There is a possible problem for 817 and lower versions: Problems connecting to AL32UTF8 databases from older versions 8i and lower.

The default UTF8 characterset for 9i/10G is AL32UTF8, however this characterset is NOT recognised by any pre-9i clients/server systems.

Oracle recommends that you use UTF8 instead of AL32UTF8 as database characterset if you have 8i (or older) servers and clients connecting to the 9i/10g system until you can upgrade the older versions.

UTF8 is unicode revision 3.0 in 8.1.7 and up. AL32UTF8 is Unicode 3.0 in 9.0.1, Unicode 3.1 in 9.2, Unicode 3.2 in 10.1 and Unicode 4.01 in 10.2

Besides the difference in Unicode version the "big difference" is that AL32UTF8 has build in support for "Surrogate Pairs" (also known as Surrogate characters or "Supplementary characters").
Practically this means that in 99% of the cases you can use UTF8 instead of AL32UTF8 without any problem.

There are only a few situations where Surrogate Pairs are already used on client side. Windows system with HKSCS2001 (hong kong extension) is one of those. Note that you actually can *store* Surrogate Pairs in UTF8 but will store 2 * 3 byte characters and not like AL32UTF8 one 4 byte character.

Note that if you now use UTF8 as database characterset and -in the future you do a roll out of new 9i or higher clients and all your other databases are upgraded to 9i or higher, you can simply do a alter database characterset to go from UTF8 to AL32UTF8 so downtime will be limited to a few minutes if the need to go to AL32UTF8 should arise. There is no performance impact on staying on UTF8

NOTE:
This note is ONLY relevant if you have already a 9i AL32UTF8 database with data in. If you still need to create the 9i system then choose UTF8 instead of AL32UTF8 as database characterset in the database creation assistant.

So, *IF* you have already a 9i system running with AL32UTF8 then you can use the following steps in this note to change the database characterset to UTF8 without losing data.

You can't simply use "ALTER DATABASE CHARACTERSET" to go from AL32UTF8 to UTF8 because UTF8 is a SUB-set of AL32UTF8 (some codepoints which are correct in AL32UTF8 are not known in UTF8)

But again, UTF8 *contains* all characters know in AL32UTF8, the difference between them is pure the way some characters are stored (AL32UTF8 is a bit more efficient for some characters)

So you will run into ORA-12712 if you try alter database ...


IA
Never give up, Keep Trying
Yogeeraj_1
Honored Contributor

Re: UTF8 & AL32UTF8

hi,

see:
http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96529/glossary.htm#432012



UTF8

The UTF8 Oracle character set encodes characters in one, two, or three bytes. It is for ASCII-based platforms. The UTF8 character set supports Unicode 3.0. Although specific supplementary characters were not assigned code points in Unicode until version 3.1, the code point range was allocated for supplementary characters in Unicode 3.0. Supplementary characters are treated as two separate, user-defined characters that occupy 6 bytes.

UTF-8

The 8-bit encoding of Unicode. It is a variable-width encoding. One Unicode character can be 1 byte, 2 bytes, 3 bytes, or 4 bytes in UTF-8 encoding. Characters from the European scripts are represented in either 1 or 2 bytes. Characters from most Asian scripts are represented in 3 bytes. Supplementary characters are represented in 4 bytes.

AL32UTF8

An Oracle character set for the SQL CHAR data type, which is used for the database character set. It encodes Unicode data in the UTF-8 encoding.



note that:

utf-8 is an encoding scheme (like an algorithm)

utf8 is a characterset implemented using the utf-8 ENCODING scheme

UTF8 is a multi-byte encoding scheme, a single character may take many bytes of
storage.

Note that Oracle uses the following naming convention for character set names:

<#_of_bits_representing_a_char>[S] [C] [FIXED]

Note that UTF8 and UTFE are exceptions to this naming convention.

For instance:
* US7ASCII is the U.S. 7-bit ASCII character set
* WE8ISO8859P1 is the Western European 8-bit ISO 8859 Part 1 character set
* JA16SJIS is the Japanese 16-bit Shifted Japanese Industrial Standard character set


hope this helps too!
regards
yogeeraj
No person was ever honoured for what he received. Honour has been the reward for what he gave (clavin coolidge)