RE: Thoughts about characters transmission
> You've been referring to this encoding in the working group also. Do
> you have a description of how it works? All I need is enough
> information to be able to implement it --
OK.
> no arguments about why it
> is better than UTF-2 are necessary (I can figure that out for myself).
But, to me, it is easier to figure out IUTF than to add the argument. :-)
> A pointer to an ftp site or WWW or Gopher server would be fine too.
It's not so lengthy even with the argument. So, it is attached at the
end of this mail.
BTW, I now think that, if we are to use almost raw UTF2 as an interim encoding
without enough consideration for the many languages with non-European characters,
we should not use the two-octet UTF2 sequences beginning with T1. That is,
represent all non-ASCII characters with the three-octet form of UTF2. Then
the two-octet sequences are reserved for future international
assignment. Isn't that fair?
Masataka Ohta
IUTF (Internationalized UTF) is an interchange form for
ICODE that is compatible with UTF2 (UCS Transformation Format 2).
UTF2 is an ASCII-compatible, variable-length, multi-octet
interchange form for ISO 10646 proposed by X/Open.
UTF2 is designed considering
1) compatibility with the UNIX file system
2) compatibility with existing programs
3) easy conversion between UTF2 and ISO 10646
4) that code length can be determined by the first octet
5) that code length is short
6) finite resynchronizability
In UTF2, an octet is classified as
C0:0~32,127
A :33~126
Tx:128~191
T1:192~223
T2:224~239
T3:240~247
T4:248~251
T5:252~253
Ty:254~255(unused)
Then, the following combinations of octets
Octet Sequence          code of ISO 10646
C0                      0~32,127
A                       33~126
T1 Tx                   128~2047
T2 Tx Tx                2048~2^16-1
T3 Tx Tx Tx             2^16~2^21-1
T4 Tx Tx Tx Tx          2^21~2^26-1
T5 Tx Tx Tx Tx Tx       2^26~2^31-1
are used to represent characters in ISO 10646.
Resynchronization of character boundaries is possible by
scanning at most 6 octets.
Note that, with UTF2, all the characters of major European
languages can be represented in two octets and all the
existing characters of ISO 10646 can be represented in three
octets.
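
The UTF2 table above can be sketched in Python as follows. This is
only an illustration, not part of the proposal: the function names
are made up for this example, and the bit layout (lead octet carries
the high-order bits, each Tx octet carries six bits) is assumed from
the class ranges and is not spelled out in the text above.

```python
def utf2_encode(code: int) -> bytes:
    """Encode one ISO 10646 code (0 .. 2**31 - 1) following the table above."""
    if not 0 <= code < 2**31:
        raise ValueError("code out of range")
    if code < 128:
        return bytes([code])  # C0 or A: a single octet
    # (exclusive upper limit of the code range, lowest lead octet) for T1 .. T5
    ranges = [(2**11, 192), (2**16, 224), (2**21, 240), (2**26, 248), (2**31, 252)]
    for n_tx, (limit, lead) in enumerate(ranges, start=1):
        if code < limit:
            break
    octets = []
    for _ in range(n_tx):
        octets.append(128 | (code & 63))  # each Tx octet carries 6 bits (assumed)
        code >>= 6
    octets.append(lead | code)  # remaining high-order bits go in the lead octet
    return bytes(reversed(octets))


def utf2_length(first_octet: int) -> int:
    """Design goal 4: the sequence length follows from the first octet alone."""
    if first_octet < 128:
        return 1
    if first_octet < 192:
        raise ValueError("a Tx octet cannot start a sequence")
    if first_octet >= 254:
        raise ValueError("Ty octets are unused")
    for upper, length in ((224, 2), (240, 3), (248, 4), (252, 5), (254, 6)):
        if first_octet < upper:
            return length


def utf2_resync(data: bytes, pos: int) -> int:
    """Design goal 6: skip Tx octets to reach the next character boundary."""
    while pos < len(data) and 128 <= data[pos] <= 191:
        pos += 1
    return pos
```

Since a sequence holds at most five Tx octets, utf2_resync never has
to skip more than five octets before it reaches a lead octet.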
So, IUTF is designed considering
0) compatibility with UTF2
1) compatibility with the UNIX file system
2) compatibility with existing programs as an interchange code
3) fast conversion between IUTF and ISO 10646
4) that code length can be determined without looking
ahead extra octets
5) that code length is short
6) finite resynchronizability
that is, IUTF is upward compatible with UTF2 both in its format
and its design policy. Note that 2) is a rather meaningless
condition, as existing programs use a processing code (ICODE,
not IUTF, in this case) internally, which is also the processing
model of multibyte/wide characters in ANSI C and X/Open.
In IUTF, an octet is classified as
C0:0~32,127
A :33~126
A':33~46,48~126
C1:128~159
Tx:128~191
T1:192~223
T2:224~239(=S2+S3+S4+S6+S7)
S2:224~229
S3:230~235
S4:236~237
S6:238
S7:239
U1:240~255
Then, the following combinations of octets
Octet Sequence          code of ISO 10646
C0                      0~32,127
A                       33~126
T1 Tx                   128~2047
T2 Tx Tx                2048~65535
are used to represent characters exactly as in UTF2. Thus, IUTF is
compatible with UTF2. Then, the following combinations of
octets are available to represent extra characters.
Octet Sequence              number of code points represented
T1 A'                       2976
T2 A'                       1488
U1 A'                       1488
U1 Tx                       1024
T1 T2                       512
T1 U1                       512
U1 T2                       256
S2 Tx A'                    35712
S3 Tx A' Tx                 >2^21
S4 Tx A' Tx Tx              >2^25
S6 Tx A' Tx Tx Tx Tx        >2^36
S7 Tx A' Tx Tx Tx Tx Tx     >2^42
Thus, all the characters in 21-bit ICODE can be represented
in the four-octet form by a sequence beginning with S3.
Resynchronization of character boundaries is possible by
scanning at most 8 octets.
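
The counts in the table above follow directly from the sizes of the
octet classes, and can be checked with a few lines of Python. This is
just a sanity check; the names CLASS_SIZE and count are made up for
this example.

```python
from math import prod

# Number of octet values in each class, from the ranges listed above.
CLASS_SIZE = {
    "A'": len(range(33, 47)) + len(range(48, 127)),  # 33~46 and 48~126
    "Tx": len(range(128, 192)),
    "T1": len(range(192, 224)),
    "T2": len(range(224, 240)),
    "U1": len(range(240, 256)),
    "S2": len(range(224, 230)),
    "S3": len(range(230, 236)),
    "S4": len(range(236, 238)),
    "S6": 1,  # 238
    "S7": 1,  # 239
}


def count(sequence: str) -> int:
    """Number of distinct octet sequences matching e.g. "S2 Tx A'"."""
    return prod(CLASS_SIZE[name] for name in sequence.split())


TWO_OCTET = ["T1 A'", "T2 A'", "U1 A'", "U1 Tx", "T1 T2", "T1 U1", "U1 T2"]

for seq in TWO_OCTET + ["S2 Tx A'"]:
    print(f"{seq:10} {count(seq)}")
print("two-octet total:", sum(count(seq) for seq in TWO_OCTET))
```

The same function confirms the lower bounds for the longer forms, for
example count("S3 Tx A' Tx") > 2**21.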
IUTF thus has 8256 (= 2976 + 1488 + 1488 + 1024 +
512 + 512 + 256) extra two-octet representations and 35712 extra
three-octet representations, which can be used as shorthand
notations for characters such as frequently used non-European
characters. The actual assignment is not yet determined.
Hash tables could be used for fast translation from
ICODE to IUTF for such shorthand notations.
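
As a rough sketch of the hash table idea, the following Python builds
a lookup table from codes to spare "T1 A'" two-octet sequences. Since
the actual assignment is not yet determined, everything here is
hypothetical: the function name, the order in which sequences are
handed out, and the sample code values are placeholders only.

```python
from itertools import product

A_PRIME = [*range(33, 47), *range(48, 127)]  # class A': 33~46, 48~126
T1 = range(192, 224)                         # class T1: 192~223


def build_shorthand_tables(frequent_codes):
    """Assign each code one spare "T1 A'" sequence (hypothetical ordering)."""
    spare = [bytes(pair) for pair in product(T1, A_PRIME)]
    if len(frequent_codes) > len(spare):
        raise ValueError("more codes than spare T1 A' sequences")
    encode = dict(zip(frequent_codes, spare))            # code -> 2 octets
    decode = {seq: code for code, seq in encode.items()}  # 2 octets -> code
    return encode, decode


# Placeholder code values; a real table would hold the agreed assignment.
encode, decode = build_shorthand_tables([0x3042, 0x3044, 0x3046])
shorthand = encode[0x3044]          # dictionary (hash table) lookup
assert decode[shorthand] == 0x3044  # and back again
```

A code that is not in the table would simply fall back to its regular
UTF2-compatible form.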