
RE: Fwd: Last Call: UTF-16, an encoding of ISO 10646 to Proposed



I concur with some of Harald's list of disadvantages for
UTF-16 as an interchange format, but find myself puzzled
by some of the others:

> My list of disadvantages:
> 
> - No compatibility with cstrings due to NULL

This is an obvious problem for interworking with APIs that
use 8-bit character sets. But I agree with François that this
issue will disappear over time as people create appropriate
interfaces to work with 16-bit strings. The real issue is not
the NULLs but the datatype difference.
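
As a rough illustration of the cstring problem, here is a minimal Python
sketch. The helper name c_strlen is hypothetical and only mimics how a C
routine would scan a byte buffer for a terminating zero byte; it is not
taken from any real API.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C strlen(): count bytes up to (not including) the first zero byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


text = "Hi"
as_utf8 = text.encode("utf-8")        # b'Hi'
as_utf16 = text.encode("utf-16-le")   # b'H\x00i\x00'

# An 8-bit API sees the whole UTF-8 string, but stops after one byte of
# the UTF-16 form, because every ASCII character carries a zero byte.
utf8_visible = c_strlen(as_utf8)
utf16_visible = c_strlen(as_utf16)
```

The truncation goes away once the buffer is treated as an array of 16-bit
units, which is the datatype difference mentioned above.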

> - Inability to represent characters outside Planes 0-16

WG2 and UTC are converging on a point of view that characters
outside of Planes 0-16 should *never* be assigned. This may be
formally written into 10646. The rationale here is that nearly
all 10646 implementations are following the Unicode Standard, by
necessity, to achieve interoperability in areas that are left
unspecified by 10646. Formalizing this convergence by constraining
the code space range that could ever be assigned standard characters
would close down this nagging issue of incompatibility between
the Unicode Standard and 10646. In that case, UTF-8, UTF-16, and
UTF-32 would *all* have the exact same representational capability,
and would all be completely interconvertible forms.
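
The reason UTF-16 stops at Plane 16 falls out of the surrogate arithmetic.
The following Python sketch uses hypothetical helper names (to_surrogates,
from_surrogates) and assumes the standard surrogate ranges D800..DBFF and
DC00..DFFF.

```python
def to_surrogates(cp: int):
    """Split a scalar value in 10000..10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("scalar value is not in Planes 1-16")
    v = cp - 0x10000                 # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into its scalar value."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair maps to the last code position of Plane 16.
max_scalar = from_surrogates(0xDBFF, 0xDFFF)
first_pair = to_surrogates(0x10000)
```

Anything above 10FFFF has no surrogate pair, so restricting assignments to
Planes 0-16 is what makes the three forms interconvertible.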

> - VERY bad expansion factor for characters outside Plane 0 (100% overhead)

This claim I do not understand at all:

Bytes per character:

scalar value     UTF-8   UTF-16   UTF-32
0..7F            1       2        4
80..7FF          2       2        4
800..FFFD        3       2        4
10000..10FFFD    4       4        4

The only size advantage for UTF-8 is for ASCII values, and UTF-16
has the clear size advantage for East Asian data.
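
The byte counts in the table above can be checked with a short Python
sketch. The function name encoded_sizes is hypothetical; the codec names
are the ones Python's standard library provides, using the big-endian
variants so that no byte order mark is counted.

```python
def encoded_sizes(cp: int):
    """Return (UTF-8, UTF-16, UTF-32) byte counts for one scalar value."""
    ch = chr(cp)
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )


# One sample from each end of every row in the table.
samples = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFD, 0x10000, 0x10FFFD]
sizes = {cp: encoded_sizes(cp) for cp in samples}

# East Asian text: three Plane 0 characters.
japanese = "日本語"
utf8_len = len(japanese.encode("utf-8"))
utf16_len = len(japanese.encode("utf-16-be"))
```

For the three-character Japanese sample, UTF-8 needs 9 bytes and UTF-16
needs 6.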

> - No ability to mix ASCII and UTF-16 elements in a simple viewer

This is a very important transitional and developmental advantage
for UTF-8, absolutely.

> - Two incompatible byte orders

Also an admitted problem for UTF-16 and UTF-32, but not significantly
more complex than defining interchange formats for any datatype
that has to be expressed in machine words wider than a byte.
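
A minimal Python sketch of handling the two byte orders follows. The
function name decode_utf16 is hypothetical, and falling back to big-endian
when no byte order mark is present is an assumption made for this example,
not something stated in this message.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading byte order mark if there is one."""
    if data.startswith(codecs.BOM_UTF16_BE):      # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):      # FF FE
        return data[2:].decode("utf-16-le")
    # Assumption for this sketch: unmarked data is big-endian.
    return data.decode("utf-16-be")


big = "A".encode("utf-16-be")       # b'\x00A'
little = "A".encode("utf-16-le")    # b'A\x00'

marked_le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
marked_be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
```

Both marked buffers decode to the same string, which is about as much work
as any other multi-byte integer field in an interchange format requires.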

> 
> My list of advantages:
> 
> - Does not require conversion between UCS-2 and UTF-16 when only Plane 0
>    characters are used in the UTF-16

UCS-2 is a dead issue in any case. All Unicode implementations should
at this point formally be UTF-16 implementations, whether they are
actually supporting the interpretation of surrogate pairs or not. If
they are claiming conformance to Unicode 2.0 or higher, then they
are UTF-16.
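
To make the UCS-2 versus UTF-16 distinction concrete, here is a small
Python sketch with hypothetical helper names. It shows that the 16-bit
code units are the same either way; only the interpretation of surrogate
pairs differs.

```python
import struct


def utf16_code_units(text: str):
    """Return the 16-bit code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def count_characters(units) -> int:
    """Count characters, treating a high+low surrogate pair as one character."""
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count


plane0 = utf16_code_units("A\u00e9")        # Plane 0 only
beyond = utf16_code_units("\U00010000")     # first character of Plane 1
```

For Plane 0 text the unit count and the character count agree, so no
conversion is needed. For the Plane 1 character there are two units but
one character, and an implementation that ignores surrogates would report
two.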

--Ken

> 
> Note that the single advantage may be listed as a disadvantage if there 
> turns out to be lots of applications that "support" UTF-16 the way they 
> currently "support" Unicode - by throwing away the high-order bits....
> 
>                      Harald A
> 
> --
> Harald Tveit Alvestrand, EDB Maxware, Norway
> Harald.Alvestrand@edb.maxware.no
>