Re: internationalization/ISO10646 question - UTF-16
begin quotation by Markus Scherer on 2002/12/19 14:03 -0800:
> Chris Newman wrote:
>> UTF-16 is a terrible encoding for interoperability. There are 3
>
> Not true, especially if it's declared properly. It is interoperable, and
> it is at least as compact as, or more compact than, UTF-8 for all
> non-Latin texts.
If the people who created UTF-16 hadn't messed around with the BOM crap and
instead mandated network byte order in files, interfaces, and on the
network, then it would interoperate well. But in today's world, UTF-16
will interoperate just as well as TIFF does, since it made the same mistake
(actually worse than TIFF, since the BOM is optional). I've seen programs
which offer to save TIFF files in "Mac format" (big-endian) or "PC format"
(little-endian) -- just to show you how well that game works. Meanwhile
JFIF/JPEG, PNG, and GIF interoperate well because they mandated a single
byte order.
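To make that concrete, here's a minimal Python sketch (the string and codec
choices are purely illustrative) of the same three characters serialized
under the three UTF-16 variants -- only the last one carries any hint of
its own byte order, and even that hint is optional in practice:

    # Sketch: one string, three different UTF-16 byte streams.
    text = "Hi\u00e9"                       # 'H', 'i', U+00E9

    print(text.encode("utf-16-be").hex())   # "0048006900e9"  -- big-endian, no BOM
    print(text.encode("utf-16-le").hex())   # "48006900e900"  -- little-endian, no BOM
    print(text.encode("utf-16").hex())      # BOM first, then the host's native order
                                            # (e.g. "fffe48006900e900" on a little-endian host)

    # A receiver handed BOM-less bytes with no out-of-band charset label has
    # to guess the byte order -- the TIFF "Mac format"/"PC format" game again.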
>> published non-interoperable variants of UTF-16 (big-endian,
>> little-endian, BOM/switch-endian) and only one of the variants can be
>
> Yes, but the variants are minor - endianness and BOM.
But more than sufficient to cause user-visible interoperability problems.
See past experience with TIFF.
>> auto-detected with any chance of success (and none of them can be
>> auto-detected as well as UTF-8). It's not a fixed-width encoding, so
>> you don't get the fixed-width benefits that UCS-4 would provide (unless
>
> Well, few encodings are fixed-width, and some popular encodings are a lot
> more complicated. Fixed-width encodings are useful for processing, but
> this is not an issue for transport.
Exactly true. For transport, interoperability trumps all other
requirements. I brought this up because there is a common misconception
that UTF-16 is fixed-width. Well, it's mostly fixed-width -- meaning you
get none of the advantages of a fixed-width encoding, yet because the
variable-width case (surrogate pairs) is uncommon, it adds a new set of
interoperability problems related to those additional characters. It
violates the "avoid uncommon cases" design rule. Because UTF-8 is
variable-width in the common case, it's much more likely to interoperate
over the entire Unicode repertoire than UTF-16.
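A small Python sketch of both points (the characters are arbitrary
examples): a supplementary character forces a surrogate pair in UTF-16,
while a strict UTF-8 decoder rejects non-UTF-8 input outright, which is
what makes UTF-8 auto-detection work:

    # UTF-16 is only "mostly" fixed-width: anything outside the BMP, such as
    # U+1D11E (MUSICAL SYMBOL G CLEF), takes two 16-bit code units.
    s = "A\U0001D11E"
    print(len(s))                            # 2 characters
    print(len(s.encode("utf-16-be")) // 2)   # 3 code units: 'A' + surrogate pair D834 DD1E

    # A strict UTF-8 decoder, by contrast, refuses almost any non-UTF-8 input,
    # so UTF-8 can be auto-detected with high confidence.
    try:
        b"\xfe\xff\x00A".decode("utf-8")     # UTF-16-style bytes are not valid UTF-8
    except UnicodeDecodeError as err:
        print("not UTF-8:", err.reason)      # "invalid start byte"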
>> So this raises the question: why would any sensible protocol designer
>> ever want to transport UTF-16 over the wire? There may be a few rare
>> corner cases where it makes sense, but in general UTF-8 is superior in
>> almost all instances. I suspect the only reason we see UTF-16 on the
>> wire is because some programmers are too lazy to convert from an
>> internal variant of UTF-16 to interoperable UTF-8 on the wire, and
>> haven't thought through the bad consequences of their laziness.
>
> Way overstated. UTF-16 and several other Unicode charsets are very
> useful, depending on which protocol. Since UTF-8 is not terribly
> efficient, there is no particular reason to favor it over other Unicode
> charsets when one designs new protocols where ASCII compatibility is
> moot. IMHO.
Time and again people have created obscure binary protocols because they
are more "efficient". Most of these protocols have been huge failures
because they are vastly less efficient when it comes to interoperability
and diagnosability, which are usually far more important qualities than the
number of bytes on the wire. The minor space savings of UTF-16 relative to
UTF-8 do not justify the huge loss in interoperability. If space is an
issue, apply a general-purpose compression algorithm to UTF-8 -- that will
be vastly more efficient than UTF-16 without the loss of interoperability,
auto-detection, or the ability to re-use existing IETF protocol support
code.
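For example, a quick Python sketch (zlib stands in for whatever compression
a protocol would actually negotiate, and the sample text is arbitrary CJK,
which is UTF-16's best case):

    import zlib

    # CJK text: 3 bytes per character in UTF-8, 2 in UTF-16.
    text = "これは日本語のサンプルテキストです。" * 50

    utf8, utf16 = text.encode("utf-8"), text.encode("utf-16-be")
    print(len(utf8), len(utf16))             # raw sizes: UTF-8 is ~50% larger here
    print(len(zlib.compress(utf8)))          # compressed UTF-8: far smaller than raw UTF-16

    # The compressed stream beats raw UTF-16 handily, and after decompression
    # the payload is still plain, auto-detectable UTF-8 that existing code can handle.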
The most successful IETF application protocols have wisely sacrificed
attempts to conserve bytes in exchange for improved diagnosability,
interoperability and backwards-compatibility.
> Remember that UTF-8 was designed to shoehorn Unicode/UCS into Unix file
> systems, nothing more. Where ASCII byte-stream compatibility is not an
> issue, there are Unicode charsets that are more efficient than UTF-8,
> different ones for different uses.
That may be the history, but UTF-8 was designed far better than UTF-16 when
it comes to all aspects of interoperability. Thus it should be the
preferred encoding for all transport protocols and all interface points
between systems from different vendors. When UTF-16 is promoted instead of
UTF-8, I consider that detrimental to the deployment of Unicode.
All the UTF-16 APIs in Windows and MacOS are a huge barrier to deployment
of Unicode on those platforms since all the code has to be rewritten (and
most of it never is). If they had instead retro-fitted UTF-8 into the
existing 8-bit APIs, we'd have much better Unicode deployment.
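A rough Python sketch of why that retrofit works (the path is just an
illustration): UTF-8 leaves every ASCII byte meaning exactly what it always
did and never puts bytes below 0x80 inside a multibyte sequence, so
byte-oriented code that splits on '/', NUL, CRLF, and so on keeps working
unmodified:

    path = "/home/chris/résumé.txt".encode("utf-8")

    print(path.split(b"/"))      # byte-level split on the ASCII '/' is still correct
    print(b"\x00" in path)       # False -- no embedded NULs to break C-style string APIs

    # UTF-16 offers neither property: every ASCII character picks up a 0x00
    # byte, so every byte-oriented API has to be rewritten before it can
    # carry Unicode.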
- Chris