Re: internationalization/ISO10646 question - UTF-16
Chris Newman wrote:
> UTF-16 is a terrible encoding for interoperability. There are 3
Not true, especially if it's declared properly. It is interoperable, and it is at least as compact
as, or more compact than, UTF-8 for all non-Latin texts.
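To make the size comparison concrete, here is a rough Python sketch. The sample strings are my own hypothetical picks, not from any real message, and they are pure-script text: ASCII spaces, digits or markup mixed in would cost one byte each in UTF-8 and two in UTF-16, so real traffic may come out differently.

```python
# Hypothetical pure-script samples (assumption: no ASCII mixed in).
samples = {
    "Greek": "αβγδ",
    "Russian": "Привет",
    "Japanese": "こんにちは",
}

def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count), both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

for name, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{name}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

With these samples the Greek and Russian strings come out the same size in both encodings, and the Japanese one is smaller in UTF-16.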
> published non-interoperable variants of UTF-16 (big-endian,
> little-endian, BOM/switch-endian) and only one of the variants can be
Yes, but the variants are minor - endianness and BOM.
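As an illustration of how small the difference is, here is a minimal Python sketch of the three serializations and a decoder that copes with all of them. The helper name decode_utf16 and the big-endian fallback are my assumptions for the example; a real protocol would take the byte order from its charset declaration.

```python
import codecs

text = "A\u00e9"  # hypothetical two-character sample: 'A' and e-acute

big_endian = text.encode("utf-16-be")        # high byte first, no BOM
little_endian = text.encode("utf-16-le")     # low byte first, no BOM
with_bom = codecs.BOM_UTF16_BE + big_endian  # BOM announces the byte order

def decode_utf16(data, declared="utf-16-be"):
    """Decode by BOM when one is present, else by the declared byte order.

    The default of big-endian is an assumption made for this sketch.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode(declared)

print(big_endian.hex(), little_endian.hex(), with_bom.hex())
```

All three byte strings decode back to the same text once the receiver knows, or is told, the byte order.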
> auto-detected with any chance of success (and none of them can be
> auto-detected as well as UTF-8). It's not a fixed-width encoding, so
> you don't get the fixed-width benefits that UCS-4 would provide (unless
Well, few encodings are fixed-width, and some popular encodings are a lot more complicated.
Fixed-width encodings are useful for processing, but this is not an issue for transport.
Exchanging data over a wire in UTF-32/UCS-4 would be crazy. You would knowingly waste at least 33%
and almost always 50% of your bandwidth transmitting 0s, compared with UTF-16.
Besides, UTF-32 has the same 3 variants.
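A quick Python sketch of the overhead, using sample strings I made up for the purpose. It also checks that the top byte of every UTF-32 code unit is zero, which follows from code points stopping at U+10FFFF.

```python
def wire_sizes(text):
    """Return (UTF-16 byte count, UTF-32 byte count), both without a BOM."""
    return len(text.encode("utf-16-be")), len(text.encode("utf-32-be"))

def top_bytes_all_zero(text):
    """True if the most significant byte of every UTF-32 unit is zero."""
    data = text.encode("utf-32-be")
    return all(byte == 0 for byte in data[0::4])

bmp_text = "Hello, \u4e16\u754c"  # hypothetical sample, 9 BMP characters
supplementary = "\U0001d11e"      # one character outside the BMP

for label, sample in (("BMP", bmp_text), ("supplementary", supplementary)):
    utf16_len, utf32_len = wire_sizes(sample)
    print(f"{label}: UTF-16 {utf16_len} bytes, UTF-32 {utf32_len} bytes, "
          f"top bytes all zero: {top_bytes_all_zero(sample)}")
```

For the BMP sample UTF-32 is twice the size of UTF-16. The two only draw level on the supplementary character, where UTF-16 needs a surrogate pair.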
> you ignore a slew of plane-1 characters) and it doesn't have any of the
which occur rarely
> useful characteristics of UTF-8 (nearly complete compatibility with code
> written to operate on 8-bit character strings).
True, but if you use a converter for input/output anyway, as you must in a MIME world, then you
have to do that for any charset.
> So this raises the question: why would any sensible protocol designer
> ever want to transport UTF-16 over the wire? There may be a few rare
> corner cases where it makes sense, but in general UTF-8 is superior in
> almost all instances. I suspect the only reason we see UTF-16 on the
> wire is because some programmers are too lazy to convert from an
> internal variant of UTF-16 to interoperable UTF-8 on the wire, and
> haven't thought through the bad consequences of their laziness.
Way overstated. UTF-16 and several other Unicode charsets are very useful, depending on the
protocol. Since UTF-8 is not terribly efficient, there is no particular reason to favor it over
other Unicode charsets when one designs new protocols where ASCII compatibility is moot. IMHO.
Remember that UTF-8 was designed to shoehorn Unicode/UCS into Unix file systems, nothing more. Where
ASCII byte-stream compatibility is not an issue, there are Unicode charsets that are more efficient
than UTF-8, different ones for different uses.
Best regards,
markus
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.