
Re: internationalization/ISO10646 question - UTF-16



Chris Newman wrote:
> UTF-16 is a terrible encoding for interoperability.  There are 3 

Not true, especially when the encoding is declared properly. UTF-16 is interoperable, and for non-Latin text it is at least as compact as UTF-8, and usually more compact: most non-Latin BMP characters take two bytes in UTF-16 but three in UTF-8.
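
To make that concrete, a quick Python sketch (the sample strings are
arbitrary, chosen only to show the size difference):

    samples = {
        "Japanese": "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8",
        "English":  "plain ASCII text",
    }
    for name, text in samples.items():
        u8  = len(text.encode("utf-8"))      # 3 bytes per CJK character
        u16 = len(text.encode("utf-16-be"))  # 2 bytes per BMP character, no BOM
        print(f"{name}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
    # Japanese: UTF-8 24 bytes, UTF-16 16 bytes
    # English: UTF-8 16 bytes, UTF-16 32 bytes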

> published non-interoperable variants of UTF-16 (big-endian, 
> little-endian, BOM/switch-endian) and only one of the variants can be 

Yes, but the variants differ only in byte order and the presence of a BOM; converting between them is trivial.
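
For illustration, a small Python sketch of the three variants and the usual
BOM-based detection rule (simplified; a real decoder would also strip the
BOM before decoding):

    text = "\u4f8b"  # an arbitrary BMP character

    be  = text.encode("utf-16-be")  # b'\x4f\x8b' - big-endian, no BOM
    le  = text.encode("utf-16-le")  # b'\x8b\x4f' - little-endian, no BOM
    bom = text.encode("utf-16")     # BOM + the platform's native byte order

    def detect_utf16(data: bytes) -> str:
        if data[:2] == b"\xff\xfe":
            return "utf-16-le"
        if data[:2] == b"\xfe\xff":
            return "utf-16-be"
        return "utf-16-be"  # big-endian is the default when no BOM is present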

> auto-detected with any chance of success (and none of them can be 
> auto-detected as well as UTF-8).  It's not a fixed-width encoding, so 
> you don't get the fixed-width benefits that UCS-4 would provide (unless 

Well, few encodings are fixed-width, and some popular encodings (the ISO 2022 family, for example) are a lot more complicated. Fixed-width encodings are useful for in-memory processing, but that is not an issue for transport.
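
For reference, the variable width amounts to this (a minimal sketch;
U+1D11E is just an arbitrary plane-1 example):

    bmp    = "\u0041"      # 'A': one 16-bit code unit
    astral = "\U0001d11e"  # MUSICAL SYMBOL G CLEF: a surrogate pair

    for ch in (bmp, astral):
        units = len(ch.encode("utf-16-be")) // 2
        print(f"U+{ord(ch):04X}: {units} UTF-16 code unit(s)")
    # U+0041: 1 UTF-16 code unit(s)
    # U+1D11E: 2 UTF-16 code unit(s)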

Exchanging data over a wire in UTF-32/UCS-4 would be crazy. Compared with UTF-16, you would knowingly waste at least 33% and almost always 50% of your bandwidth transmitting zeros: every BMP character carries two all-zero high bytes, and even plane-1 characters never use the top 11 bits of a 32-bit code unit.
Besides, UTF-32 has the same 3 variants.
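
The arithmetic, as a quick sketch (the sample text is arbitrary: three BMP
characters plus one plane-1 character):

    text = "\u65e5\u672c\u8a9e\U0001d11e"

    u16 = text.encode("utf-16-be")
    u32 = text.encode("utf-32-be")
    zeros = u32.count(0)

    print(f"UTF-16: {len(u16)} bytes, UTF-32: {len(u32)} bytes")
    print(f"zero bytes in UTF-32: {zeros}/{len(u32)} ({zeros / len(u32):.0%})")
    # UTF-16: 10 bytes, UTF-32: 16 bytes
    # zero bytes in UTF-32: 7/16 (44%)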

> you ignore a slew of plane-1 characters) and it doesn't have any of the 

which occur rarely in practice

> useful characteristics of UTF-8 (nearly complete compatibility with code 
> written to operate on 8-bit character strings).

True, but in a MIME world you have to run input and output through a charset converter anyway, and that holds for any charset, not just UTF-16.
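
A minimal sketch of what that boundary conversion looks like (the charset
label would come from the MIME Content-Type header in a real protocol):

    def receive(body: bytes, charset: str) -> str:
        # decode wire bytes into the program's internal string type
        return body.decode(charset)

    def send(text: str, charset: str) -> bytes:
        # encode the internal string back into wire bytes
        return text.encode(charset)

    # the call sites look the same whether the declared charset is
    # UTF-8, UTF-16, or a legacy charset such as ISO-8859-1
    for label in ("utf-8", "utf-16-be", "iso-8859-1"):
        assert receive(send("interop", label), label) == "interop"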

> So this raises the question: why would any sensible protocol designer 
> ever want to transport UTF-16 over the wire?  There may be a few rare 
> corner cases where it makes sense, but in general UTF-8 is superior in 
> almost all instances.  I suspect the only reason we see UTF-16 on the 
> wire is because some programmers are too lazy to convert from an 
> internal variant of UTF-16 to interoperable UTF-8 on the wire, and 
> haven't thought through the bad consequences of their laziness.

Way overstated. UTF-16 and several other Unicode charsets are very useful, depending on the protocol. Since UTF-8 is not terribly efficient for non-Latin text, there is no particular reason to favor it over other Unicode charsets when one designs new protocols where ASCII compatibility is moot. IMHO.

Remember that UTF-8 was designed to shoehorn Unicode/UCS into Unix file systems, nothing more. Where 
ASCII byte-stream compatibility is not an issue, there are Unicode charsets that are more efficient 
than UTF-8, different ones for different uses.

Best regards,
markus

-- 
Opinions expressed here may not reflect my company's positions unless otherwise noted.