Re: internationalization/ISO10646 question - UTF-16

To: charsets <ietf-charsets@iana.org>
Subject: Re: internationalization/ISO10646 question - UTF-16
From: Markus Scherer <markus.scherer@jtcsv.com>
Date: Thu, 19 Dec 2002 14:03:12 -0800
Organization: IBM
Original-recipient: rfc822;ned+ietf-charsets@mrochek.com
References: <OLENIGGFKBOAIMPONAAJKEEPCDAA.mhanclik@poczta.onet.pl><2147483647.1039180421@nifty-jr.west.sun.com>
Spam-test: False ; 1.1 / 5.2
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.0.1)Gecko/20020823 Netscape/7.0 (nscd2)

Chris Newman wrote:
> UTF-16 is a terrible encoding for interoperability.  There are 3 

Not true, especially if it is declared properly. UTF-16 is interoperable, and for all non-Latin 
texts it is at least as compact as UTF-8, and often more compact.
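
A minimal sketch of that size comparison in Python. The sample strings and the name 
encoded_sizes are made up for illustration, and the UTF-16 count leaves out the BOM:

```python
# Illustrative sample strings; not taken from any real message.
SAMPLES = {
    "Latin": "charset",
    "Cyrillic": "кодировка",
    "Japanese": "日本語",
}


def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count), both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


for script, text in SAMPLES.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{script}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

With these samples the Latin string is smaller in UTF-8, the Cyrillic string is the same size 
in both, and the Japanese string is smaller in UTF-16.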

> published non-interoperable variants of UTF-16 (big-endian, 
> little-endian, BOM/switch-endian) and only one of the variants can be 

Yes, but the variants are minor - endianness and BOM.
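
To show how small the difference is, here is a rough Python sketch of a receiver that accepts 
all three forms. The function name decode_utf16 is hypothetical, and treating BOM-less input as 
big-endian is an assumption that follows the recommendation in RFC 2781:

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # Assumption: no BOM means big-endian (the RFC 2781 default).
    return data.decode("utf-16-be")


print(decode_utf16(b"\xfe\xff\x00A"))  # big-endian with BOM
print(decode_utf16(b"\xff\xfeA\x00"))  # little-endian with BOM
print(decode_utf16(b"\x00A"))          # no BOM, taken as big-endian
```

If the charset label already says UTF-16BE or UTF-16LE, the BOM check is unnecessary and the 
matching decoder can be called directly.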

> auto-detected with any chance of success (and none of them can be 
> auto-detected as well as UTF-8).  It's not a fixed-width encoding, so 
> you don't get the fixed-width benefits that UCS-4 would provide (unless 

Well, few encodings are fixed-width, and some popular encodings are a lot more complicated. 
Fixed-width encodings are useful for processing, but this is not an issue for transport.

Exchanging data over a wire in UTF-32/UCS-4 would be crazy. You would knowingly waste at least 33% 
and almost always 50% of your bandwidth transmitting 0s, compared with UTF-16.
Besides, UTF-32 has the same 3 variants.
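
The 50% figure for BMP-only text can be checked with a short Python sketch. The function name 
utf32_overhead is made up for illustration, and both encodings are measured without a BOM:

```python
def utf32_overhead(text: str) -> float:
    """Fraction of the UTF-32 byte count that UTF-16 would have saved."""
    utf16 = len(text.encode("utf-16-be"))
    utf32 = len(text.encode("utf-32-be"))
    if utf32 == 0:
        return 0.0  # empty text: nothing to save
    return 1 - utf16 / utf32


print(utf32_overhead("abc"))     # ASCII text
print(utf32_overhead("日本語"))  # BMP text outside ASCII
```

Code points stop at U+10FFFF, so the top 11 bits of every 32-bit unit are zero no matter what 
the text contains.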

> you ignore a slew of plane-1 characters) and it doesn't have any of the 

which occur rarely
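
For reference, the variable-width part amounts to the following. This is only a sketch of the 
surrogate-pair arithmetic, and the function name to_surrogates is hypothetical:

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000            # 20 significant bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)  # U+1D11E MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")
```

Everything at or below U+FFFF stays a single 16-bit unit, so text without such characters is 
effectively fixed-width.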

> useful characteristics of UTF-8 (nearly complete compatibility with code 
> written to operate on 8-bit character strings).

True, but in a MIME world you have to run input and output through a converter anyway, and that 
holds for any charset.

> So this raises the question: why would any sensible protocol designer 
> ever want to transport UTF-16 over the wire?  There may be a few rare 
> corner cases where it makes sense, but in general UTF-8 is superior in 
> almost all instances.  I suspect the only reason we see UTF-16 on the 
> wire is because some programmers are too lazy to convert from an 
> internal variant of UTF-16 to interoperable UTF-8 on the wire, and 
> haven't thought through the bad consequences of their laziness.

Way overstated. UTF-16 and several other Unicode charsets are very useful, depending on the 
protocol. Since UTF-8 is not terribly efficient, there is no particular reason to favor it over 
other Unicode charsets when one designs new protocols where ASCII compatibility is moot. IMHO.

Remember that UTF-8 was designed to shoehorn Unicode/UCS into Unix file systems, nothing more. Where 
ASCII byte-stream compatibility is not an issue, there are Unicode charsets that are more efficient 
than UTF-8, different ones for different uses.

Best regards,
markus

-- 
Opinions expressed here may not reflect my company's positions unless otherwise noted.

Follow-Ups:
- Re: internationalization/ISO10646 question - UTF-16
  - From: Chris Newman <Chris.Newman@Sun.COM>
- Re: internationalization/ISO10646 question - UTF-16
  - From: Keld Jørn Simonsen <keld@dkuug.dk>

References:
- RE: internationalization/ISO10646 question
  - From: Marcin Hanclik <mhanclik@poczta.onet.pl>
- RE: internationalization/ISO10646 question
  - From: Chris Newman <Chris.Newman@Sun.COM>
