
RE: internationalization/ISO10646 question

To: Marcin Hanclik <mhanclik@poczta.onet.pl>
Subject: RE: internationalization/ISO10646 question
From: Chris Newman <Chris.Newman@Sun.COM>
Date: Fri, 06 Dec 2002 13:13:41 -0800
Cc: ietf-charsets@iana.org
In-reply-to: <OLENIGGFKBOAIMPONAAJKEEPCDAA.mhanclik@poczta.onet.pl>
Original-recipient: rfc822;ned+ietf-charsets@mrochek.com
References: <OLENIGGFKBOAIMPONAAJKEEPCDAA.mhanclik@poczta.onet.pl>

begin quotation by Marcin Hanclik on 2002/11/25 21:09 +0100:
> Your explanation means that you cannot send UTF-16 encoding, because it
> cannot preserve CRLF.
> You could not send any unicode characters (apart from UTF-8) in MIME
> then!!!

As Ned said, you can't send UTF-16 in the "text" top-level media type in 
MIME (with a notable exception for the HTTP variant of MIME), but you could 
use it in an "application/text" media type in SMTP and MIME.  On the flip 
side, why would you want to?
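
The CRLF problem is easy to see at the byte level.  A minimal Python
sketch, assuming the "text" line-break rule means the literal octet pair
0D 0A must appear on the wire (the helper name crlf_octets_present is
purely illustrative):

```python
# Illustrative only: check whether the octet pair 0D 0A survives encoding.
text = "line one\r\nline two"


def crlf_octets_present(encoding: str) -> bool:
    """Return True if the encoded bytes contain the octets 0D 0A in sequence."""
    return b"\r\n" in text.encode(encoding)


for name in ("utf-8", "utf-16-be", "utf-16-le"):
    # UTF-16 interleaves a 00 octet, so CR and LF are never adjacent.
    print(name, crlf_octets_present(name))
```

In UTF-16 the line break becomes 00 0D 00 0A (big-endian) or 0D 00 0A 00
(little-endian), so software that scans for the two-octet CRLF never
finds it.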

UTF-16 is a terrible encoding for interoperability.  There are 3 published 
non-interoperable variants of UTF-16 (big-endian, little-endian, 
BOM/switch-endian) and only one of the variants can be auto-detected with 
any chance of success (and none of them can be auto-detected as well as 
UTF-8).  It's not a fixed-width encoding, so you don't get the fixed-width 
benefits that UCS-4 would provide (unless you ignore a slew of plane-1 
characters) and it doesn't have any of the useful characteristics of UTF-8 
(nearly complete compatibility with code written to operate on 8-bit 
character strings).
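
A rough Python sketch of these points, using the standard codecs only.
The sample string and variable names are my own choices for
illustration; U+10400 is just one convenient character outside the BMP:

```python
import codecs

# 'A' followed by U+10400, which UTF-16 must encode as a surrogate pair.
sample = "A\U00010400"

be = sample.encode("utf-16-be")    # 00 41 D8 01 DC 00
le = sample.encode("utf-16-le")    # 41 00 01 D8 00 DC
with_bom = sample.encode("utf-16")  # BOM, then the platform's byte order

# Only the BOM variant carries a marker a receiver can test for.
has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# Guessing the wrong byte order raises no error; it yields other text.
misread = be.decode("utf-16-le")

# Not fixed-width: two characters occupy three 16-bit code units.
code_units = len(be) // 2

# UTF-8 keeps ASCII octets unchanged and never emits a 00 octet here,
# so code written for 8-bit strings mostly keeps working.
utf8 = sample.encode("utf-8")      # 41 F0 90 90 80

print(be.hex(), le.hex(), has_bom, repr(misread), code_units, utf8.hex())
```

Note that the misread big-endian data decodes without complaint, which
is why detection by trial decoding works so much worse than it does for
UTF-8.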

So this raises the question: why would any sensible protocol designer ever 
want to transport UTF-16 over the wire?  There may be a few rare corner 
cases where it makes sense, but in general UTF-8 is superior in almost all 
instances.  I suspect the only reason we see UTF-16 on the wire is because 
some programmers are too lazy to convert from an internal variant of UTF-16 
to interoperable UTF-8 on the wire, and haven't thought through the bad 
consequences of their laziness.
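
The conversion itself is small.  A sketch in Python, assuming the
internal buffer is BOM-less UTF-16 of known byte order (the function
name utf16_to_wire and its byteorder parameter are hypothetical):

```python
def utf16_to_wire(internal: bytes, byteorder: str = "little") -> bytes:
    """Transcode a BOM-less in-memory UTF-16 buffer to UTF-8 for sending."""
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    # Decode to abstract characters first, then serialize as UTF-8.
    return internal.decode(codec).encode("utf-8")


# Example: an internal little-endian buffer holding "naïve" plus CRLF.
internal = "na\u00efve\r\n".encode("utf-16-le")
wire = utf16_to_wire(internal)
print(wire)
```

The result ends in the plain octets 0D 0A, so it is also safe for the
"text" media types.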

See RFC 2277 -- the IETF has a clear policy recommending UTF-8 with good 
reason.

                - Chris

