
RE: internationalization/ISO10646 question



begin  quotation by Marcin Hanclik on 2002/11/25 21:09 +0100:
> Your explanation means that you cannot send UTF-16 encoding, because it
> cannot preserve CRLF.
> You could not send any unicode characters (apart from UTF-8) in MIME
> then!!!

As Ned said, you can't send UTF-16 in the "text" top-level media type in 
MIME (with a notable exception for the HTTP variant of MIME), but you could 
use it in an "application/text" media type in SMTP and MIME.  On the flip 
side, why would you want to?
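
To make the CRLF point concrete, here is a minimal Python sketch (an 
illustration only; the sample string and variable names are assumptions). 
It shows that the byte pair 0D 0A, which "text" line-ending handling looks 
for, is present in UTF-8 but absent from UTF-16, and that an unrelated 
UTF-16 character can produce those same two bytes.

```python
# Sample text with canonical CRLF line endings (hypothetical content).
text = "line one\r\nline two\r\n"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no BOM

CRLF = b"\r\n"  # the byte pair 0D 0A that line-oriented transports expect

# In UTF-8, CR and LF are the single bytes 0D and 0A, so the pair survives.
print(CRLF in utf8_bytes)

# In UTF-16, CR LF is 00 0D 00 0A, so the adjacent pair 0D 0A never occurs.
print(CRLF in utf16_bytes)

# Conversely, the single code point U+0D0A (in the Malayalam block) is
# encoded in big-endian UTF-16 as exactly the bytes 0D 0A.
false_crlf = "\u0d0a".encode("utf-16-be")
print(false_crlf == CRLF)
```

A gateway that rewrites line endings at the byte level would therefore 
miss the real line breaks and could corrupt unrelated characters.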

UTF-16 is a terrible encoding for interoperability.  There are three 
published non-interoperable variants of UTF-16 (big-endian, little-endian, 
BOM/switch-endian) and only one of the variants can be auto-detected with 
any chance of success (and none of them can be auto-detected as well as 
UTF-8).  It's not a fixed-width encoding, so you don't get the fixed-width 
benefits that UCS-4 would provide (unless you ignore every character 
outside the Basic Multilingual Plane, including a slew of plane-1 
characters) and it doesn't have any of the useful characteristics of UTF-8 
(nearly complete compatibility with code written to operate on 8-bit 
character strings).
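
The points above can be sketched in a few lines of Python (an illustration 
only; the sample string is an assumption, chosen to include one plane-1 
character). It shows the different byte sequences, the surrogate pair that 
makes UTF-16 variable-width, and how a wrong byte-order guess decodes 
silently to the wrong text.

```python
import codecs

# 'A' followed by U+10400, a character in plane 1 (outside the BMP).
s = "A\U00010400"

be = s.encode("utf-16-be")   # big-endian, no BOM
le = s.encode("utf-16-le")   # little-endian, no BOM
bom = s.encode("utf-16")     # BOM first, then the platform's byte order

# Same text, different bytes on the wire.
print(be.hex(" "))
print(le.hex(" "))
print(bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))

# Not fixed-width: two characters need three 16-bit code units, because
# U+10400 is encoded as the surrogate pair D801 DC00.
print(len(s), len(be) // 2)

# Guessing the wrong byte order raises no error here; it just yields
# different characters.
wrong = be.decode("utf-16-le")
print(wrong == s)

# UTF-8 keeps ASCII as single bytes with no embedded NULs, so code written
# for 8-bit strings mostly keeps working; UTF-16 does not.
utf8 = s.encode("utf-8")
print(b"\x00" in utf8, b"\x00" in be)
```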

So this raises the question: why would any sensible protocol designer ever 
want to transport UTF-16 over the wire?  There may be a few rare corner 
cases where it makes sense, but UTF-8 is superior in almost all 
instances.  I suspect the only reason we see UTF-16 on the wire is that 
some programmers are too lazy to convert from an internal variant of UTF-16 
to interoperable UTF-8 on the wire, and haven't thought through the bad 
consequences of their laziness.
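
For a sense of how small that conversion is, here is a minimal Python 
sketch. The function name `utf16_to_utf8` and the little-endian default are 
assumptions for illustration; a real program would use whatever byte order 
its internal buffers actually have.

```python
def utf16_to_utf8(internal: bytes, byte_order: str = "utf-16-le") -> bytes:
    """Transcode an in-memory UTF-16 buffer to UTF-8 for transmission.

    `byte_order` names the codec for the internal buffer; it is a
    hypothetical parameter, defaulting to little-endian without a BOM.
    """
    text = internal.decode(byte_order)
    return text.encode("utf-8")


# Simulated internal buffer holding UTF-16LE text with a CRLF line ending.
internal = "na\u00efve caf\u00e9\r\n".encode("utf-16-le")
wire = utf16_to_utf8(internal)
print(wire)
```

The receiver then needs no byte-order detection, and CRLF arrives as the 
plain bytes 0D 0A.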

See RFC 2277 -- the IETF has a clear policy recommending UTF-8 with good 
reason.

                - Chris