RE: internationalization/ISO10646 question
begin quotation by Marcin Hanclik on 2002/11/25 21:09 +0100:
> Your explanation means that you cannot send UTF-16 encoding, because it
> cannot preserve CRLF.
> You could not send any unicode characters (apart from UTF-8) in MIME
> then!!!
As Ned said, you can't send UTF-16 in the "text" top-level media type in
MIME (with a notable exception for the HTTP variant of MIME), but you could
use it in an "application/text" media type in SMTP and MIME. On the flip
side, why would you want to?
UTF-16 is a terrible encoding for interoperability. There are 3 published
non-interoperable variants of UTF-16 (big-endian, little-endian,
BOM/switch-endian) and only one of the variants can be auto-detected with
any chance of success (and none of them can be auto-detected as well as
UTF-8). It's not a fixed-width encoding, so you don't get the fixed-width
benefits that UCS-4 would provide (unless you ignore a slew of plane-1
characters) and it doesn't have any of the useful characteristics of UTF-8
(nearly complete compatibility with code written to operate on 8-bit
character strings).
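
To make the points above concrete, here is a rough Python sketch (not
from the original thread; the variable names and the sample string are
my own illustrative choices). It encodes the same short string in UTF-8
and in the three UTF-16 variants, checks whether the CRLF byte pair
survives, and counts 16-bit code units for a plane-1 character.

```python
import codecs

sample = "A\r\n"  # one ASCII letter followed by a CRLF line ending

utf8 = sample.encode("utf-8")
utf16_be = sample.encode("utf-16-be")   # big-endian, no BOM
utf16_le = sample.encode("utf-16-le")   # little-endian, no BOM
utf16_bom = sample.encode("utf-16")     # BOM first, then platform byte order

# Does the two-byte sequence 0D 0A appear in the encoded stream?
crlf = b"\r\n"
crlf_survives = {
    "utf-8": crlf in utf8,
    "utf-16-be": crlf in utf16_be,
    "utf-16-le": crlf in utf16_le,
}

# UTF-16 is not fixed-width: a plane-1 character needs a surrogate pair.
plane1_char = "\U00010400"  # DESERET CAPITAL LETTER LONG I (plane 1)
units_bmp = len("A".encode("utf-16-be")) // 2
units_plane1 = len(plane1_char.encode("utf-16-be")) // 2

# UTF-8 leaves pure-ASCII text byte-for-byte unchanged.
ascii_compatible = utf8 == sample.encode("ascii")

# Only the BOM variant carries a marker a receiver can look for.
has_bom = utf16_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

print(crlf_survives, units_bmp, units_plane1, ascii_compatible, has_bom)
```

In both BOM-less variants the CR and LF are each padded with a zero
byte, so software that looks for the byte pair 0D 0A never finds it,
and the two streams differ from each other for identical text.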
So this raises the question: why would any sensible protocol designer ever
want to transport UTF-16 over the wire? There may be a few rare corner
cases where it makes sense, but in general UTF-8 is superior in almost all
instances. I suspect the only reason we see UTF-16 on the wire is because
some programmers are too lazy to convert from an internal variant of UTF-16
to interoperable UTF-8 on the wire, and haven't thought through the bad
consequences of their laziness.
See RFC 2277 -- the IETF has a clear policy recommending UTF-8 with good
reason.
- Chris