
Re: Revised proposal for UTF-16

To: [email protected]
Subject: Re: Revised proposal for UTF-16
From: "Martin J. Duerst" <[email protected]>
Date: Fri, 24 Jul 1998 14:51:17 +0900
Cc: Dan Kegel <[email protected]>,MURATA Makoto <[email protected]>,Harald Alvestrand <[email protected]>,Chris Newman <[email protected]>, [email protected],[email protected], [email protected]
In-reply-to: <[email protected]>
References: <[email protected]><[email protected]>

At 08:33 98/05/31 -0700, Erik van der Poel wrote:
> Dan Kegel wrote:
> > 
> > In the case of HTTP headers, we can probably consider the
> > entire HTTP header stream as a single message, and only require
> > the BOM at the beginning of the stream, e.g. the client and server
> > would each send the BOM as the first two bytes after opening the
> > socket.
> 
> No, HTTP headers are always encoded with one octet per character, even
> if the body is UCS-2 or UCS-4 (or UTF-16). You would have
> interoperability problems if you tried to send the headers themselves in
> UTF-16. A client could only send UTF-16 headers if it knew beforehand
> that the server could deal with it.

This is not exactly true. HTTP 1.1 allows MIME-encoded headers (the
(in)famous =? ? ? ?= syntax) in one very rare case (warnings). Other
protocols, in particular email, allow this, too.
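
To make the =? ? ? ?= syntax concrete, here is a minimal Python sketch that
decodes one MIME encoded word with the standard library's
email.header.decode_header. The sample header value is made up for
illustration and is not taken from any real message.

```python
from email.header import decode_header

# Hypothetical sample: "André" in ISO-8859-1, Q-encoded as an encoded word.
raw = "=?iso-8859-1?q?Andr=E9?="

# decode_header returns a list of (payload, charset) pairs; an encoded
# word yields its payload as bytes plus the charset label it declared.
payload, charset = decode_header(raw)[0]
text = payload.decode(charset)
print(charset, text)
```

The point is that the header itself stays one octet per character on the
wire; the charset only applies after the encoded word has been unwrapped.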

I don't think that we should worry about the general problem of what a
hypothetical new protocol will do with its headers and other protocol
elements. The correct way to design such a protocol is to take only
one, UCS-based, character encoding. The "charset" parameter and the
MIME tag "UTF-16" then become irrelevant, even if the protocol should
choose to use UTF-16. It will be the protocol designers' business to
make sure they get around the big/little-endian issue, and we have to
hope that they do so based on past experience.

I also don't think we should worry about UTF-16 being used raw in the
headers of traditional protocols. UTF-8 provides a much easier upgrade
path for this case, and doesn't have endian problems.
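
A small Python sketch of why UTF-8 avoids the endian problem: the same text
has exactly one UTF-8 byte sequence, but two possible UTF-16 sequences that
differ only in byte order. The sample string is an arbitrary choice for
illustration.

```python
text = "\u65e5\u672c"  # two BMP characters, chosen arbitrarily

utf8 = text.encode("utf-8")        # one possible byte sequence
be = text.encode("utf-16-be")      # big-endian, no BOM
le = text.encode("utf-16-le")      # little-endian, no BOM

# Swapping each pair of bytes turns one UTF-16 form into the other.
swapped = bytes(b for i in range(0, len(be), 2) for b in (be[i + 1], be[i]))

print(utf8.hex(), be.hex(), le.hex(), swapped == le)
```

Without a BOM or an out-of-band agreement, a receiver cannot tell the two
UTF-16 forms apart from the bytes alone.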

What I think we should worry about is whether and how UTF-16 should be used
in traditional protocol headers, based on MIME encoded words. Several
solutions are possible:

- Discourage or disallow UTF-16 in such headers (there are other
  cases, in particular Korean Email, where there are differences
  between the encoding used in the header and in the body).

- Use a different specification for these headers (headers would
  probably be in big-endian without a BOM, and nothing else,
  bodies could tolerate little-endian and/or recommend/mandate
  the BOM). The difference is justified because headers need
  additional encoding/decoding anyway, and the user expectations
  for their legibility are somewhat lower.

- Use exactly the same specifications for both headers and bodies.
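
As a rough illustration of the second option only, the following Python
sketch encodes a header as an encoded word holding big-endian UTF-16 without
a BOM, while the body decoder accepts either byte order when a BOM is
present. The function names, the use of the "UTF-16" label inside the
encoded word, the B (base64) encoding, and the big-endian fallback for
BOM-less bodies are all assumptions made for this example, not part of any
proposal in this thread.

```python
import base64

def encode_header_word(text: str) -> str:
    """Hypothetical helper: big-endian UTF-16, no BOM, base64 'B' encoding."""
    payload = base64.b64encode(text.encode("utf-16-be")).decode("ascii")
    return "=?UTF-16?B?" + payload + "?="

def decode_body(data: bytes) -> str:
    """Hypothetical helper: honor a BOM if present, else assume big-endian."""
    if data.startswith(b"\xfe\xff"):      # BOM in big-endian order
        return data[2:].decode("utf-16-be")
    if data.startswith(b"\xff\xfe"):      # BOM in little-endian order
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")       # assumed default without a BOM

print(encode_header_word("Hi"))
print(decode_body(b"\xff\xfeH\x00i\x00"))
```

The asymmetry is deliberate in this sketch: the header side has a single
fixed form, so only the body side needs any detection logic.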

Regards,   Martin.

