Re: Proposed changes to UTF-8 draft



On Fri, Jan 10, 2003 at 11:23:46AM -0500, Francois Yergeau wrote:
> I wish to propose 2 changes to the UTF-8 draft:
> 
> (1) restrict to 4-byte sequences, i.e. remove the 5- and 6-byte sequences
> 
> (2) refer normatively to Unicode 3.2
> 
> The rationale for (1) is that Unicode is restricted to the 0-10FFFF range of
> code points and therefore 5- and 6-byte sequences cannot occur.  10646 is
> not officially so restricted but has a policy to not encode anything past
> 10FFFF and has actually removed Private Use Areas beyond 10FFFF to
> accommodate Unicode.  Another reason is that there is much Fear, Uncertainty
> and Doubt regarding this issue; an example is this mail excerpt received
> this morning on the ietf-822@imc.org list:

I think you should keep the specification aligned with 10646,
also in the interest of being liberal in what you accept, an old and
good IETF practice.
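
For reference, here is a minimal sketch of how sequence length follows
from the code point value under the original 31-bit UTF-8 scheme. The
function name utf8_length is only illustrative and comes from no spec.
It shows that everything up to 10FFFF fits in 4 bytes, and that 5- and
6-byte sequences only come into play from 200000 upward.

```python
def utf8_length(code_point: int) -> int:
    """Return the bytes needed under the original 1-to-6-byte UTF-8 scheme.

    utf8_length is a hypothetical helper for illustration only.
    """
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("outside the 31-bit range")
    # Largest code point representable with 1, 2, 3, 4, 5 and 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for length, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return length
    raise AssertionError("unreachable: range was checked above")
```

Restricting to the 0-10FFFF range therefore leaves part of the 4-byte
space unused as well, not only the 5- and 6-byte forms.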

> The rationale for (2) is that Unicode 3.2 now has a better, stricter
> definition of UTF-8 than 10646.  Specifically, the difference concerns the
> encoding of surrogate code points, in the range D800-DFFF.  10646 only has a
> Note (presumably non-normative) pointing out that the mapping of those code
> points to UTF-8 is undefined; it doesn't make it an error to decode UTF-8 to
> those code points, although it discusses other error cases, and therefore
> opens the door to the dangerous practice of decoding double-surrogate 6-byte
> sequences into a single non-BMP character.  The recent Unicode 3.2 spec of
> UTF-8 clearly and squarely forbids this practice and is therefore, IMHO,
> what the Internet spec of UTF-8 needs.  Using Unicode is also more
> consistent with (1).  10646 could remain as the normative reference for the
> characters themselves.
> 
> Opinions?

I think we should stick to open standards whenever possible,
and avoid industry standards like Unicode if we can.

10646 is pretty explicit about not using surrogates in UTF-8,
as far as I know. Always was.
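
To make the surrogate case concrete, here is a small sketch using
Python 3's built-in strict UTF-8 codec as a stand-in for a conforming
decoder. The helper name decodes_strictly and the sample byte strings
are my own illustration. The character U+10000 is used as the example:
its proper encoding is 4 bytes, while the double-surrogate form encodes
D800 and DC00 separately as two 3-byte sequences.

```python
def decodes_strictly(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes.

    decodes_strictly is a hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+10000 encoded directly as a 4-byte sequence.
four_byte = b"\xf0\x90\x80\x80"

# The same character as a surrogate pair (D800 DC00), each half
# encoded as its own 3-byte sequence: 6 bytes in total.
double_surrogate = b"\xed\xa0\x80\xed\xb0\x80"

accepted = decodes_strictly(four_byte)
rejected = not decodes_strictly(double_surrogate)
```

A decoder that accepted the 6-byte form would give one character two
different encodings, which is the practice being objected to.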

Kind regards
keld