Re: Proposed changes to UTF-8 draft



At 11:23 03/01/10 -0500, Francois Yergeau wrote:
>I wish to propose 2 changes to the UTF-8 draft:
>
>(1) restrict to 4-byte sequences, i.e. remove the 5- and 6-byte sequences
>
>(2) refer normatively to Unicode 3.2
>
>The rationale for (1) is that Unicode is restricted to the 0-10FFFF range of
>code points and therefore 5- and 6-byte sequences cannot occur.  10646 is
>not officially so restricted but has a policy to not encode anything past
>10FFFF and has actually removed Private Use Areas beyond 10FFFF to
>accommodate Unicode.  Another reason is that there is much Fear, Uncertainty
>and Doubt regarding this issue; an example is this mail excerpt received
>this morning on the ietf-822@imc.org list:
>
>Bruce Lilly wrote:
> >  From the point of view of parsing some stream of octets,
> > according to one "utf-8" specification a certain sequence
> > *is* a utf-8 sequence, and according to other "utf-8"
> > specifications it is *not* a utf-8 sequence. I.e. one
> > cannot design a parser to recognize "utf-8" from a sequence
> > of octets unless one specifies *which* of the mutually-incompatible
> > "utf-8" specifications is to be used, viz. whether or not the 5-
> > and 6-byte sequences are or are not "utf-8".
>
>It seems worthwhile to close that issue once and for all.

Just to be sure: Is a 4-byte sequence that encodes a codepoint
beyond 10FFFF legal in your new version of the draft or not?
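
To make the question concrete: the 4-byte pattern 11110xxx 10xxxxxx
10xxxxxx 10xxxxxx carries 21 payload bits, so it can express values up
to 1FFFFF, well past 10FFFF. Below is a minimal sketch in Python. The
helper names (decode_4byte, in_unicode_range) are my own and not taken
from the draft, and the code checks only the bit patterns, not overlong
forms or surrogates.

```python
UNICODE_MAX = 0x10FFFF  # upper limit of the Unicode code point range


def decode_4byte(seq: bytes) -> int:
    """Return the value carried by a 4-byte UTF-8 bit pattern.

    Only the lead/continuation bit patterns are checked; overlong
    forms and the 10FFFF limit are deliberately NOT enforced here.
    """
    if len(seq) != 4:
        raise ValueError("expected exactly 4 bytes")
    b0, b1, b2, b3 = seq
    if b0 & 0xF8 != 0xF0:  # lead byte must look like 11110xxx
        raise ValueError("lead byte is not 11110xxx")
    for b in (b1, b2, b3):
        if b & 0xC0 != 0x80:  # continuation bytes must look like 10xxxxxx
            raise ValueError("continuation byte is not 10xxxxxx")
    return ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)


def in_unicode_range(seq: bytes) -> bool:
    """True if the 4-byte pattern encodes a value no greater than 10FFFF."""
    return decode_4byte(seq) <= UNICODE_MAX


if __name__ == "__main__":
    for raw in (b"\xF4\x8F\xBF\xBF", b"\xF4\x90\x80\x80", b"\xF7\xBF\xBF\xBF"):
        print(raw.hex(" "), hex(decode_4byte(raw)), in_unicode_range(raw))
```

So F4 8F BF BF is the last sequence inside the Unicode range, while
F4 90 80 80 is a well-shaped 4-byte sequence that already encodes
110000. Restricting the length to 4 bytes does not by itself settle
whether such sequences are legal.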


Regards,     Martin.