
RE: Proposed changes to UTF-8 draft



Hi Francois,

I agree with your proposal on both (1) and (2) below.

I do NOT agree that the IETF "be forgiving in what you accept"
principle should be applied to the 5/6-byte UTF-8 character
question, as has been suggested.

Cheers,
- Ira McDonald
  High North Inc


-----Original Message-----
From: Francois Yergeau [mailto:FYergeau@alis.com]
Sent: Friday, January 10, 2003 10:24 AM
To: ietf-charsets@iana.org
Subject: Proposed changes to UTF-8 draft


I wish to propose 2 changes to the UTF-8 draft:

(1) restrict to 4-byte sequences, i.e. remove the 5- and 6-byte sequences

(2) refer normatively to Unicode 3.2

The rationale for (1) is that Unicode is restricted to the 0-10FFFF range of
code points and therefore 5- and 6-byte sequences cannot occur.  10646 is
not officially so restricted but has a policy to not encode anything past
10FFFF and has actually removed Private Use Areas beyond 10FFFF to
accommodate Unicode.  Another reason is that there is much Fear, Uncertainty
and Doubt regarding this issue; an example is this mail excerpt received
this morning on the ietf-822@imc.org list:

Bruce Lilly wrote:
>  From the point of view of parsing some stream of octets, 
> according to one "utf-8" specification a certain sequence
> *is* a utf-8 sequence, and according to other "utf-8"
> specifications it is *not* a utf-8 sequence. I.e. one
> cannot design a parser to recognize "utf-8" from a sequence
> of octets unless one specifies *which* of the mutually-incompatible
> "utf-8" specifications is to be used, viz. whether or not the 5-
> and 6-byte sequences are or are not "utf-8".

It seems worthwhile to close that issue once and for all.
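
To make the ambiguity concrete, here is a minimal sketch in Python of how a
parser might classify a lead octet.  The function name sequence_length and
the allow_long flag are made up for illustration; they come from no
specification.  With allow_long=True the octets F8-FD announce 5- and 6-byte
sequences, as in the older definition; with the default they are simply
invalid, which is what change (1) would mandate.

```python
def sequence_length(lead: int, allow_long: bool = False) -> int:
    """Return the sequence length announced by a lead octet, or 0 if the
    octet cannot start a sequence under the chosen definition."""
    if lead < 0x80:
        return 1          # 0xxxxxxx: single octet
    if 0xC0 <= lead <= 0xDF:
        return 2          # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3          # 1110xxxx
    if 0xF0 <= lead <= 0xF7:
        return 4          # 11110xxx
    if allow_long and 0xF8 <= lead <= 0xFB:
        return 5          # 111110xx, only in the older definition
    if allow_long and 0xFC <= lead <= 0xFD:
        return 6          # 1111110x, only in the older definition
    return 0              # continuation octet or invalid lead


if __name__ == "__main__":
    for octet in (0x41, 0xE2, 0xF0, 0xF8, 0xFC):
        print(hex(octet), sequence_length(octet), sequence_length(octet, True))
```

This only looks at the bit pattern of the lead octet.  A complete validator
would also have to reject overlong forms and any 4-byte sequence that decodes
to a value above 10FFFF.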


The rationale for (2) is that Unicode 3.2 now has a better, stricter
definition of UTF-8 than 10646.  Specifically, the difference concerns the
encoding of surrogate code points, in the range D800-DFFF.  10646 only has a
Note (presumably non-normative) pointing out that the mapping of those code
points to UTF-8 is undefined.  It discusses other error cases but does not
make it an error to decode UTF-8 to those code points, and therefore
opens the door to the dangerous practice of decoding double-surrogate 6-byte
sequences into a single non-BMP character.  The recent Unicode 3.2 spec of
UTF-8 clearly and squarely forbids this practice and is therefore, IMHO,
what the Internet spec of UTF-8 needs.  Using Unicode is also more
consistent with (1).  10646 could remain as the normative reference for the
characters themselves.
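
As an illustration of the surrogate issue, here is a small Python sketch.
The helper decode_three_byte is hypothetical and assumes its input is already
a well-formed 3-byte pattern (1110xxxx 10xxxxxx 10xxxxxx).  It shows the
stricter rule: a 3-byte sequence that decodes into D800-DFFF is an error, so
the 6-byte form ED A0 80 ED B0 80 cannot be accepted as U+10000, whose only
valid encoding is the 4-byte F0 90 80 80.

```python
def decode_three_byte(seq: bytes) -> int:
    """Decode one 3-byte UTF-8 sequence, rejecting surrogate code points.

    Assumes seq already matches the pattern 1110xxxx 10xxxxxx 10xxxxxx.
    """
    cp = ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate code point {cp:04X} is not valid in UTF-8")
    return cp


if __name__ == "__main__":
    print(hex(decode_three_byte(b"\xe2\x82\xac")))   # an ordinary BMP character
    for half in (b"\xed\xa0\x80", b"\xed\xb0\x80"):  # the two halves of the pair
        try:
            decode_three_byte(half)
        except ValueError as err:
            print("rejected:", err)
    # The valid encoding of U+10000 is a single 4-byte sequence.
    print("\U00010000".encode("utf-8").hex())
```

Python 3's own strict "utf-8" codec behaves the same way and raises
UnicodeDecodeError on the 6-byte form.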

Opinions?

-- 
François Yergeau