Re: Comments on draft-yergeau-rfc2279bis-00.txt

To: [email protected], [email protected], [email protected]
Subject: Re: Comments on draft-yergeau-rfc2279bis-00.txt
From: Dan Oscarsson <[email protected]>
Date: Wed, 17 Apr 2002 13:44:07 +0200
Sender: [email protected]

Martin Duerst wrote:

>Here are my comments on
>http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-00.txt.
>

>5. Byte order mark (BOM)
>
>This section needs more work. The 'change log' says that it's
>mostly taken from the UTF-16 RFC. But the BOM for UTF-8 is
>much less necessary, and much more of a problem, than for UTF-16.
>We should clearly say that with IETF protocols, character encodings
>are always either labeled or fixed, and therefore the BOM SHOULD
>(and MUST at least for small segments) never be used for UTF-8.
>And we should clearly give the main argument, namely that it
>breaks US-ASCII compatibility (US-ASCII encoded as UTF-8
>(without a BOM) stays exactly the same, but US-ASCII encoded
>as UTF-8 with a BOM is different).

Just what I have been thinking about. While it might have been nice to
require all UTF-8 encoded files to begin with a "magic" marker that
distinguishes them from other 8-bit character sets, we are already
past the stage where that could be introduced. So I recommend that
the BOM MUST never be used in UTF-8. Then you never have to expect it
when handling UTF-8.
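
To make the US-ASCII compatibility argument concrete, here is a small
Python sketch (an illustration only, not part of the draft). It
assumes Python's standard "utf-8-sig" codec, which prepends a BOM
when encoding; the sample string is arbitrary.

```python
# Pure US-ASCII text used for the comparison (arbitrary sample).
text = "Hello, world"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")          # UTF-8 without a BOM
utf8_bom_bytes = text.encode("utf-8-sig")  # UTF-8 with a leading BOM

# Without a BOM, US-ASCII text is byte-for-byte identical in UTF-8.
print(ascii_bytes == utf8_bytes)       # True

# With a BOM, the same text no longer matches its US-ASCII encoding.
print(ascii_bytes == utf8_bom_bytes)   # False

# The three extra leading bytes are the UTF-8 encoding of U+FEFF.
print(utf8_bom_bytes[:3])              # b'\xef\xbb\xbf'
```

A receiver that expects plain US-ASCII would see three unexpected
bytes at the start of the BOM-prefixed data.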

I would also very much like UTF-8 to require that Unicode
normalisation form C has been applied to the UCS text being encoded.
Otherwise the same character sequence can have
different UTF-8 encodings.
Although overlong UTF-8 sequences would pose no technical problem,
the document forbids them. This makes it impossible to
encode the same ASCII character sequence in several ways.
The same should apply to all characters in UCS: only
one form should be allowed.
As form C does not destroy any data and is the most compact, it is
the best choice.
So UTF-8 should REQUIRE the characters to be normalised
using form C. (Note: text normalised using form KC will
also work, since normalising it using form C results
in the same text.)
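
The following Python sketch illustrates the point (an illustration
only, using the standard unicodedata module; the sample strings are
arbitrary). The same visible character gets two different UTF-8
encodings until both inputs are normalised to form C, and text that
is already in form KC is left unchanged by form C.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Same character for the reader, two different byte sequences.
print(precomposed.encode("utf-8"))   # b'\xc3\xa9'
print(decomposed.encode("utf-8"))    # b'e\xcc\x81'

# After normalisation to form C both yield the same UTF-8 bytes.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a.encode("utf-8") == nfc_b.encode("utf-8"))   # True

# Text normalised to form KC is already in form C.
# Sample: the "fi" ligature, then "ance", then a combining acute.
nfkc = unicodedata.normalize("NFKC", "\ufb01ance\u0301")
print(unicodedata.normalize("NFC", nfkc) == nfkc)       # True
```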

Having both the BOM removed and form C required will make handling
of UTF-8 in software much simpler, as well as less prone to errors
and security problems.
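
As a rough sketch of how simple a receiver could then be, the Python
function below accepts only UTF-8 that has no BOM, no overlong or
otherwise invalid sequences, and is in form C. The name
decode_strict_utf8 is hypothetical and chosen for this illustration;
the draft defines no such interface. It relies on Python's built-in
UTF-8 decoder, which already rejects overlong sequences.

```python
import unicodedata

UTF8_BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def decode_strict_utf8(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8, rejecting BOM and non-NFC text."""
    if data.startswith(UTF8_BOM):
        raise ValueError("UTF-8 BOM is not allowed")
    # Raises UnicodeDecodeError (a ValueError subclass) on overlong
    # or otherwise invalid sequences.
    text = data.decode("utf-8")
    if unicodedata.normalize("NFC", text) != text:
        raise ValueError("text is not in normalisation form C")
    return text


def is_accepted(data: bytes) -> bool:
    """Return True if decode_strict_utf8 accepts the data."""
    try:
        decode_strict_utf8(data)
        return True
    except ValueError:
        return False


print(is_accepted(b"caf\xc3\xa9"))            # True: NFC, no BOM
print(is_accepted(UTF8_BOM + b"cafe"))        # False: leading BOM
print(is_accepted(b"\xc0\xaf"))               # False: overlong '/'
print(is_accepted(b"cafe\xcc\x81"))           # False: not in form C
```

With only one permitted byte sequence per text, comparing two
strings reduces to comparing their bytes.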

    Dan
