Comments on draft-yergeau-rfc2279bis-00.txt
Hello Francois,
Many thanks for your very quick work!
Here are my comments on
http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-00.txt.
- I prefer to get the .txt version rather than the .html version
if you send one before publishing. For I-Ds, the .txt is the
real thing.
<1> and some other places
ISO/IEC 10646-1 defines a multi-octet character set called the
Universal Character Set (UCS) which encompasses most of the world's
writing systems. Multi-octet characters, however, are not compatible
with many current applications and protocols, and this has led to the
development of UTF-8, the object of this memo.
While the title of ISO/IEC 10646 includes 'multi-octet', I think
this is confusing, because we want to clearly separate characters,
their numbers in the UCS, and the actual encoding into octets.
I suggest you remove 'multi-octet' everywhere except in the
formal title in the reference, and if necessary replace it
with something like 'large'.
<13>
o The lexicographic sorting order of strings is preserved. Of
course this is of limited interest since a sort order based on
character numbers is not culturally valid.
'preserved' with respect to what?
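(My reading is: preserved with respect to sorting by character
number. A small Python sketch of that property, on an arbitrary
sample of my own choosing:)

  words = ["z", "\u00E9", "\u4E2D", "a"]          # arbitrary sample
  by_number = sorted(words)                       # compares by character number
  by_octets = sorted(words, key=lambda s: s.encode("utf-8"))
  assert by_number == by_octets                   # octet-wise order is the same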
<14>
o The Boyer-Moore fast search algorithm can be used with UTF-8 data.
This should be worded more generally, at least by inserting
something like 'and similar algorithms'.
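The underlying reason such byte-oriented algorithms work is that
lead octets and continuation octets come from disjoint ranges, so a
byte-level match for a whole character always starts on a character
boundary. A small Python illustration (find() standing in for
Boyer-Moore or any similar byte search; the sample string is my own):

  text = "pr\u00E9f\u00E9rence".encode("utf-8")   # "préférence"
  pattern = "\u00E9".encode("utf-8")              # "é" = C3 A9
  i = text.find(pattern)                          # plain byte-level search
  assert i == 2                                   # match begins on a boundary
  text[:i].decode("utf-8")                        # prefix decodes cleanly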
<15>
o UTF-8 strings can be fairly reliably recognized as such by a
simple algorithm, i.e. the probability that a string of
characters in any other encoding appears as valid UTF-8 is low,
diminishing with increasing string length.
This should maybe mention the special case of a US-ASCII-only
string (which can be easily detected, but is then indistinguishable
from plain US-ASCII).
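To illustrate, the 'simple algorithm' can be as simple as attempting
a strict decode; a small Python sketch of my own (not draft text):

  # A successful strict UTF-8 decode means the data is very probably
  # UTF-8 -- but note that pure US-ASCII always succeeds, because any
  # US-ASCII string is also valid UTF-8.
  def looks_like_utf8(data: bytes) -> bool:
      try:
          data.decode("utf-8")
          return True
      except UnicodeDecodeError:
          return False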
<16>
UTF-8 was originally a project of the X/Open Joint
Internationalization Group XOJIG with the objective to specify a File
System Safe UCS Transformation Format [FSS_UTF] that is compatible
with UNIX systems, supporting multilingual text in a single encoding.
The original authors were Gary Miller, Greger Leijonhufvud and John
Entenmann. Later, Ken Thompson and Rob Pike did significant work for
the formal UTF-8.
formal UTF-8 -> formal definition of UTF-8 ?
<20>
In UTF-8, characters are encoded using sequences of 1 to 6 octets.
If the repertoire is restricted to the range U+0000 to U+10FFFF (the
Unicode repertoire)
I don't like the term 'Unicode repertoire'. But I don't have a better
term for the moment, unfortunately.
<25>
3. Fill in the bits marked x from the bits of the character number,
expressed in binary. Start from the lower-order bits of the
character number and put them first in the last octet of the
sequence, then the next to last, etc. until all x bits are
filled in.
This misses one important detail: the order in which the bits
are filled into each octet. This should be fixed. Maybe we can
make things even clearer, as follows:
Character number                 | UTF-8 octet sequence
(binary)                         | (binary)
---------------------------------+------------------------------------
0000000000000000000000000gfedcba | 0gfedcba
000000000000000000000kjihgfedcba | 110kjihg 10fedcba
0000000000000000ponmlkjihgfedcba | 1110ponm 10lkjihg 10fedcba
00000000000utsrqponmlkjihgfedcba | 11110uts 10rqponm 10lkjihg 10fedcba
000000zyxwvutsrqponmlkjihgfedcba | 111110zy 10xwvuts 10rqponm 10lkjihg
| 10fedcba
0EDCBAzyxwvutsrqponmlkjihgfedcba | 1111110E 10DCBAzy 10xwvuts 10rqponm
| 10lkjihg 10fedcba
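To nail down the filling order completely, here is a small Python
sketch of steps 1-3, covering all six rows of the table above (my
own illustration, not proposed draft text):

  def encode_utf8(n: int) -> bytes:
      """Encode character number n as 1 to 6 octets, per the table."""
      if n < 0x80:                                  # one octet: 0xxxxxxx
          return bytes([n])
      for count, lead in ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8), (6, 0xFC)):
          if n < (1 << (5 * count + 1)):            # 11/16/21/26/31 payload bits
              octets = []
              for _ in range(count - 1):            # lower-order bits go into the
                  octets.append(0x80 | (n & 0x3F))  # last octet first ...
                  n >>= 6
              octets.append(lead | n)               # ... high-order bits into the lead
              return bytes(reversed(octets))
      raise ValueError("character number too large")

  assert encode_utf8(0xD55C) == b"\xED\x95\x9C"     # cf. the Hangul example in <40>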
<32>
ISO/IEC 10646 is updated from time to time by publication of
amendments and additional parts; similarly, different versions of the
Unicode standard are published over time. Each new version obsoletes
and replaces the previous one, but implementations, and more
significantly data, are not updated instantly.
'different versions' gives the impression that these might be
diverging versions.
<33>
In general, the changes amount to adding new characters, which does
not pose particular problems with old data. Amendment 5 to ISO/IEC
10646, however, has moved and expanded the Korean Hangul block,
As far as I understand, amendments to ISO standards are numbered
separately for each edition. So we need to say clearly that this
is Amendment 5 to ISO/IEC 10646-1:1993. Also, saying when that
change happened (Ken?) will help bring things into perspective
for the new reader.
thereby making any previous data containing Hangul characters invalid
under the new version. Unicode 2.0 has the same difference from
Unicode 1.1. The official justification for allowing such an
incompatible change was that no implementations and no data
containing Hangul existed, a statement that is likely to be true but
remains unprovable.
As I personally had an implementation as well as some data
(in ET++, so this was also part of Lys), this is provably false.
I propose to change this to "The justification for allowing such an
incompatible change was that there were no major implementations
and no significant amounts of data containing Hangul."
<34>
New versions, and in particular any incompatible changes, have
consequences regarding MIME character encoding labels, to be
discussed in section 5.
'character encoding' -> '"charset"' (I fight against the term
'character set' or 'charset' quite a bit, but here, it's the
right word to use, because that's the name of the parameter.)
'New versions have consequences' sounds a bit strange. What about:
The consequences of versioning on MIME "charset" labels, in
particular in the case of incompatible changes, are discussed
in Section 5.
5. Byte order mark (BOM)
This section needs more work. The 'change log' says that it's
mostly taken from the UTF-16 RFC. But the BOM for UTF-8 is
much less necessary, and much more of a problem, than for UTF-16.
We should clearly say that with IETF protocols, character encodings
are always either labeled or fixed, and that therefore the BOM
SHOULD never (and, at least for small segments, MUST never) be
used with UTF-8. And we should clearly give the main argument,
namely that the BOM breaks US-ASCII compatibility: US-ASCII text
encoded as UTF-8 without a BOM stays exactly the same, whereas
US-ASCII text encoded as UTF-8 with a BOM does not.
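The whole argument fits in two lines of Python (the standard
'utf-8-sig' codec writes a BOM; my own illustration):

  assert "Hello".encode("utf-8") == b"Hello"                  # same octets as US-ASCII
  assert "Hello".encode("utf-8-sig") == b"\xEF\xBB\xBFHello"  # BOM destroys that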
<35>
The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
NO-BREAK SPACE" (U+FEFF), which is also known informally as "BYTE
ORDER MARK" (abbreviated "BOM"). The latter name hints at a second
possible usage of the character, in addition to its normal use as a
genuine "ZERO WIDTH NO-BREAK SPACE" within text. This usage,
suggested by Unicode section 2.7 and ISO/IEC 10646 Annex H
(informative), is to prepend a U+FEFF character to a stream of
Unicode characters as a "signature"; a receiver of such a serialized
Unicode characters -> UCS characters ?
stream may then use the initial character both as a hint that the
stream consists of Unicode characters, as a way to recognize which
UCS encoding is involved and, with encodings having a multi-octet
encoding unit, as a way to recognize the serialization order of the
octets.
The sentence that ends here is too long. Please split.
UTF-8 having a single-octet encoding unit, this last
function is useless and the BOM will always appear as the octet
sequence EF BB BF.
<40>
The character sequence representing the Hangul characters for the
Korean word "hangugo" (U+D55C, U+AD6D, U+C5B4) is encoded in UTF-8 as
follows:
Please say that this word means Korean (language) in Korean.
And it should probably be spelled hangugeo.
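(A quick way to double-check the expected octets, as a Python
sketch of my own:)

  # U+D55C U+AD6D U+C5B4 -> ED 95 9C EA B5 AD EC 96 B4
  assert ("\uD55C\uAD6D\uC5B4".encode("utf-8")
          == b"\xED\x95\x9C\xEA\xB5\xAD\xEC\x96\xB4")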
<41>
The character sequence representing the Han characters for the
Japanese word "nihongo" (U+65E5, U+672C, U+8A9E) is encoded in UTF-8
as follows:
Please say that nihongo means Japanese (language).
<42>
The character U+233B4 (a Chinese character meaning 'stump of tree'),
prepended with a UTF-8 BOM, is encoded in UTF-8 as follows:
Please don't give an example of a bad practice.
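For what it's worth, the same character encodes perfectly well
without a BOM; a quick Python check (my own illustration):

  # U+233B4 -> F0 A3 8E B4, four octets, no BOM
  assert "\U000233B4".encode("utf-8") == b"\xF0\xA3\x8E\xB4"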
<43>
This memo is meant to serve as the basis for registration of a MIME
character set parameter (charset) [RFC2978].
Obviously, UTF-8 is already registered. So I would reword this a bit,
maybe starting "This memo serves as the basis for the registration of...".
Then probably add an IANA consideration section where you say:
"Please update the reference for UTF-8 to point to this memo." or so.
8. Security Considerations
- Most of the attacks described have actually taken place.
I think some 'might's and 'could's should be changed so that
it's clearer that these are very realistic threats.
- It might be a good idea, here or somewhere else in the document,
to provide some regular expressions that fully check UTF-8 byte
sequences.
Here is one from the W3C validator, in Perl (because the /x
modifier allows whitespace and comments, this is rather readable :-):
s/ [\x00-\x7F] # ASCII
| [\xC2-\xDF] [\x80-\xBF] # non-overlong 2-byte sequences
| \xE0[\xA0-\xBF] [\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte sequences
| \xED[\x80-\x9F] [\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF] [\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3] [\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
//xg;
(this substitutes all legal UTF-8 sequences away; if there
is something left, it's not UTF-8). This is for planes 0-16 only.
Another is the ABNF from the usenet draft:
(http://www.ietf.org/internet-drafts/draft-ietf-usefor-article-06.txt)
UTF8-xtra-2-head= %xC2-DF
UTF8-xtra-3-head= %xE0 %xA0-BF / %xE1-EC %x80-BF /
%xED %x80-9F / %xEE-EF %x80-BF
UTF8-xtra-4-head= %xF0 %x90-BF / %xF1-F7 %x80-BF
UTF8-xtra-5-head= %xF8 %x88-BF / %xF9-FB %x80-BF
UTF8-xtra-6-head= %xFC %x84-BF / %xFD %x80-BF
UTF8-xtra-tail = %x80-BF
UTF8-xtra-char = UTF8-xtra-2-head 1( UTF8-xtra-tail ) /
UTF8-xtra-3-head 1( UTF8-xtra-tail ) /
UTF8-xtra-4-head 2( UTF8-xtra-tail ) /
UTF8-xtra-5-head 3( UTF8-xtra-tail ) /
UTF8-xtra-6-head 4( UTF8-xtra-tail )
This doesn't yet include US-ASCII. Either of them probably
needs a bit of work. This one covers the full 31-bit range
(sequences of up to 6 octets).
<59>
The encoding of your name and address, and of Alain's name and
mine, is messed up. Please don't try to smuggle something past
the I-D editor; it's not guaranteed to work.
Regards, Martin.