[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: draft-hoffman-utf16-01.txt available
At 19:11 98/12/29 -0800, Paul Hoffman / IMC wrote:
> I didn't see the announcement on this list, but I could have missed it.
> draft-hoffman-utf16-01.txt is now available; a copy is at
> <http://www.imc.org/draft-hoffman-utf16>. Based on comments from many
> people, this draft has some significant changes throughout. Please let me
> know what you think.
<http://www.ietf.org/internet-drafts/draft-hoffman-utf16-01.txt>
Hello Paul,
Here are my comments. In general, I think the draft has been
siginficantly improved.
- Throughout the whole text, "UTF-16" has two meanings, one
as a way to get from a sequence of numbers in a certain range
to a sequence of 16-bit values and back, and the other as
a "charset" value and the related character encoding.
In this sense, already the title is not very clear. Please
make it absolutely clear thoughout the draft of which thing
you speak every time "UTF-16" is mentionned.
- Abstract: There is no abstract. The abstract is what usually
appears in the ID/RFC anouncements. Is that a problem?
- 1. Intro: The first paragraph is crucial (it may be used as a
replacement for the abstract).
"This document specifies" -> "This document describes"
(the specification is elsewhere, I guess).
"UTF-16BE, UTF-16LE, and UTF-16" -> "UTF-16BE (big-endian),
UTF-16LE (little-endian), and UTF-16" (it's very important
to make sure people, in particular accidental readers, get
the meaning of these acronyms early on!
- 1.1 Background:
- Reference RFC 2130 for the definitions of CCS, CES.
- "UTF-16 ... is a character encoding scheme": This is
jumping a bit too fast. If we say charset == CCS(s) + CES
(which is an understanding that is widely used), there
are two or three CES.
- 1.2 Motivation:
- "UTF-8 imposes a space penalty for characters ... greater
than 0x800.": Not true. Correctly, it would say "UTF-8
imposes a space penalty for characters ... between
0x800 and 0xFFFF" (planes 1-16 take 4 bytes in both cases).
- "Also, characters in UTF-8 have varying sizes", "mostly
uniform in size": There is some truth there, and indeed
in some cases an advantage, but the way it's worded gives
more the impression of contradiction and unclearness than
anything else.
- "Note, however, that UTF-8 has many other advantages...":
You may cite my presentation on the properties of UTF-8
a few Unicode conferences ago. I don't have the reference
handy here (I'm over Siberia now :-), but you can find it
from my home page at http://www.w3.org/People/D%C3%BCrst.
- 2.1/2.2 Encoding/Decoding: The first sentence of each section
should include "from ... to ...", because you are only dealing
with one of two steps (and the one that is less important,
because there is less danger of people getting it wrong).
- 3.1 Definition of big-endian and little-endian.
- "two-octet entities" -> "16-bit entities".
- Please insert something like "On the other hand" before
the sixth line of the first paragraph, to make sure the
reader understands that there is a slight context switch.
- "Most modern hardware": This has been discussed here and
elsewhere more than enough, but the above wording is
unclear, has a high potential for not being true (depending
on the interpretation of "modern" and "hardware"), and
also has a potential to look quite a bit strange in a few
years. I would suggets to change it to "Most current PC hardware".
With this, you are on the safe side, and the reader gets the
right message.
- Maybe the example 0x0102 can be improved by choosing a different
number that doesn't lead to trivial byte values?
- "It is important ... that the beginnig of a stream ...":
Does the MIME spec speak about streams? If not, I would
propose to avoid that word.
- "The contrapositive ... is not always true ...": In general,
you are right. But this spec should contain rules for when a
0xFFFE is a BOM, and when it's a ZWNBSP, otherwise we are not
interoperable, and this sentence should contain a pointer to
these rules.
- "Also, some specifications mandate an initial 0xFFFE character
... and specify that this signature is not to be removed.":
All this stuff is blurring the architectural boundary between
the character layer and the application layer. This is a bad
idea.
- The text here should clearly say that the use of a BOM at the
beginning may result in desinchronization of various parts of
the system. For mail, that's maybe not so much of a problem,
but more for other systems (e.g. HTTP) that are also based on
MIME.
- 3.2/3.3 Serialization in UTF-16BE and UTF-16LE
You treat the BOM only on the receiver side. What are senders
supposed to do, put one in or not? My strong suggestion is to
say that the MUST NOT put one in. Note that for BE and LE, we
can still define things the way we think it's best, and we
should use that chance.
- 3.4 Serialization in UTF-16
"All applications that process text in the "UTF-16" charset
MUST be able to read at least the first two octets ...": This
sounds like it is permitted for a recipient to read the first
two bytes, decide e.g. that it is LE, and then stop because LE
is not supported. Is this what you intend to say? Or what?
4. Choosing a charset
- The MUST to use BE/LE if the encoding is known makes very
much sense in general. And the "and knows the serialization
of the characters in the text" is superfluous, because given
the rules for encoding of "UTF-16", it's rather impossible
to do anything without knowing that.
However, given that some software already accepts "UTF-16",
but doesn't know "UTF-16BE" and "UTF-16LE", just simply using
a MUST may not be appropriate.
- "should be sure the text is labelled" -> "should make sure ..."
- "Text-creating programs": This assumes a simple model of a
mail software with some built-in or connected text editor.
This should be reworded to apply to a wider range of models.
5. Examples
- I suggest you remove (or label as inappropriate) the example
of UTF-16BE with a BOM, and put in examples of big-endian
with and without a BOM for "UTF-16".
6. Versions
- "Each new version obsoletes and replaces": I think the
word "obsoletes" is much too strong here.
- "no implementations and no data ... likely to be true":
This is definetly false! I had (and still have somewhere)
an implementation of the old Hangul encoding and some data
such as keyboard tables and test data.
- There is some overlap between "6. Versions" and the appendix.
I suggest that you have a look at it again, decides what
needs to go where, and clearly separate things.
7. Security considerations
- "surrogate pairs that have illegal values": This is not
the only case of illegal values, a single surrogate is
also illegal.
- I'm still working on checking the conversion code (architecture-
independent writeout of BE or LE); I'll come back to that as soon
as possible. I think it's very important that we provide this
to help people write more stable code.
Hope this helps. Looking forwards to your questions
and comments,
Regards, Martin.
#-#-# Martin J. Du"rst, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org