New draft-yergeau-rfc2279bis-02.txt
- To: ietf-charsets@iana.org
- Subject: New draft-yergeau-rfc2279bis-02.txt
- From: Francois Yergeau <FYergeau@alis.com>
- Date: Thu, 10 Oct 2002 00:09:14 -0400
Just submitted. Apart from date and filename, the only changes are in
section 6 "Byte Order Mark". They are extensive, in an attempt to
accommodate all the comments on the BOM. Here's the new section 6:
6. Byte order mark (BOM)
<36>
The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known
informally as "BYTE ORDER MARK" (abbreviated "BOM"). This character
can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but
the BOM name hints at a second possible usage of the character: to
prepend a U+FEFF character to a stream of UCS characters as a
"signature". A receiver of such a serialized stream may then use the
initial character as a hint that the stream consists of UCS
characters and also to recognize which UCS encoding is involved and,
with encodings having a multi-octet encoding unit, as a way to
recognize the serialization order of the octets. UTF-8 having a
single-octet encoding unit, this last function is useless and the BOM
will always appear as the octet sequence EF BB BF.
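As an illustration only (not part of the draft text), the following Python sketch serializes U+FEFF in three UCS encodings. Only the UTF-16 forms depend on octet order; the UTF-8 form is always the same three octets. The variable names are chosen for this example.

```python
# U+FEFF serialized in several UCS encodings.
BOM = "\ufeff"

utf8 = BOM.encode("utf-8")          # single-octet encoding unit: no byte order
utf16_be = BOM.encode("utf-16-be")  # big-endian serialization
utf16_le = BOM.encode("utf-16-le")  # little-endian serialization

print(utf8.hex(" ").upper())      # EF BB BF
print(utf16_be.hex(" ").upper())  # FE FF
print(utf16_le.hex(" ").upper())  # FF FE
```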
<37>
It is important to understand that the character U+FEFF appearing at
any position other than the beginning of a stream MUST be interpreted
with the semantics for the zero-width non-breaking space, and MUST
NOT be interpreted as a signature. When interpreted as a signature,
the Unicode standard suggests that an initial U+FEFF character may be
stripped before processing the text. Such stripping is necessary in
some cases (e.g. when concatenating two strings, because otherwise
the resulting string may contain an unintended "ZERO WIDTH NO-BREAK
SPACE" at the connection point), but might affect an external process
at a different layer (such as a digital signature or a count of the
characters) that is relying on the presence of all characters in the
stream. It is therefore RECOMMENDED to avoid stripping an initial
U+FEFF interpreted as a signature without a good reason, to ignore it
instead of stripping it when appropriate (such as for display) and to
strip it only when really necessary.
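A small Python sketch of the concatenation case may help here. It assumes two UTF-8 octet strings that each start with a signature; the helper name strip_signature is hypothetical and is not taken from the draft or from any library.

```python
import codecs

# Two UTF-8 strings, each carrying an initial signature (EF BB BF).
a = codecs.BOM_UTF8 + "foo".encode("utf-8")
b = codecs.BOM_UTF8 + "bar".encode("utf-8")

# Naive concatenation leaves the second signature in the middle of the
# text, where it must be read as ZERO WIDTH NO-BREAK SPACE.
naive = (a + b).decode("utf-8")

def strip_signature(data: bytes) -> bytes:
    """Hypothetical helper: drop one initial UTF-8 signature, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Stripping the signature of the second string avoids the stray character.
joined = (a + strip_signature(b)).decode("utf-8")

print(naive.count("\ufeff"), len(naive))    # 2 8
print(joined.count("\ufeff"), len(joined))  # 1 7
```

The differing lengths also show why stripping can disturb an external process that counts characters or signs the octets.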
<38>
U+FEFF in the first position of a stream MAY be interpreted as a
zero-width non-breaking space, and is not always a signature. In an
attempt at diminishing this uncertainty, Unicode 3.2 adds a new
character, U+2060 "WORD JOINER", with exactly the same semantics and
usage as U+FEFF except for the signature function, and strongly
recommends its exclusive use for expressing word-joining semantics.
Eventually, following this recommendation will make it all but
certain that any initial U+FEFF is a signature, not an intended "ZERO
WIDTH NO-BREAK SPACE".
<39>
In the meantime, the uncertainty unfortunately remains and may affect
Internet protocols. Protocol specifications MAY restrict usage of
U+FEFF as a signature in order to reduce or eliminate the potential
ill effects of this uncertainty. In the interest of striking a
balance between the advantages (reduction of uncertainty) and
drawbacks (loss of the signature function) of such restrictions, it
is useful to distinguish a few cases:
<40>
o A protocol SHOULD forbid use of U+FEFF as a signature for those
textual protocol elements that the protocol mandates to be always
UTF-8, the signature function being totally useless in those
cases.
<41>
o A protocol SHOULD also forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol provides
character encoding identification mechanisms, when it is expected
that implementations of the protocol will be in a position to
always use the mechanisms properly. This will be the case when
the protocol elements are maintained tightly under the control of
the implementation from the time of their creation to the time of
their (properly labelled) transmission.
<42>
o A protocol SHOULD NOT forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol does not
provide character encoding identification mechanisms, when a ban
would be unenforceable, or when it is expected that
implementations of the protocol will not be in a position to
always use the mechanisms properly. The latter two cases are
likely to occur with larger protocol elements such as MIME
entities, especially when implementations of the protocol will
obtain such entities from file systems, from protocols that do not
have encoding identification mechanisms for payloads (such as FTP)
or from other protocols that do not guarantee proper
identification of character encoding (such as HTTP).
<43>
When a protocol forbids use of U+FEFF as a signature for a certain
protocol element, then any initial U+FEFF in that protocol element
MUST be interpreted as a "ZERO WIDTH NO-BREAK SPACE". When a
protocol does NOT forbid use of U+FEFF as a signature for a certain
protocol element, then implementations SHOULD be prepared to handle a
signature in that element and react appropriately: using the
signature to identify the character encoding as necessary and
stripping or ignoring the signature as appropriate.
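One possible reading of these two rules, sketched in Python for a protocol element already known to be UTF-8. The function name decode_element and the flag signature_forbidden are assumptions made for this example, and stripping is only one of the reactions the text allows.

```python
from typing import Tuple

def decode_element(data: bytes, signature_forbidden: bool) -> Tuple[str, bool]:
    """Decode a UTF-8 protocol element.

    Returns (text, had_signature). Hypothetical helper, for illustration.
    """
    text = data.decode("utf-8")
    if signature_forbidden:
        # The protocol bans signatures here, so an initial U+FEFF is
        # content: a ZERO WIDTH NO-BREAK SPACE that must be kept.
        return text, False
    if text.startswith("\ufeff"):
        # Signatures are allowed: treat the initial U+FEFF as one and
        # strip it, reporting that it was present.
        return text[1:], True
    return text, False

print(decode_element(b"\xef\xbb\xbfabc", signature_forbidden=True))
print(decode_element(b"\xef\xbb\xbfabc", signature_forbidden=False))
```

A caller that must preserve every octet (for a digital signature, say) would keep the original data and use only the returned flag.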
--
François Yergeau
Alis Technologies inc.
+1 514 747 2547