Re: Comments on draft-yergeau-rfc2279bis-00.txt
Martin asked:
> At 09:12 02/10/17 -0700, Markus Scherer wrote:
> >Patrik Fältström wrote:
> >
> >>What I hear on this list is that the consensus is that BOM SHOULD NOT be
> >>used. I would like it to be MUST NOT be used in Internet protocols, which
> >>leads to tagged UTF-8 text be illegal if the BOM exists in the text.
> >
> >
> >That would violate the Unicode standard.
>
> Hello Markus,
>
> Can you give the details of why and how (in terms e.g. of conformance
> clauses in the Unicode Standard)?
I think Markus may have overstated the case.
It is certainly possible for an Internet protocol specification
to make BOM-initial UTF-8 text illegal *for that protocol*, if the
relevant protocol definers deem it such. That does not violate
the Unicode Standard, any more than if such a protocol made the
use of the backslash character illegal *for that protocol*.
The Unicode Standard allows BOM-initial UTF-8. It does so because
encoding conversions should not drop data when converting between
UTF-16 (or UTF-32), which might have an initial BOM, and UTF-8.
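To make the conversion point concrete, here is a minimal Python sketch
(an illustration only, not text from the standard). It assumes the
converter treats the initial U+FEFF as ordinary data, which is what
Python's explicit-endianness "utf-16-be" codec happens to do; the
variable names are made up for the example.

```python
import codecs

# UTF-16 big-endian data: an initial BOM (U+FEFF) followed by "hi".
utf16_bytes = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

# The explicit-endianness codec keeps U+FEFF as a character instead of
# consuming it as a signature, so no data is dropped here.
text = utf16_bytes.decode("utf-16-be")

# Converting to UTF-8 carries the BOM along as the bytes EF BB BF.
utf8_bytes = text.encode("utf-8")

# Converting back reproduces the original UTF-16 bytes exactly.
round_trip = utf8_bytes.decode("utf-8").encode("utf-16-be")
```

A converter that silently stripped the BOM would break this round trip,
which is the data-loss concern described above.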
However, the Unicode Standard does not require or recommend the use
of a BOM with UTF-8, since its use as a signature is superfluous
in that encoding form, and as y'all have discussed, if anything, it
is harmful in that context for many protocols and for the ASCII
compatibility of UTF-8 data streams. The exact wording being
considered for the Unicode 4.0 revision is:
"When represented in UTF-8, the byte order mark turns into
the byte sequence <EF BB BF>. Its usage at the beginning of
a UTF-8 data stream is neither required nor recommended by
the Unicode Standard, but its presence is not considered
non-conformant for the UTF-8 encoding scheme."
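As a rough sketch of how a protocol could apply a stricter rule than the
standard does, the Python below checks for the <EF BB BF> sequence at the
start of the data and rejects it. The function names and the "strict
protocol" itself are hypothetical; nothing here is prescribed by the
Unicode Standard or by any particular Internet protocol.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the data begins with the UTF-8 form of U+FEFF."""
    # codecs.BOM_UTF8 is the byte sequence EF BB BF.
    return data.startswith(codecs.BOM_UTF8)


def decode_for_strict_protocol(data: bytes) -> str:
    """Decode UTF-8 for a hypothetical protocol that forbids an initial BOM."""
    if starts_with_utf8_bom(data):
        raise ValueError("BOM-initial UTF-8 is not allowed in this protocol")
    return data.decode("utf-8")
```

A receiver that prefers to be lenient could instead decode with Python's
"utf-8-sig" codec, which skips one initial BOM if it is present.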
And then there is a bunch more language a bit later about the
care that is necessary when handling BOMs when converting between
encoding schemes.
--Ken