[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
RE: Comments on draft-yergeau-rfc2279bis-00.txt
On 17/04/2002 21:51:19 Francois Yergeau wrote:
> Martin Duerst wrote:
[...]
> > 5. Byte order mark (BOM)
> >
> > This section needs more work. The 'change log' says that it's
> > mostly taken from the UTF-16 RFC. But the BOM for UTF-8 is
> > much less necessary, and much more of a problem, than for UTF-16.
> > We should clearly say that with IETF protocols, character encodings
> > are always either labeled or fixed, and therefore the BOM SHOULD
> > (and MUST at least for small segments) never be used for UTF-8.
> > And we should clearly give the main argument, namely that it
> > breaks US-ASCII compatibility (US-ASCII encoded as UTF-8
> > (without a BOM) stays exactly the same, but US-ASCII encoded
> > as UTF-8 with a BOM is different).
>
> I don't quite see your point. A US-ASCII string, with or without a BOM, is
> always a valid UTF-8 string, I don't see where compatibility is broken. I
> can see that protocols shouldn't *require* a BOM, because then a strict
> (BOM-less) ASCII string wouldn't meet the requirement. But that's not what
> you're saying, right?
The point Martin may be making is that some tools insert a BOM
at the start of a resource they consider to be encoded using
UTF-8, but do not do so for a resource they consider to be
encoded using US-ASCII.
I have just carried out the following test. I opened Notepad
under Win2K and typed the letter "a". I then saved the file,
leaving the default encoding of "ANSI". I then saved the file
again, under a different name, specifying "UTF-8" as the
encoding. I then checked the file sizes using Properties.
The first file is 1 byte long; the second 4 bytes.
Misha
[...]
------------------------------------------------------------- ---
Visit our Internet site at http://www.reuters.com
Any views expressed in this message are those of the individual
sender, except where the sender specifically states them to be
the views of Reuters Ltd.