RE: draft-hoffman-utf16-01.txt available
At 13:21 02/02/99 -0800, medavis2@us.ibm.com wrote:
>Francois Yergeau <yergeau@alis.com> on 02/02/99 12:34:14 PM
>>Since the problem with BOMs is their ambiguousness -- is it a real BOM or
>>an intended ZWNBSP? -- I currently lean toward a "SHOULD NOT put a BOM"
>>unless it's mandatory (such as in XML), in which case it is also
>>unambiguous.
>
>*** I disagree (if I understand you correctly).
>
>If we have the three labels, then as a sender my role is clear. If the text
>might come from a source that uses BOM (XML file, Windows file) send as
>UTF-16.
Two problems here: 1) Use UTF-16 only if you *know* that the source has a
BOM or that it's BE; otherwise you might send LE without a BOM. "Might
come" is not enough: not every Windows file will have a BOM, since that
depends on the creating application. 2) Are we sure that we want to drop
the more specific BE|LE tags, and thereby force the receiver to peek inside
the object, just because we know there is a BOM? I'm not convinced yet.
> If it doesn't (any other Unicode string!), then I will send
>UTF-16BE/LE (depending on the polarity).
And perhaps generate illegal stuff, because there turns out to be a BOM
after all. I'm questioning the enforceability of forbidding BOMs. Let's
not make a rule that cannot be obeyed by reasonable implementations in most
cases.
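
To make the sender side concrete, here is a rough sketch of one possible reading of the rule: label as UTF-16 only when a leading BOM is actually seen, and otherwise use a BE/LE tag only when the byte order is known. The function name `choose_label` and its `known_order` parameter are hypothetical, made up for illustration. Note that the sketch has no way of telling a real BOM from an intended initial ZWNBSP, which is exactly the ambiguity at issue.

```python
import codecs
from typing import Optional

def choose_label(data: bytes, known_order: Optional[str] = None) -> str:
    """Pick a charset label for UTF-16 data (hypothetical helper)."""
    # A leading FE FF or FF FE is treated as a BOM here; the sender cannot
    # distinguish it from an intended ZWNBSP without outside knowledge.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    if known_order == "BE":
        return "UTF-16BE"
    if known_order == "LE":
        return "UTF-16LE"
    # "Might come from a BOM source" is not enough to pick a label.
    raise ValueError("byte order unknown and no BOM present")
```

The sketch refuses to guess when neither a BOM nor a known byte order is available, rather than risk sending LE data under a bare UTF-16 label.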
>As a receiver, my role is also clear. If I receive UTF-16BE/LE, then any
>initial <FE,FF> is a real ZWNBSP.
...is purported to be a real ZWNBSP.
> If I receive UTF-16, then any initial
><FE,FF> is a BE BOM, any initial <FF,FE> is an LE BOM.
No argument here.
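
The receiver rules above can be sketched roughly as follows. The function name `decode_labeled` is hypothetical, and the fallback to big-endian when a UTF-16 label arrives without any BOM is an assumption of this sketch, not something settled in this thread.

```python
import codecs

def decode_labeled(data: bytes, label: str) -> str:
    """Decode UTF-16 data according to its charset label (hypothetical helper)."""
    label = label.upper()
    if label == "UTF-16BE":
        # Byte order comes from the label; an initial U+FEFF is kept as ZWNBSP.
        return data.decode("utf-16-be")
    if label == "UTF-16LE":
        return data.decode("utf-16-le")
    if label == "UTF-16":
        # An initial FE FF or FF FE is a BOM: use it to pick the order, then drop it.
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf-16-be")
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf-16-le")
        # Assumption: no BOM under a plain UTF-16 label means big-endian.
        return data.decode("utf-16-be")
    raise ValueError("unsupported label: " + label)
```

With the BE/LE labels the initial U+FEFF survives into the decoded text as a character; with the plain UTF-16 label it is consumed as a signature.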
>Let's face it--the BOM is a hack designed to work with systems where text
>streams are untagged. And unfortunately, it also has an equally valid other
>semantic (a price of the merger with 10646, since SC2 objected to having a
>character with only the semantic of the BOM.)
Sigh. Too true. Things would be much simpler if a BOM could only be a
BOM. Too late.
Facing this, there doesn't seem to be any ideal solution, guaranteed to
always work. We have to decide on the best compromise. I think it's
slightly better not to forbid BOMs completely with the BE/LE tags, but I'm
willing to go for forbidding them if that's the consensus. At this point,
the most important thing is to register those $%?&* tags in a workable
manner and make UTF-16 useable on the Internet.
One last thing:
>*** Even if XML did not require a BOM, it would not be unambiguous! Look at
>Appendix F in
>http://www.xml.com/axml/target.html#sec-guessing. The file would just have
>to have the initial '<?xml' like all other encodings.
Not necessarily. The initial '<?xml' is not required for UTF-16 (and
UTF-8) XML entities. Such entities may begin with a comment, white space
or an element start tag. It's the mandated BOM that makes UTF-16 entities
auto-detectable. Requiring the initial '<?xml' would be an incompatible
change to the XML spec, invalidating existing data and applications.
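
As a small illustration of that point, here is a sketch of BOM-based detection for a UTF-16 entity that has no XML declaration. The function name `sniff_utf16_xml` is hypothetical, and the sketch deliberately ignores every other encoding that a real auto-detection routine would have to consider.

```python
import codecs

def sniff_utf16_xml(entity: bytes):
    """Return the byte order signalled by a leading BOM, or None (hypothetical helper)."""
    if entity.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if entity.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return None

# An entity that begins with a comment, not with '<?xml'.
doc = "<!-- no XML declaration --><doc/>"
with_bom = codecs.BOM_UTF16_LE + doc.encode("utf-16-le")
without_bom = doc.encode("utf-16-le")

print(sniff_utf16_xml(with_bom))     # detected from the BOM alone
print(sniff_utf16_xml(without_bom))  # nothing to go on in this sketch
```

A check that looked only for an initial '<?xml' would miss this entity entirely, since it starts with a comment.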
--
François