Re: Are charset names supposed to be case sensitive?
Doug Ewell, Mon, 19 Dec 2011 08:36:10 -0700:
> It seems Leif might be trying to tag the incomplete or erroneous
> behavior of individual applications, even if they don't correspond to
> documented behavior, or to tag mis-documented behavior that may not
> actually be implemented (like "unicode" meaning "BMP only").
* BMP: The only reason the registrations say 'BMP' is that the written
spec says so and that the registration template asked for such data.
* Products: References to products are made in order to document that
the 'unicode'/'unicodeFFFE' specs are actually implemented. In that
regard, the possible 'BMP' inaccuracy seems far less important
w.r.t. practical, real-world problems than the endianness issues.
* Actually implemented: It is a fact that 'unicode' and 'utf-16' (in the
Microsoft spec) are names for little-endian UTF-16, while 'unicodeFFFE'
is the name for big-endian UTF-16. To verify, try the following web page
in Chrome, Safari or IE - the clue being that the page is
'utf-16be'-encoded while HTTP says 'utf-16':
http://malform.no/testing/utf/html/16be/http.utf16
For reference, an identical, but little-endian encoded page:
http://malform.no/testing/utf/html/16le/http.utf16
If IE and Safari/Chrome implemented the official UTF-16
specification, the first page should work fine, while the latter
perhaps would not need to work. Instead, we see the opposite: The first
page fails in the mentioned browsers.
* 'Actually implemented' has reached Web standards: HTML5 specifies:
«The requirement to default UTF-16 to little-endian rather than
big-endian is a willful violation of RFC 2781, motivated by a desire
for compatibility with legacy content. [RFC2781]»
<http://dev.w3.org/html5/spec/parsing.html#character-encodings-0>
Whether it is 'legacy content' - as HTML5 claims - or implementation
of the Microsoft spec - or both things - that makes HTML5 say this, is
perhaps an open question.
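To make the endianness point concrete, here is a minimal Python sketch
of the two defaults. It is only an illustration, not any browser's
actual code: the helper name decode_utf16 and the sample markup are
hypothetical, and I assume the test page carries no BOM. A BOM, when
present, decides the byte order; without one, RFC 2781 says big-endian,
while the behaviour HTML5 describes is little-endian.

```python
import codecs


def decode_utf16(data: bytes, default: str) -> str:
    """Decode bytes labelled 'utf-16' (hypothetical helper).

    A leading BOM decides the byte order. Without a BOM, fall back to
    `default`: "be" (what RFC 2781 asks for) or "le" (what HTML5 describes).
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-" + default)


# Assumed stand-in for the first test page: big-endian bytes, no BOM.
page = "<p>Hi</p>".encode("utf-16-be")

rfc_result = decode_utf16(page, "be")    # RFC 2781 default
html5_result = decode_utf16(page, "le")  # little-endian default

print(repr(rfc_result))
print(repr(html5_result))
```

With the big-endian default the markup comes back intact. With the
little-endian default every code unit is byte-swapped, so '<' (U+003C)
is read as U+3C00 and the page no longer parses as HTML, which matches
the failure seen in the browsers mentioned.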
> I'm not sure that's a goal of registering charsets.
The goal of these registrations is to comply with section 2.5. In
particular, this seemed relevant: «the use of a large number of
undocumented and/or unlabeled charsets hampers interoperability even
more.»
<http://tools.ietf.org/html/bcp19#section-2.5>
> It also seemed to
> me—though I assume I'm wrong here—that he was trying to call
> particular attention to errors in Microsoft implementations, but I'm
> sure Shawn and others can speak to that.
It is not only products of Microsoft: Webkit is backed by Apple,
Google, HTML5 ...
But with Microsoft's positive attitude towards Unicode, including UTF-16,
it seems reasonable to ask: Is it certain that Microsoft - and the
community at large - are aware that they operate with a shadow spec
that contradicts UTF-16 - and of the impact of this? Perhaps, with a
little attention to this, they will update or fine-tune? Here is
hoping.
--
Leif Halvard Silli