
Re: Are charset names supposed to be case sensitive?



Doug Ewell, Mon, 19 Dec 2011 08:36:10 -0700:

> It seems Leif might be trying to tag the incomplete or erroneous 
> behavior of individual applications, even if they don't correspond to 
> documented behavior, or to tag mis-documented behavior that may not 
> actually be implemented (like "unicode" meaning "BMP only").

* BMP: The only reason the registrations say 'BMP' is that the 
written spec says so and that the registration template asked for 
such data.

* Products: References to products are made in order to document that 
the 'unicode'/'unicodeFFFE' specs actually are implemented. In that 
regard, the possible 'BMP'-incorrectness seems far less important 
w.r.t. practical 'real' problems than the endianness issues.

* Actually implemented: That 'unicode' and 'utf-16' (in the Microsoft 
spec) are names for little-endian UTF-16, while 'unicodeFFFE' is a name 
for big-endian UTF-16, is a fact. To verify, try the following web page 
in Chrome, Safari or IE - the clue being that the page is 
'utf-16be'-encoded while HTTP says 'utf-16': 
http://malform.no/testing/utf/html/16be/http.utf16
   For reference, an identical, but little-endian encoded page:
http://malform.no/testing/utf/html/16le/http.utf16
   If IE and Safari/Chrome implemented the official UTF-16 
specification, the first page would work fine, while the second would 
not necessarily need to work. Instead, we see the opposite: the first 
page fails in the browsers mentioned.

* 'Actually implemented' has reached Web standards: HTML5 specifies: 
«The requirement to default UTF-16 to little-endian rather than 
big-endian is a willful violation of RFC 2781, motivated by a desire 
for compatibility with legacy content. [RFC2781]» 
<http://dev.w3.org/html5/spec/parsing.html#character-encodings-0> 
Whether it is 'legacy content' (as HTML5 claims), implementation of 
the Microsoft spec, or both that makes HTML5 say this is perhaps an 
open question.
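
To make the endianness point concrete, here is a minimal Python sketch 
of what happens when BOM-less big-endian bytes are labelled only as 
'utf-16'. It assumes the test page carries no byte order mark; the 
function name decode_unlabelled_utf16 and its 'default' parameter are 
hypothetical, made up for this illustration, and are not taken from any 
browser or spec.

```python
def decode_unlabelled_utf16(data: bytes, default: str) -> str:
    """Decode bytes labelled just 'utf-16'.

    A byte order mark, if present, decides the byte order. Otherwise
    fall back to `default`, which must be "be" (the RFC 2781 default)
    or "le" (the default that HTML5 requires).
    """
    if data.startswith(b"\xff\xfe"):
        return data[2:].decode("utf-16-le")
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-" + default)


# A big-endian page without a BOM (an assumption about the test page).
page = "<p>Hi</p>".encode("utf-16-be")

as_rfc2781 = decode_unlabelled_utf16(page, "be")  # readable markup
as_html5 = decode_unlabelled_utf16(page, "le")    # byte-swapped code units
```

With the big-endian default the markup comes back intact. With the 
little-endian default every code unit is byte-swapped, so '<' (U+003C) 
turns into U+3C00 and the page is unreadable, which is consistent with 
the first test page failing.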

> I'm not sure that's a goal of registering charsets. 

The goal of these registrations is to comply with section 2.5. In 
particular, this seemed relevant: «the use of a large number of 
undocumented and/or unlabeled charsets hampers interoperability even 
more.»
<http://tools.ietf.org/html/bcp19#section-2.5>

> It also seemed to 
> me—though I assume I'm wrong here—that he was trying to call 
> particular attention to errors in Microsoft implementations, but I'm 
> sure Shawn and others can speak to that.

It is not only Microsoft's products: WebKit is backed by Apple, 
Google, HTML5 ...

But given Microsoft's positive attitude towards Unicode, including 
UTF-16, it seems reasonable to ask: Is it certain that Microsoft - and 
the community at large - are aware that they operate with a shadow spec 
that contradicts UTF-16, and of the impact of this? Perhaps, with a 
little attention to this, they will update or fine-tune? Here is 
hoping. 
-- 
Leif Halvard Silli