
Re: Are charset names supposed to be case sensitive?



Bjoern Hoehrmann, Tue, 20 Dec 2011 03:07:34 +0100:

> Reasons for why a label is problematic should be part of the registry,
> information on how certain browsers handle a certain name in the <meta>
> element in the process of detecting the encoding of an HTML document
> should not be. Right now I have trouble telling how to implement the
> two encodings you would like to register. What I would do is probably
> using my http://search.cpan.org/dist/Win32-MultiLanguage/ module to
> convert from the encodings to UTF-8 and look at the results, like if
> a "BOM" matters, how surrogates are handled, and so on. With test data
> you could then say this is how stuff works independently of HTML. If
> there are any issues with that, say things are different from how you
> handle UTF-16/LE/BE, that would be useful as well.

If it helps, this is my 'test bed': <http://malform.no/testing/utf/>. 
But it isn't ready yet - especially the first column, with KOI8-R. And 
there are more tests that could be added. As you can see, I only focus 
on HTML and XML. Though I also had a brief look at plain text - it 
seemed like at least IE did not accept UTF-16 as plain text.
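
For the UTF-16 cases, the idea is to serve the very same document in 
each flavor, with and without a BOM. Roughly along these lines, in 
Python - the file names and the markup below are just placeholders, 
not the actual test files:

  # Rough sketch: write one document as UTF-16LE and UTF-16BE, each
  # with and without a BOM (names and markup are placeholders).
  html = ('<!DOCTYPE html><html><head><meta charset="utf-16">'
          '<title>t</title></head><body>æøå</body></html>')
  for name, codec in (('utf-16le', 'utf-16-le'), ('utf-16be', 'utf-16-be')):
      data = html.encode(codec)
      with open(name + '-nobom.html', 'wb') as f:
          f.write(data)                            # no BOM
      with open(name + '-bom.html', 'wb') as f:
          f.write('\ufeff'.encode(codec) + data)   # BOM first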

The reason why there are so many tests - and some more to come - is 
exactly that I wanted to check for false positives/negatives etc. - 
things that aren't what they seem to be. So, for instance, WebKit 
seems to run some encoding sniffing against the XML declaration, both 
for HTML and for XML.
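
One way to check whether the XML declaration really is consulted - 
this is only a sketch of the idea, not the exact test file - is to 
make the declaration the sole source of the encoding, and to pick an 
encoding that no browser would guess by default, such as KOI8-R:

  # No HTTP charset, no <meta> charset, no BOM - the XML declaration
  # is the only place the encoding is stated. KOI8-R is nobody's
  # default, so if the Cyrillic renders correctly, the declaration
  # must have been sniffed.
  doc = ("<?xml version='1.0' encoding='koi8-r'?>\n"
         "<!DOCTYPE html><html><head><title>t</title></head>"
         "<body>привет</body></html>")
  with open('xmldecl-sniff.html', 'wb') as f:
      f.write(doc.encode('koi8_r'))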

The weirdest thing I have discovered is that IE sniffs *un-labelled* 
UTF-16BE just fine *when served on my computer*, but not when served 
from the above web site. I tried to check the HTTP headers, but could 
not spot anything that should have mattered.
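
For the record, I compared them roughly like this - the URLs below are 
placeholders, not the actual test files:

  # Dump the response headers from both servers so they can be
  # compared by hand (URLs are placeholders).
  from urllib.request import urlopen

  for url in ('http://malform.no/testing/utf/some-test.html',
              'http://localhost/utf/some-test.html'):
      with urlopen(url) as response:
          print(url)
          for key, value in response.getheaders():
              print('  ' + key + ': ' + value)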

Right now I am not in front of my Windows computer, but it seems that 
IE9 in XML mode is much better at coping with different flavors of 
UTF-16 than it is at handling the same in HTML. My suspicion is that 
this is primarily due to the nature of XML and not because it doesn't 
implement MS 'unicode' and MS 'unicodeFFFE' in XML mode.

Speaking of XML, there is also the issue that an XML parser has got 
to *know* the encoding label, or else it is supposed to be a fatal 
error. So, for instance, xmllint spits out a fatal error when it meets 
<?xml version='1.0' encoding='unicode' ?> - but web browsers do not do 
that. Firefox does react, though, if the label comes via HTTP's 
charset parameter.
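
The same point can be illustrated with Python's expat, which, as far 
as I can tell, treats the unknown label the same way as xmllint:

  # An encoding label the parser does not recognise is a fatal error
  # per XML 1.0; expat rejects it as an unknown encoding.
  import xml.parsers.expat

  doc = b"<?xml version='1.0' encoding='unicode' ?><root/>"
  parser = xml.parsers.expat.ParserCreate()
  try:
      parser.Parse(doc, True)
  except xml.parsers.expat.ExpatError as err:
      print('fatal:', err)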

> How HTML implementations might treat the labels, or whether someone may
> or may not want to implement the encoding, and other things like that, are
> secondary and should be looked at when the definition of the encoding,
> or perhaps the difficulties in defining the label, are clear.

I guess that makes sense, yes.
 
>> Meanwhile, perhaps my new version of the 'unicode' registration looks 
>> better?
> 
> You lost me at
> 
>       The 'unicode' spec defines 'utf-16' as its alias, but this of
>       course contradicts with 'utf-16' as defined in the IANA registry.
> 
> already. I can't tell for instance whether this would be still true if
> the label would be registered as you propose.

You have a point there. In the first iteration, I answered with a firm 
'no aliases'. Nevertheless, 'utf-16' is seen as an alias by the 
mentioned browsers - and perhaps even by HTML5? So I agree that I must 
add back the firm 'no aliases'.

Other reactions?
-- 
Leif H Silli