Re: Encodings and the web
Anne van Kesteren, Tue, 20 Dec 2011 11:59:49 +0100:
> * More encodings in the registry than needed for the web
> * Error handling for encodings is undefined (can lead to XSS exploits,
> also gives interoperability problems)
> * Often encodings are implemented differently from the standard
Comment: In the HTML5 spec, the term 'character encoding' is used.
Perhaps this document should say the same? At least once ... for
instance in the title ...
Comment: The approach of the 'old' character sets registry is to
document the encodings in use, but not necessarily to endorse them. Do
you follow a similar approach? E.g. do you intend to list all encodings
and encoding labels, including obsolete ones? And if you make things
into aliases which previously were different character sets/encodings,
do you intend to point to the original specs or registrations? I have
the feeling that you take a synchronic approach - gloss over the past.
It appears simpler to contribute if the spec tries to be complete.
For instance, I could not find ISO-IR-111 in your list ... just to name
one character encoding that stuck in my mind ... It is a superset of
KOI8-R.
...
> The goal is to unify encoding handling across user agents for the web so
> legacy pages can be interpreted "correctly" (i.e. as expected by users).
As expected by users, you say. Or as UAs have created the expectations
... Users expect their pages to work. HTML5 says that UTF-32 is
explicitly not supported. And I think 'not supported anymore' should be
documented. I would suggest that the spec ought to take this approach:
With respect to 'dubious' encodings, UAs should be allowed to support any
legacy encoding they like unless it is explicitly listed as 'not
supported'. That way we get to quarrel about what to ban, rather than
about what to welcome.
As for 'users', I note that for IBM 864, for instance, you say
'since Presto has no support, may be we can remove it'. Opera is of
course the dominating browser ... Though I might not understand the
impact of the mobile Web in that statement - is Opera Mini popular in
Arabic countries? But to be certain: Where are the users in this line
of thought?
You thereafter say that 'Chromium only supports it because of Webkit'.
How do you know that? In my experience, Chromium appears almost biased
towards Arabic ... E.g. for unlabelled KOI8-R, it defaults to
Arabic ... At least on my computer and on this page - without the same
thing happening in Safari:
<http://www.malform.no/testing/utf/html/koi8/1>.
Personally, I'd like to see more robust detection of UTF-8 - and,
of course - also of UTF-16. As for UTF-8, it really ought to
be some kind of pre-default, before defaulting to the locale encoding.
(Opera and Chrome are perhaps closest to my wish in that regard.)
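To make the 'pre-default' idea concrete, here is a minimal sketch in
Python. It is an illustration only, not what any browser does: the
function name and the windows-1252 locale default are assumptions made
for the example. The idea is that bytes which decode cleanly as UTF-8
are taken to be UTF-8, and only otherwise does the locale encoding
apply.

```python
from typing import Tuple


def decode_with_utf8_predefault(
    data: bytes, locale_encoding: str = "windows-1252"
) -> Tuple[str, str]:
    """Return (text, encoding_used) for unlabelled content.

    Hypothetical helper: strict UTF-8 is tried first, and the locale
    encoding (an assumed default here) is used only when that fails.
    """
    try:
        # Strict decoding rejects any byte sequence that is not valid UTF-8.
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the locale default, replacing
        # bytes that the legacy encoding leaves undefined.
        return data.decode(locale_encoding, errors="replace"), locale_encoding
```

Pure ASCII is valid UTF-8, so it is reported as UTF-8 here. Legacy
single-octet text with non-ASCII characters rarely forms valid UTF-8
sequences, which is why a check like this tends to fall through to the
locale encoding for such pages.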
Btw, what is this spec's relation to the encoding sniffing algorithm of
HTML5 supposed to be?
And what are 'Encodings and the web'? Does XML fit in there? I think
some would like to say 'hopefully not' ...
> If you are interested in helping out testing (and reverse engineering)
> multi-octet encodings please let me know. Any other input is much
> appreciated as well
As part of my MS 'unicode' effort, I have created a test bed that I try
to update in my perceived spare time:
<http://www.malform.no/testing/utf/>. But it takes some time to analyze
and document it all. However, it is quite interesting ... I will find a
suitable place to post it when I'm ready.
One thing I've found, in that regard, is that browsers vary a good deal
in what they use to detect the encoding. For instance, they vary in
whether they use the XML prolog - with or without an encoding
declaration inside it, and including in HTML - when sniffing the encoding.
Chrome does use the XML prolog - at least it sniffs UTF-16LE and
UTF-16BE when the prolog is there, but not necessarily otherwise. If
you - as I think you do - want to eat into how not only HTML but also
XML handles encodings, perhaps HTML should accept being eaten into by
XML too? (I suggested for HTML5 that it should allow limited use of the
XML prolog, but guess if the Editor closed that bug ...)
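For reference, sniffing on the XML prolog can be sketched as below. This
is an assumption-laden illustration in Python, not a description of
Chrome's actual code: the function name is made up, and the byte
patterns are the BOM-less ones that the XML 1.0 autodetection appendix
gives for a document starting with '<?'.

```python
import re
from typing import Optional

# [\x22\x27] matches a double or a single quote around the encoding name.
_ENCODING_DECL = re.compile(
    rb"<\?xml[^>]*?encoding\s*=\s*[\x22\x27]([A-Za-z][A-Za-z0-9._-]*)[\x22\x27]"
)


def sniff_from_xml_prolog(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading XML prolog (hypothetical helper)."""
    if data.startswith(b"\x00<\x00?"):
        # '<?' stored as 16-bit units, high octet first, no BOM.
        return "utf-16be"
    if data.startswith(b"<\x00?\x00"):
        # '<?' stored as 16-bit units, low octet first, no BOM.
        return "utf-16le"
    if data.startswith(b"<?xml"):
        # ASCII-compatible prolog: read the encoding declaration, if any.
        match = _ENCODING_DECL.match(data)
        if match:
            return match.group(1).decode("ascii").lower()
    # No prolog, or a prolog without an encoding declaration.
    return None
```

A prolog without an encoding declaration still reveals UTF-16LE and
UTF-16BE, because the octet layout of '<?' alone gives the byte order
away.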
--
Leif H Silli