Re: Encodings and the web

To: Anne van Kesteren <annevk@opera.com>
Subject: Re: Encodings and the web
From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 21 Dec 2011 16:55:30 +0100
Cc: ietf-charsets <ietf-charsets@iana.org>
In-reply-to: <op.v6sbhzwi64w2qv@annevk-macbookpro.local>
Organization: Målform.no
References: <op.v6sbhzwi64w2qv@annevk-macbookpro.local>

Anne van Kesteren, Tue, 20 Dec 2011 11:59:49 +0100:

> * More encodings in the registry than needed for the web
> * Error handling for encodings is undefined (can lead to XSS exploits,
>   also gives interoperability problems)
> * Often encodings are implemented differently from the standard

Comment: In the HTML5 spec, the term 'character encoding' is used. 
Perhaps this document should say the same? At least once ... for 
instance in the title ...

Comment: The approach of the 'old' character sets registry is to 
document the encodings in use, but not necessarily to endorse them. Do 
you follow a similar approach? E.g. do you intend to list all encodings 
and encoding labels, including obsolete ones? And if you make things 
into aliases which previously were different character sets/encodings, 
do you intend to point to the original specs or registrations? I have 
the feeling that you take a synchronic approach - gloss over the past. 
It appears simpler to contribute if the spec tries to be complete.

For instance, I could not find ISO-IR-111 in your list ... just to name 
one character encoding that stuck in my mind ... It is a superset of 
KOI8-R. 

 ...
> The goal is to unify encoding handling across user agents for the web so
> legacy pages can be interpreted "correctly" (i.e. as expected by users).

As expected by users, you say. Or rather, as UAs have shaped those 
expectations ... Users expect their pages to work. HTML5 says that 
UTF-32 is explicitly not supported, and I think 'not supported anymore' 
should be documented. I would suggest that the spec take this approach 
to 'dubious' encodings: UAs should be allowed to support any legacy 
encoding they like unless it is explicitly listed as 'not supported'. 
That way we get to quarrel about what to ban, rather than about what to 
welcome. 
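
To make the suggestion concrete, here is a minimal sketch of a 
'allowed unless banned' check. The names NOT_SUPPORTED and 
ua_may_support are made up for illustration, and the ban list holds 
only the one encoding named above; a real list would be whatever the 
spec ends up agreeing on.

```python
# Hypothetical ban list: only UTF-32 is named in this message.
NOT_SUPPORTED = frozenset({"utf-32"})


def ua_may_support(label: str) -> bool:
    """Return True unless the label is explicitly listed as not supported.

    Labels are compared case-insensitively after trimming whitespace;
    that normalisation is an assumption of this sketch.
    """
    return label.strip().lower() not in NOT_SUPPORTED
```

With this shape, an encoding such as IBM864 stays permitted by default, 
and the discussion is only about which labels to add to the ban list.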

As for 'users', I note that for IBM 864, for instance, you say 'since 
Presto has no support, may be we can remove it'. Opera is of course the 
dominating browser ... Though I might not understand the role of the 
mobile Web in that statement - is Opera Mini popular in Arabic 
countries? But to be certain: Where are the users in this line of 
thought?

You thereafter say that 'Chromium only supports it because of Webkit'. 
How do you know that? In my experience, Chromium appears almost biased 
towards Arabic ... E.g. it defaults to Arabic for unlabelled koi8-r, at 
least on my computer and on this page, whereas Safari does not: 
<http://www.malform.no/testing/utf/html/koi8/1>.

Personally, I'd like to see more robust detection of UTF-8 - and, of 
course, also of UTF-16. As for UTF-8, it really ought to be some kind 
of pre-default that is tried before defaulting to the locale encoding. 
(Opera and Chrome are perhaps closest to my wish in that regard.)
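
As a rough illustration of what I mean by a pre-default, the sketch 
below tries strict UTF-8 first and only then falls back to a locale 
encoding. It is not how any browser actually does it; the function name 
and the 'windows-1252' fallback are assumptions chosen for the example.

```python
from typing import Tuple


def decode_with_utf8_predefault(
    data: bytes, locale_encoding: str = "windows-1252"
) -> Tuple[str, str]:
    """Decode unlabelled bytes, trying UTF-8 before the locale default.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        # Strict decoding: any malformed sequence rejects UTF-8 outright.
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Fall back to the (assumed) locale encoding, never raising.
        return data.decode(locale_encoding, errors="replace"), locale_encoding
```

Legacy single-byte text with non-ASCII characters is rarely valid UTF-8 
by accident, which is why the strict attempt is a reasonable first 
step. Pure ASCII input also passes the UTF-8 check, which is harmless 
for ASCII-compatible encodings.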

Btw, what is this spec's relation to the encoding sniffing algorithm of 
HTML5 supposed to be?

And what are 'Encodings and the web'? Does XML fit in there? I think 
some would like to say 'hopefully not' ... 
 
> If you are interested in helping out testing (and reverse engineering)
> multi-octet encodings please let me know. Any other input is much
> appreciated as well

As part of my MS 'unicode' effort, I have created a test bed that I try 
to update in my perceived spare time: 
<http://www.malform.no/testing/utf/>. But it takes some time to analyze 
and document it all. However, it is quite interesting ... I will find a 
suitable place to post it when I'm ready.

One thing I've found, in that regard, is that browsers vary a good deal 
w.r.t. what they use to detect the encoding. For instance, they vary 
w.r.t. whether they use the XML prolog - with or without the encoding 
declaration inside it, and including in HTML - when sniffing the 
encoding. Chrome does use the XML prolog - at least it sniffs UTF-16LE 
and UTF-16BE when the prolog is there, but not necessarily otherwise. 
If you - as I think you do - want to eat into how not only HTML but 
also XML handles encodings, perhaps HTML should accept being eaten into 
by XML too? (I suggested for HTML5 that it should allow limited use of 
the XML prolog, but guess if the Editor closed that bug ...)
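
For readers who want to see what prolog-based sniffing can look like, 
here is a small sketch. It is not Chrome's code and not the HTML5 
sniffing algorithm; the function name and the order of the checks are 
my own assumptions. It looks at a BOM first, then at the byte pattern 
that '<?x' produces in UTF-16, and finally at an encoding declaration 
in an ASCII-compatible prolog.

```python
import re
from typing import Optional

# Encoding declaration inside an ASCII-compatible XML prolog.
_PROLOG_ENCODING = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_prolog_encoding(head: bytes) -> Optional[str]:
    """Guess an encoding from the first bytes of a document, or None."""
    # 1. Byte order marks.
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-16le" if False else "utf-8"
    if head.startswith(b"\xff\xfe"):
        return "utf-16le"
    if head.startswith(b"\xfe\xff"):
        return "utf-16be"
    # 2. '<?x' encoded as UTF-16 without a BOM.
    if head.startswith(b"<\x00?\x00x\x00"):
        return "utf-16le"
    if head.startswith(b"\x00<\x00?\x00x"):
        return "utf-16be"
    # 3. Encoding declaration in an ASCII-compatible prolog.
    match = _PROLOG_ENCODING.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

A prolog without an encoding declaration is still informative in the 
UTF-16 case, because the zero bytes around '<?x' reveal the byte order.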

-- 
Leif H Silli
