
Re: Encodings and the web



Hello Anne,

On 2011/12/20 19:59, Anne van Kesteren wrote:
> Hi,
>
> When doing research into encodings as implemented by popular user agents I
> have found the current standards lacking. In particular:
>
> * More encodings in the registry than needed for the web

That's not a surprise. I think there are two reasons. One is that the 
main initial contributor tried to be more general than necessary, and 
included things that aren't actually usable, such as just the 
double-byte parts of some multibyte standards, and so on. The second is 
that, believe it or not, there's actually stuff outside the Web.

There is absolutely no problem for a Web-related spec to say "these are 
the encodings you should support in a Web UA, and not more". The reason 
we didn't do something like this in the RFC 2070 or HTML4 timeframe is 
that at that time it wasn't opportune yet to say "no more new encodings, 
please use Unicode".

> * Error handling for encodings is undefined (can lead to XSS exploits,
> also gives interoperability problems)

We don't have security considerations for charset registrations, but we 
probably should. If there are any specific issues, I'm sure we will find 
a way to add them to the registry, so please send them here.

> * Often encodings are implemented differently from the standard

There are often minor differences. ICU has a huge collection of 
variants. Getting rid of them may be impossible, unfortunately. As an 
example, both Microsoft and Apple have long-standing traditions for some 
differences, and neither may be very willing to change.

It would be good to know where and how your tables differ from "the 
standard".

It would also be good to know what it is that you refer to by "the 
standard". (I guess I would know if there were a "standard" for all 
character encodings, but I don't know of such a thing.)

> A year ago I did some research into encodings[1] and more detailed for
> single-octet encodings[2] and I have now taken that further into starting
> to define a standard[3] for encodings as they are to be implemented by
> user agents. The current scope is roughly defining the encodings, their
> labels and name, and how you match a label.
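If I understand the label-matching part of your draft correctly, it amounts to 
something like the following sketch (the table entries here are illustrative, 
not the full registry, and the exact whitespace/case rules are my assumption):

```python
# Hypothetical label table -- a few illustrative entries only.
LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "unicode-1-1-utf-8": "utf-8",
    "latin1": "windows-1252",
    "iso-8859-1": "windows-1252",
}

ASCII_WHITESPACE = " \t\n\f\r"

def match_label(label):
    """Strip ASCII whitespace, lowercase, then look the label up.

    Returns the canonical encoding name, or None for unknown labels.
    """
    return LABELS.get(label.strip(ASCII_WHITESPACE).lower())

print(match_label("  UTF8 "))   # 'utf-8'
print(match_label("Latin1"))    # 'windows-1252'
print(match_label("bogus"))     # None
```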

I have quickly looked at the document. Many single-octet encodings only 
show a few rows. As an expert, I have a fairly good sense of where to 
look for the rest of the table, but it would really be better if there 
were an explicit pointer from each incomplete table to the table used to 
fill in the missing rows.

Also, it would be very helpful if the entries where there are 
differences were marked as such directly in the table. Having to go back 
and forth between table and notes is really tough.

Also, it would probably be better to use a special value instead of 
U+FFFD for undefined values. Somewhere at the start, or in another spec, 
it can then say what that means. The reason for that is that in contexts 
other than final display, one wants other things to happen than just 
conversion to U+FFFD.
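What I have in mind is something like this sketch (names and the tiny table 
are hypothetical): the decoder reports a distinct sentinel for undefined 
bytes, and the caller chooses the policy, rather than having U+FFFD baked in.

```python
# Sentinel for bytes the encoding leaves undefined -- distinct from any
# real code point, unlike U+FFFD which a page could legitimately contain.
UNDEFINED = object()

# Tiny illustrative single-byte table; real tables cover 0x00-0xFF.
TABLE = {0x41: "A", 0x42: "B"}

def decode_single_byte(data, table):
    """Map each byte via the table, yielding UNDEFINED for unmapped bytes.

    The caller decides what to do: replace for display, raise for
    validation, count for diagnostics, and so on.
    """
    return [table.get(b, UNDEFINED) for b in data]

def to_display(units):
    """One possible policy: replace UNDEFINED with U+FFFD for display."""
    return "".join("\ufffd" if u is UNDEFINED else u for u in units)

units = decode_single_byte(b"\x41\x99", TABLE)
print(to_display(units))  # 'A\ufffd'
```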

Another point is that "platform" turns up a lot. It would be much easier 
to understand for outsiders if it read "Web platform".

Instead of writing "Define the finite list of encodings for the platform 
and obsolete the "CHARACTER SETS" registry.", please use some wording 
that makes it clear that your document does NOT obsolete the charset 
registry (even if it may make it irrelevant for Web browsers).

You write "Need to define decode a byte string as UTF-8, with error 
handling in a way that avoids external dependencies.". I'm really not 
sure why this would be needed. The Unicode consortium went over UTF-8 
decoding issues with a very fine comb many times. If you find a hair in 
their soup, they have to fix it. If not, duplicating the work doesn't 
help at all. What you might need is some glue text, because the Unicode 
spec is worded for various situations, not only final display on Web UAs.
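As an example of why glue text rather than a redefinition should suffice: 
existing implementations already expose Unicode-conformant UTF-8 error 
handling, including the streaming case, as this Python sketch shows.

```python
import codecs

# An incremental decoder buffers an incomplete sequence instead of
# emitting U+FFFD prematurely -- the behavior a streaming UA needs.
dec = codecs.getincrementaldecoder("utf-8")("replace")

part1 = dec.decode(b"\xe2\x82")          # incomplete U+20AC: nothing yet
part2 = dec.decode(b"\xac")              # completes EURO SIGN
tail = dec.decode(b"\xe2", final=True)   # truncated at end of stream

print(repr(part1))  # ''
print(repr(part2))  # '\u20ac'
print(repr(tail))   # '\ufffd'
```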

Last but not least, solving encoding conversion issues does not fix all 
problems. On a Japanese OS, I regularly see U+005C as a Yen symbol 
rather than as a backslash. I haven't looked at browser differences in 
this respect, but I'm sure they exist.
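To be clear about where the problem sits: the conversion layer is not at 
fault here, as a quick check with Python's shift_jis codec shows.

```python
# Byte 0x5C decodes to U+005C REVERSE SOLIDUS in Python's shift_jis
# codec; whether that *displays* as a backslash or a yen sign is a
# font/locale matter that no encoding spec can fix.
decoded = b"\x5c".decode("shift_jis")
print(decoded == "\u005c")  # True
```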

Regards,    Martin.

> The goal is to unify encoding handling across user agents for the web so
> legacy pages can be interpreted "correctly" (i.e. as expected by users).
>
> If you are interested in helping out testing (and reverse engineering)
> multi-octet encodings please let me know. Any other input is much
> appreciated as well.
>
> Kind regards,
>
>
> [1]<http://wiki.whatwg.org/wiki/Web_Encodings>
> [2]<http://annevankesteren.nl/2010/12/encodings-labels-tested>
> [3]<http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html>
>
>