Re: Encodings and the web
Hello Anne,
On 2011/12/20 19:59, Anne van Kesteren wrote:
> Hi,
>
> When doing research into encodings as implemented by popular user agents I
> have found the current standards lacking. In particular:
>
> * More encodings in the registry than needed for the web
That's not a surprise. I think there are two reasons. One is that the
main initial contributor tried to be more general than necessary, and
included things that aren't actually usable, such as just the
double-byte parts of some multibyte standards, and so on. The second is
that, believe it or not, there's actually stuff outside the Web.
There is absolutely no problem for a Web-related spec to say "these are
the encodings you should support in a Web UA, and not more". The reason
we didn't do something like this in the RFC 2070 or HTML4 timeframe is
that at that time it wasn't opportune yet to say "no more new encodings,
please use Unicode".
> * Error handling for encodings is undefined (can lead to XSS exploits,
> also gives interoperability problems)
We don't have security considerations for charset registrations, but we
probably should. If there are any specific issues, I'm sure we will find
a way to add them to the registry, so please send them here.
> * Often encodings are implemented differently from the standard
There are often minor differences. ICU has a huge collection of
variants. Getting rid of them may be impossible, unfortunately. As an
example, both Microsoft and Apple have long-standing traditions for some
differences, and neither may be very willing to change.
It would be good to know where and how your tables differ from "the
standard".
It would also be good to know what it is that you refer to by "the
standard". (I guess I would know if there were a "standard" for all
character encodings, but I don't know of such a thing.)
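To make the "differences" point concrete with an example of my own (not
from your document), here is the best-known case, shown with Python's
codecs standing in for a browser implementation:

```python
# A well-known divergence between registered tables and Web practice:
# browsers decode content labeled iso-8859-1 with the windows-1252
# table, so bytes 0x80-0x9F become punctuation rather than C1 controls.
data = b"\x80\x93\x99"  # windows-1252: euro sign, left double quote, trademark

print(data.decode("windows-1252"))  # punctuation characters
print(data.decode("iso-8859-1"))    # C1 control characters U+0080, U+0093, U+0099
```

A spec that only pointed at the registered iso-8859-1 table would not
describe what deployed user agents actually do with such bytes.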
> A year ago I did some research into encodings[1] and more detailed for
> single-octet encodings[2] and I have now taken that further into starting
> to define a standard[3] for encodings as they are to be implemented by
> user agents. The current scope is roughly defining the encodings, their
> labels and name, and how you match a label.
I have quickly looked at the document. Many single-octet encodings only
show a few rows. As an expert, I have a fairly good sense of where to
look for the rest of the table, but it would really be better if each
abbreviated table carried an explicit pointer to the table that
completes it.
Also, it would be very helpful if the entries that differ were marked as
such directly in the table. Having to go back and forth between the
table and the notes is really tough.
Also, it would probably be better to use a special value instead of
U+FFFD for undefined values. Somewhere at the start, or in another spec,
it can then say what that means. The reason for that is that in contexts
other than final display, one wants other things to happen than just
conversion to U+FFFD.
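To sketch what I mean (my own toy code and naming, not a proposal for
your spec's text): a decoder can report the position of each undefined
byte to its caller, and leave the choice of U+FFFD, rejection, or
something else to the context.

```python
# Sketch: a single-octet decoder that signals undefined bytes to the
# caller instead of silently baking U+FFFD into the output. A display
# layer can substitute U+FFFD; a security-sensitive consumer can reject.
def decode_reporting(data: bytes, table: dict[int, str]) -> tuple[str, list[int]]:
    out, errors = [], []
    for i, b in enumerate(data):
        ch = table.get(b)
        if ch is None:
            errors.append(i)       # record the offset of the undefined byte
            out.append("\ufffd")   # display fallback only
        else:
            out.append(ch)
    return "".join(out), errors

# Toy table covering only the ASCII range, so 0x81 is undefined.
table = {b: chr(b) for b in range(0x80)}
text, errs = decode_reporting(b"abc\x81d", table)
# errs tells the caller exactly where the hole was.
```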
Another point is that "platform" turns up a lot. It would be much easier
to understand for outsiders if it read "Web platform".
Instead of writing "Define the finite list of encodings for the platform
and obsolete the "CHARACTER SETS" registry.", please use some wording
that makes it clear that your document does NOT obsolete the charset
registry (even if it may make it irrelevant for Web browsers).
You write "Need to define decode a byte string as UTF-8, with error
handling in a way that avoids external dependencies.". I'm really not
sure why this would be needed. The Unicode consortium went over UTF-8
decoding issues with a fine-tooth comb many times. If you find a hair in
their soup, they have to fix it. If not, duplicating the work doesn't
help at all. What you might need is some glue text, because the Unicode
spec is worded for various situations, not only final display on Web UAs.
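For what it's worth, the Unicode-recommended behavior is already
observable in existing implementations; here is how Python's decoder
handles ill-formed input (one U+FFFD per maximal ill-formed subpart),
which I use only as an illustration of what is already specified:

```python
# An invalid lead byte is replaced by a single U+FFFD; valid
# neighbors are untouched.
print(b"\x41\xff\x42".decode("utf-8", "replace"))  # A<U+FFFD>B

# A truncated multi-byte sequence also collapses to one U+FFFD.
print(b"\xe2\x82".decode("utf-8", "replace"))

# With strict handling the same input raises instead, which is the
# kind of behavioral choice glue text would pin down for Web UAs.
try:
    b"\xff".decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```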
Last but not least, solving encoding conversion issues does not fix all
problems. On a Japanese OS, I regularly see U+005C as a Yen symbol
rather than as a backslash. I haven't looked at browser differences in
this respect, but I'm sure they exist.
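To be clear about where that problem lives (checked here with Python's
codec, which I assume matches most converters): the mapping itself is
unambiguous; it is legacy Japanese fonts that draw U+005C as a Yen sign.

```python
# Shift_JIS byte 0x5C decodes to U+005C (backslash), not U+00A5 (yen),
# at least in Python's codec. The Yen appearance is a font/rendering
# convention layered on top of that code point.
assert b"\x5c".decode("shift_jis") == "\u005c"

# A converter that emitted U+00A5 instead would break file paths,
# regexes, and JavaScript escapes in legacy pages, so the rendering
# quirk cannot be "fixed" at the encoding layer.
print(repr(b"C:\x5cUsers".decode("shift_jis")))
```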
Regards, Martin.
> The goal is to unify encoding handling across user agents for the web so
> legacy pages can be interpreted "correctly" (i.e. as expected by users).
>
> If you are interested in helping out testing (and reverse engineering)
> multi-octet encodings please let me know. Any other input is much
> appreciated as well.
>
> Kind regards,
>
>
> [1]<http://wiki.whatwg.org/wiki/Web_Encodings>
> [2]<http://annevankesteren.nl/2010/12/encodings-labels-tested>
> [3]<http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html>
>
>