Re: Encodings and the web



On Wed, 21 Dec 2011 12:09:03 +0100, Martin J. Dürst  
<duerst@it.aoyama.ac.jp> wrote:
>> * Error handling for encodings is undefined (can lead to XSS exploits,
>> also gives interoperability problems)
>
> We don't have security considerations for charset registrations, but we  
> probably should. If there are any specific issues, I'm sure we will find  
> a way to add them to the registry, so please send them here.

See e.g. http://zaynar.co.uk/docs/charset-encoding-xss.html


>> * Often encodings are implemented differently from the standard
>
> There are often minor differences. ICU has a huge collection of  
> variants. Getting rid of them may be impossible, unfortunately. As an  
> example, both Microsoft and Apple have long-standing traditions for some  
> differences, and both may not be very willing to change.

We'll see. The same was said when we tackled the HTML parser  
interoperability mess.


> It would be good to know where and how your tables differ from "the  
> standard".
>
> It would also be good to know what it is that you refer to by "the  
> standard". (I guess I would know if there were a "standard" for all  
> character encodings, but I don't know of such a thing.)

When my document says "standard" it refers to itself. I have thought about  
including differences with respect to the IANA registry, but was hoping  
someone else would do that based on the data tables available.


> I have quickly looked at the document. Many single-octet encodings only  
> show a few rows. As an expert, I have a pretty sure feel of where to  
> look for the rest of the table, but it would really be better if there  
> was an explicit pointer to the table used for completion from the table  
> that lacked the rows.

The beginning of the section states that for rows that are missing, every
octet value maps to the code point with the same value. So if 80-8F ==
U+0080-U+008F, the row is simply omitted for brevity.
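
A minimal sketch of that convention, in Python (not something the document
itself uses). The names SPARSE_ROWS and decode_single_octet are hypothetical,
and the one table entry is an illustrative value, not copied from the
document's tables:

```python
# Hypothetical sparse table: only octets that differ from identity are listed.
# The single entry below is illustrative, not taken from the document.
SPARSE_ROWS = {
    0x80: 0x20AC,
}


def decode_single_octet(data: bytes, rows: dict[int, int]) -> str:
    """Decode with a sparse single-octet table.

    Octets missing from the table map to the code point with the same value.
    """
    return "".join(chr(rows.get(octet, octet)) for octet in data)


# 0x41 and 0x8F are not listed, so they map to U+0041 and U+008F.
print(ascii(decode_single_octet(b"A\x80\x8f", SPARSE_ROWS)))
```

With this shape, a table that lists nothing at all decodes every octet to
the code point of the same value.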


> Also, it would be very helpful if the entries where there are  
> differences were marked as such directly in the table. Having to go back  
> and forth between table and notes is really tough.

The plan is for the notes to go away as implementations align. Both
Mozilla and Opera are making some effort towards that.


> Also, it would probably be better to use a special value instead of  
> U+FFFD for undefined values. Somewhere at the start, or in another spec,  
> it can then say what that means.

Thank you, I have done this now.


> Another point is that "platform" turns up a lot. It would be much easier  
> to understand for outsiders if it read "Web platform".

Also done for now.


> Instead of writing "Define the finite list of encodings for the platform  
> and obsolete the "CHARACTER SETS" registry.", please use some wording  
> that makes it clear that your document does NOT obsolete the charset  
> registry (even if it may make it irrelevant for Web browsers).

Fair enough, done.


> You write "Need to define decode a byte string as UTF-8, with error  
> handling in a way that avoids external dependencies.". I'm really not  
> sure why this would be needed. The Unicode consortium went over UTF-8  
> decoding issues with a very fine comb many times. If you find a hair in  
> their soup, they have to fix it. If not, duplicating the work doesn't  
> help at all. What you might need is some glue text, because the Unicode  
> spec is worded for various situations, not only final display on Web UAs.

That is what HTML currently does (as referenced from the issue). It does  
not strike me as ideal for implementors.
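
To illustrate what "error handling" means here, a minimal sketch using
Python's built-in UTF-8 codec. Python is an assumption made for
illustration; neither HTML nor the document under discussion refers to it,
and decode_utf8 is a hypothetical helper name. It shows the two policies a
decoder has to choose between for malformed input: substitute U+FFFD or
reject the input.

```python
def decode_utf8(data: bytes, fatal: bool = False) -> str:
    """Decode bytes as UTF-8.

    Malformed input is replaced with U+FFFD unless fatal is set,
    in which case UnicodeDecodeError is raised.
    """
    return data.decode("utf-8", errors="strict" if fatal else "replace")


malformed = b"a\xffb"  # 0xFF can never appear in well-formed UTF-8

# Replacement policy: the bad octet becomes U+FFFD.
print(ascii(decode_utf8(malformed)))

# Fatal policy: the whole input is rejected.
try:
    decode_utf8(malformed, fatal=True)
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

How many U+FFFD characters a longer malformed sequence produces is the kind
of detail that implementations have differed on, so a sketch like this says
nothing about what any particular browser does.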


> Last but not least, solving encoding conversion issues does not fix all  
> problems. On a Japanese OS, I regularly see U+005C as a Yen symbol  
> rather than as a backslash. I haven't looked at browser differences in  
> this respect, but I'm sure they exist.

That particular issue is related to fonts in most browsers. I do not  
expect the Japanese fonts that display U+005C as Yen to change, but  
regardless it is out of scope for this document.


-- 
Anne van Kesteren
http://annevankesteren.nl/