
RE: Encoding Standard (mostly complete)



* Carefully define the algorithms to implement these encodings

Encodings are really icky and inconsistent.  Additional definitions are likely only to make actual implementations diverge further from each other.  IMO it would be better to point to the existing definitions than to try to do it all again.  In fact, isn't that what the charset registry was intended to help with?

I'd also really push UTF-8, and really discourage many of the other encodings.  If you really want to say which ones should be legal for HTML, I'd pick the smallest useful subset of encodings, rather than the superset of all the encodings that happen to be supported by any browser.

-Shawn

-----Original Message-----
From: Anne van Kesteren [mailto:annevk@opera.com] 
Sent: Tuesday, April 17, 2012 11:41 PM
To: Shawn Steele; ietf-charsets; Doug Ewell
Subject: Re: Encoding Standard (mostly complete)

On Tue, 17 Apr 2012 20:20:34 +0200, Doug Ewell <doug@ewellic.org> wrote:
> Shawn Steele <Shawn dot Steele at microsoft dot com> wrote:
>> I'm a little confused about what the purpose of the document is?
>
> I assume it was intended to document the encodings deemed permissible 
> in HTML5, which I guess is supposed to be synonymous with "the web 
> platform."

More or less, yes. Encodings to be used by HTML, CSS, browser implementations of XML, etc. As I explained before on this mailing list (http://mail.apps.ietf.org/ietf/charsets/msg02027.html), the idea is to:

* Make the encodings that can be supported a finite list
* Carefully define the labels for these encodings
* Carefully define the algorithms to implement these encodings
** Including error and end-of-file handling
* Carefully define the indexes for these encodings, including any poorly documented extensions
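
To make the points above concrete, here is a minimal Python sketch of a fixed label table plus a decoder with defined error handling. The label table, the `get_encoding` and `decode` helper names, and the choice of U+FFFD replacement are assumptions made for illustration; they are not taken from the draft.

```python
# Hypothetical label table. The labels and canonical names here are
# illustrative assumptions, not the list the standard defines.
LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "ibm866": "ibm866",
    "cp866": "ibm866",
    "866": "ibm866",
}

# Python codec used for each canonical name (also an assumption of this sketch).
PYTHON_CODECS = {"utf-8": "utf-8", "ibm866": "cp866"}

ASCII_WHITESPACE = "\t\n\x0c\r "


def get_encoding(label):
    """Map a label to a canonical encoding name, or None if it is not listed."""
    key = label.strip(ASCII_WHITESPACE).lower()
    return LABELS.get(key)


def decode(data, label):
    """Decode bytes, replacing malformed input with U+FFFD instead of failing."""
    name = get_encoding(label)
    if name is None:
        raise LookupError("unsupported encoding label: %r" % label)
    # errors="replace" also covers a multi-byte sequence cut off at end-of-file.
    return data.decode(PYTHON_CODECS[name], errors="replace")
```

With a finite table like this, an unknown label fails in one predictable way, and truncated input such as `b"caf\xc3"` decodes the same way everywhere instead of depending on each implementation's choices.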

The idea is to make the web platform completely predictable with respect to encodings rather than the morass it is now. This should help existing implementations compete more effectively as well as help new implementations enter the market more easily without significant reverse engineering costs.


> I was surprised by some of the choices of "permissible," such as 
> including ibm864 and ibm866 but none of the other, much more 
> widespread, legacy OEM code pages. I was also puzzled by the reference 
> to utf-16 and utf-16be as "legacy" encodings.

I'm not quite sure whether ibm864 and ibm866 should stay. They are not universally supported, but four out of five user agents have them, if I remember correctly. The list of encodings is based roughly on the intersection of what browsers support. If I missed an encoding that is actually "widely" used on pages, it would be good to add it, of course. My assumption has been that if only one browser supports an encoding, it is probably not used, or at least not widely used.

I classified utf-16 as legacy because of its many gotchas and because most web technology either works entirely with utf-8 or does not work with utf-16. E.g. form submission does not do utf-16, XMLHttpRequest only sends utf-8 encoded strings, and several new formats are utf-8 only.
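
One such gotcha is byte order, sketched here in Python as an illustration only (the sample string and variable names are assumptions): the same utf-16 bytes decode to different text depending on the endianness the reader assumes, unless a byte order mark is present. utf-8 has no equivalent ambiguity.

```python
import codecs

text = "hi"

# The same two characters, serialized in each byte order.
le = text.encode("utf-16-le")  # b'h\x00i\x00'
be = text.encode("utf-16-be")  # b'\x00h\x00i'

# Reading little-endian bytes as big-endian succeeds silently but yields
# unrelated characters (U+6800 and U+6900) rather than an error.
misread = le.decode("utf-16-be")

# A byte order mark tells the generic "utf-16" decoder which order to use.
with_bom = (codecs.BOM_UTF16_BE + be).decode("utf-16")

# utf-8 is a byte sequence with a single defined order.
utf8 = text.encode("utf-8")  # b'hi'
```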


--
Anne van Kesteren
http://annevankesteren.nl/