
Re: Encoding Standard (mostly complete)



On Tue, 17 Apr 2012 20:20:34 +0200, Doug Ewell <doug@ewellic.org> wrote:
> Shawn Steele <Shawn dot Steele at microsoft dot com> wrote:
>> I'm a little confused about what the purpose of the document is?
>
> I assume it was intended to document the encodings deemed permissible in
> HTML5, which I guess is supposed to be synonymous with "the web
> platform."

More or less, yes: the encodings to be used by HTML, CSS, browser
implementations of XML, and so on. As I explained earlier on this mailing
list (http://mail.apps.ietf.org/ietf/charsets/msg02027.html), the idea is to:

* Make the encodings that can be supported a finite list
* Carefully define the labels for these encodings
* Carefully define the algorithms to implement these encodings
** Including error and end-of-file handling
* Carefully define the indexes for these encodings, including any poorly  
documented extensions
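
To illustrate the first two points, here is a minimal Python sketch of
a label lookup against a closed table. The table is a tiny illustrative
subset and not the list from the document, and the names LABELS and
get_encoding are hypothetical. It assumes labels are matched after
trimming ASCII whitespace and lowercasing ASCII letters.

```python
import string

# Illustrative subset only; a real label table would be much longer.
LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "ibm866": "ibm866",
    "cp866": "ibm866",
}

ASCII_WHITESPACE = "\t\n\x0c\r "
# Lowercase A-Z only, so non-ASCII characters are never folded.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def get_encoding(label):
    """Return the canonical encoding name, or None for an unknown label."""
    normalized = label.strip(ASCII_WHITESPACE).translate(_ASCII_LOWER)
    return LABELS.get(normalized)
```

Because the table is finite, an unknown label yields None rather than
falling through to whatever the platform happens to support.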

The idea is to make the web platform completely predictable with respect  
to encodings rather than the morass it is now. This should help existing  
implementations compete more effectively as well as help new  
implementations enter the market more easily without significant reverse  
engineering costs.


> I was surprised by some of the choices of "permissible," such as
> including ibm864 and ibm866 but none of the other, much more widespread,
> legacy OEM code pages. I was also puzzled by the reference to utf-16 and
> utf-16be as "legacy" encodings.

I'm not quite sure whether ibm864 and ibm866 should stay. They are not
universally supported, but four out of five user agents have them, if I
remember correctly. The list of encodings is based roughly on the
intersection of what browsers support. If I missed an encoding that is
actually "widely" used on pages, it would of course be good to add it. My
assumption has been that if only one browser supports an encoding, it is
probably not used, or at least not widely.

I classified utf-16 as legacy because of its many gotchas and because most
web technology either works entirely with utf-8 or does not work with
utf-16. For example, form submission does not do utf-16, XMLHttpRequest
only sends utf-8 encoded strings, and several new formats are utf-8 only.
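
The message does not list the gotchas. Two commonly cited ones can be
sketched in Python as an illustration (the variable and function names
are hypothetical): without a byte order mark the same bytes decode to
different text depending on the assumed byte order, and input with an
odd number of bytes is an error.

```python
# "hi" as utf-16 little-endian, without a byte order mark.
data = "hi".encode("utf-16-le")

as_le = data.decode("utf-16-le")  # the intended text
as_be = data.decode("utf-16-be")  # same bytes, wrong byte order assumed


def decodes_cleanly(raw, encoding):
    """Return True if raw decodes under encoding without an error."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Dropping the last byte leaves half a code unit at the end.
truncated_ok = decodes_cleanly(data[:-1], "utf-16-le")
```

Here as_be is two unrelated CJK characters rather than "hi", and
truncated_ok is False. Neither problem arises with utf-8, which has no
byte order to guess.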


-- 
Anne van Kesteren
http://annevankesteren.nl/