
Re: Encoding Standard (mostly complete)



On Wed, 18 Apr 2012 22:25:07 +0200, Shawn Steele  
<Shawn.Steele@microsoft.com> wrote:
> * Carefully define the algorithms to implement these encodings
>
> Encodings are really icky and inconsistent.  Additional definitions are  
> likely to only cause further divergence of actual implementations from  
> each other.  IMO it would be better to point to the already defined  
> definitions than trying to do it all again.  In fact that's what the  
> charset registry was intended to help with?

My experience is that if you define a feature in detail and write a test
suite, implementations will converge over time. E.g. it was once
controversial that HTML parsing could be defined and implemented in the  
same manner across browsers. (HTML parsers were really icky and  
inconsistent, and a lot more complicated than decoder/encoder algorithms  
if you look at their interaction with script execution.)

I can maybe explain why http://www.iana.org/assignments/character-sets  
does not help implementors.

This is its entry for shift_jis:

===
Name: Shift_JIS  (preferred MIME name)
MIBenum: 17
Source: This charset is an extension of csHalfWidthKatakana by
         adding graphic characters in JIS X 0208.  The CCS's are
         JIS X0201:1997 and JIS X0208:1997.  The
         complete definition is shown in Appendix 1 of JIS
         X0208:1997.
         This charset can be used for the top-level media type "text".
Alias: MS_Kanji
Alias: csShiftJIS
===

This does not tell you the other labels you need to recognize, such as
"shift-jis" or "x-sjis". It references an extremely old document that does
not detail error handling, end-of-file handling, a clear mapping to
Unicode, or its relation to other Japanese encodings. It does not detail
the extensions to shift_jis made by Microsoft that you need to implement
in order to work with sites. In short, it misses all the critical details.
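
To make the label problem concrete, here is a minimal sketch of the kind of
lookup an implementor ends up writing. The table is hypothetical and
deliberately incomplete: it holds only the labels named in this message, and
the function name get_encoding is an assumption for illustration, not
something the registry or any specification defines.

```python
# Hypothetical, incomplete label table: only the labels named in this
# message. A real implementation needs many more entries.
LABELS = {
    "shift_jis": "shift_jis",
    "shift-jis": "shift_jis",   # not listed in the registry entry
    "x-sjis": "shift_jis",      # not listed in the registry entry
    "ms_kanji": "shift_jis",
    "csshiftjis": "shift_jis",
}

ASCII_WHITESPACE = "\t\n\x0c\r "


def get_encoding(label):
    """Return the encoding name for a label, or None if it is unknown."""
    # Assumption: labels are matched case-insensitively after trimming
    # surrounding ASCII whitespace.
    key = label.strip(ASCII_WHITESPACE).lower()
    return LABELS.get(key)
```

With this table, get_encoding(" Shift-JIS ") and get_encoding("x-sjis") both
resolve to the same decoder, which is the behavior sites appear to rely on.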

Entries for euc-kr, gb_2312-80, ... are similarly unhelpful. The euc-kr
entry does not mention that you need to support Unified Hangul Code, as
Internet Explorer does, in order to work with Korean content, and the
gb_2312-80 entry does not mention that you should really use your gbk
decoder/encoder instead.
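
The gb_2312-80 point can be illustrated with Python's built-in codecs, used
here only as a stand-in for a browser decoder. This is a sketch under that
assumption; the helper try_decode is a hypothetical name. U+9F8D is a
traditional-form character that gbk covers and GB 2312-80 does not.

```python
def try_decode(data, codec):
    """Decode data with the named codec, or return None on failure."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# Bytes a page labeled "gb2312" might contain in practice: a character
# that exists in gbk but is outside the GB 2312-80 repertoire.
data = "\u9f8d".encode("gbk")

strict = try_decode(data, "gb2312")  # decoder limited to GB 2312-80
lenient = try_decode(data, "gbk")    # superset decoder
```

The strict decoder rejects the input while the gbk decoder recovers the
character, which is why treating the gb_2312-80 label as gbk is the safer
choice for real content.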

The registry is an interesting collection of entries, but does not help  
implementors.


> I'd also really push UTF-8, and really discourage many of the  
> encodings.  If you really want to say which ones should be legal for  
> HTML, I'd pick the smallest useful subset of encodings, rather than  
> superset of all of the encodings that happen to be supported by any  
> browser.

The Encoding Standard definitely does not define a superset.


From another reply:

On Wed, 18 Apr 2012 22:31:27 +0200, Shawn Steele  
<Shawn.Steele@microsoft.com> wrote:
> Well, maybe point to the defined ones then that you want to support in  
> HTML.  We can't change our behavior, so if your document happens to  
> diverge, it'll introduce additional confusion.  Current  
> cross-vendor/platform implementations already vary, but those ways  
> should be well understood by people who hit those issues.

No, they are not well understood. I do not know about Internet Explorer,
but browsers other than Internet Explorer continue to hit compatibility
issues in this part of their code and continue to make changes because of
them, without clear guidance thus far as to what the end goal ought to be
and what everyone else is aiming for.


-- 
Anne van Kesteren
http://annevankesteren.nl/