Re: Encoding Standard (mostly complete)
On Wed, 18 Apr 2012 22:25:07 +0200, Shawn Steele
<Shawn.Steele@microsoft.com> wrote:
> * Carefully define the algorithms to implement these encodings
>
> Encodings are really icky and inconsistent. Additional definitions are
> likely to only cause further divergence of actual implementations from
> each other. IMO it would be better to point to the already defined
> definitions than trying to do it all again. In fact that's what the
> charset registry was intended to help with?
My experience is that by defining a feature in detail and writing a test
suite, implementations will converge over time. E.g. it was once
controversial that HTML parsing could be defined and implemented in the
same manner across browsers. (HTML parsers were really icky and
inconsistent, and a lot more complicated than decoder/encoder algorithms
if you look at their interaction with script execution.)
I can maybe explain why http://www.iana.org/assignments/character-sets
does not help implementors.
This is its entry for shift_jis:
===
Name: Shift_JIS (preferred MIME name)
MIBenum: 17
Source: This charset is an extension of csHalfWidthKatakana by
adding graphic characters in JIS X 0208. The CCS's are
JIS X0201:1997 and JIS X0208:1997. The
complete definition is shown in Appendix 1 of JIS
X0208:1997.
This charset can be used for the top-level media type "text".
Alias: MS_Kanji
Alias: csShiftJIS
===
This does not tell you the other labels you need to recognize, such as
"shift-jis" or "x-sjis". It references an extremely old document that does
not detail error handling, end-of-file handling, a clear mapping to
Unicode, or its relation to other Japanese encodings. It does not detail
the extensions to shift_jis made by Microsoft that you need to implement
in order to work with sites. It indeed misses all the critical details.
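To make the label problem concrete, here is a rough sketch of the kind of
lookup an implementation ends up needing. The table and the get_encoding
name are hypothetical and only cover the labels mentioned in this message;
a real table would be considerably longer.

```python
# Hypothetical label table: only the labels named in this message.
# The first three come from the registry entry; the last two do not
# appear there at all.
LABELS = {
    "shift_jis": "shift_jis",
    "ms_kanji": "shift_jis",
    "csshiftjis": "shift_jis",
    "shift-jis": "shift_jis",
    "x-sjis": "shift_jis",
}


def get_encoding(label):
    """Return the encoding name for a label, or None if it is unknown.

    Assumption for this sketch: labels are matched case-insensitively
    after trimming surrounding whitespace.
    """
    return LABELS.get(label.strip().lower())


if __name__ == "__main__":
    for label in ("Shift_JIS", "x-sjis", " SHIFT-JIS ", "bogus"):
        print(repr(label), "->", get_encoding(label))
```

An implementation that only followed the registry entry would stop at the
first three rows and fail on the last two.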
Entries for euc-kr, gb_2312-80, ... are similarly not helpful. The euc-kr
entry does not mention you need to support Unified Hangul Code as Internet
Explorer does in order to work with Korean content, and the gb_2312-80
entry does not mention you should really use your gbk decoder/encoder
instead.
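As an illustration of the gap, the following sketch routes those two
labels to a wider decoder. It assumes Python's built-in "gbk" and "cp949"
codecs as stand-ins for the gbk and Unified Hangul Code decoders; the
table and function names are made up for the example, and the choice of
errors="replace" is an assumption, not something the registry specifies.

```python
# Hypothetical mapping from a label to the decoder that content in the
# wild appears to need. "gbk" and "cp949" are Python's built-in codecs,
# used here as stand-ins.
DECODER_FOR_LABEL = {
    "gb_2312-80": "gbk",
    "euc-kr": "cp949",
}


def decode(label, data):
    """Decode bytes using the wider decoder selected for the label."""
    codec = DECODER_FOR_LABEL.get(label.strip().lower())
    if codec is None:
        raise LookupError("unknown label: %r" % label)
    return data.decode(codec, errors="replace")


def strict_gb2312_fails(data):
    """Return True if Python's strict gb2312 codec rejects the bytes."""
    try:
        data.decode("gb2312")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    # A character that gbk can encode but GB 2312 does not contain.
    sample = "龍".encode("gbk")
    print("strict gb2312 rejects it:", strict_gb2312_fails(sample))
    print("decoded via the gbk decoder:", decode("gb_2312-80", sample))
```

A decoder built strictly from what the gb_2312-80 entry references would
reject such bytes, while the gbk decoder handles them.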
The registry is an interesting collection of entries, but does not help
implementors.
> I'd also really push UTF-8, and really discourage many of the
> encodings. If you really want to say which ones should be legal for
> HTML, I'd pick the smallest useful subset of encodings, rather than
> superset of all of the encodings that happen to be supported by any
> browser.
The Encoding Standard definitely does not define a superset.
From another reply:
On Wed, 18 Apr 2012 22:31:27 +0200, Shawn Steele
<Shawn.Steele@microsoft.com> wrote:
> Well, maybe point to the defined ones then that you want to support in
> HTML. We can't change our behavior, so if your document happens to
> diverge, it'll introduce additional confusion. Current
> cross-vendor/platform implementations already vary, but those ways
> should be well understood by people who hit those issues.
No, they are not well understood. I do not know about Internet Explorer,
but browsers other than Internet Explorer continue to hit compatibility
issues in this part of their code and continue to make changes because of
it, without clear guidance thus far as to what the end goal ought to be
and what everyone else is aiming for.
--
Anne van Kesteren
http://annevankesteren.nl/