Re: Registering a charset alias



I agree with what Ned says below. One can imagine (dream of) a 
'superset' registry at IANA, but for the moment, it doesn't exist.

Supersetting behavior, such as treating data labeled US-ASCII as 
ISO-8859-1, and data labeled ISO-8859-1 as windows-1252, may be a 
current, hard-to-change practice for HTML consumers (browsers and such). 
But this doesn't mean that producers should label windows-1252 data as 
ISO-8859-1 or US-ASCII, and it doesn't mean that consumers for other 
protocols do or should behave the same way as HTML consumers (XML 
consumers, as an example, definitely never should behave that way).

So it seems that the best place to describe this supersetting behavior 
is indeed the part of HTML5 that describes HTML consumers, while making 
sure that the part of HTML5 that describes HTML producers is very clear 
on the fact that only strictly 7-bit data should be labeled as US-ASCII, 
and so on.
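
To make the producer/consumer distinction concrete, here is a minimal 
sketch in Python. The helper name narrowest_label is hypothetical, and 
it assumes the bytes are known to be windows-1252 text; it is only an 
illustration of the labeling rule, not text proposed for HTML5.

```python
def narrowest_label(data: bytes) -> str:
    """Pick the narrowest honest label for bytes assumed to be windows-1252 text."""
    # Producer rule: only strictly 7-bit data may be labeled US-ASCII.
    if all(b < 0x80 for b in data):
        return "US-ASCII"
    # Nothing in 0x80-0x9F: the bytes mean the same thing in ISO-8859-1.
    if not any(0x80 <= b <= 0x9F for b in data):
        return "ISO-8859-1"
    # Bytes in 0x80-0x9F are windows-1252 specific (e.g. the Euro sign at 0x80).
    return "windows-1252"


print(narrowest_label(b"plain text"))           # US-ASCII
print(narrowest_label("café".encode("cp1252")))  # ISO-8859-1
print(narrowest_label("€5".encode("cp1252")))    # windows-1252

# Consumer side: the same byte decodes differently under the two charsets.
print(repr(b"\x80".decode("cp1252")))   # the Euro sign, U+20AC
print(repr(b"\x80".decode("latin-1")))  # a C1 control character, U+0080
```

A lenient HTML consumer may decode all three labels as windows-1252, 
but a producer following the rule above never relies on that.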

As an aside, GB2312, GBK, and GB18030 are indeed (at least close to) 
successive supersets, but the conversion tables differ significantly in 
size, especially for GB18030.
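
A quick way to see the (near-)superset relation is to test which 
characters each codec can encode. This is only a sketch using Python's 
built-in gb2312, gbk, and gb18030 codecs, which may differ in detail 
from other implementations (such as the Microsoft variant of GBK); the 
helper name encodable and the sample characters are my own choice.

```python
def encodable(text: str, codec: str) -> bool:
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Bytes produced as GB2312 decode identically under the two wider charsets.
sample = "中文"
raw = sample.encode("gb2312")
print(raw.decode("gbk") == sample, raw.decode("gb18030") == sample)

# A simplified character, a traditional character, and a non-BMP character.
for ch in ["中", "電", "😀"]:
    support = {c: encodable(ch, c) for c in ("gb2312", "gbk", "gb18030")}
    print(ch, support)
```

GB18030 can encode every Unicode code point, which is one reason its 
conversion tables are so much larger than those of the other two.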

Regards,    Martin.

On 2009/08/20 7:13, Ned Freed wrote:
>> So if I understand this data correctly IE does not treat ISO-8859-1 and
>> Windows-1252 the same?
>
> FYI, there are quite a few differences between iso-8859-1 and windows-1252. In
> summary, the windows variant elected to use the C1 region (0x80-0x9F) for a
> bunch of stuff, none of which appears in iso-8859-1. The Euro symbol at 0x80 is
> arguably the most significant difference in practice.
>
>> That is not my experience, but maybe I do not understand
>> the code page concept well enough.
>
> It may be that IE treats ISO-8859-1 the same as windows-1252 because ISO-8859-1
> is in some sense a subset. But you'd be well advised not to count on that
> behavior.
>
>>> I think most of our encodings don't lend themselves to the superset
>>> concept.  There're probably variations for individual code points even
>>> in closely related code pages.  GB18030 might be an exception there.
>
> I'd have to check to be sure, but I believe the Microsoft variant of GBK
> contains some stuff that isn't in GB18030. (Microsoft additions leaking into
> the subset charsets is a very common problem, especially in the CJK sets.)
>
> 				Ned
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp