
RE: windows 936



I was asked about this again for a couple of code pages.  E.g., how could I clarify cp936 as a variant of GBK?  Similarly Shift_JIS and Windows-31J.

Thanks,
Shawn

-----Original Message-----
From: McDonald, Ira [mailto:imcdonald@sharplabs.com] 
Sent: Tuesday, May 22, 2007 10:07 AM
To: Martin Duerst; Erik van der Poel; Shawn Steele
Cc: ietf-charsets@mail.apps.ietf.org
Subject: RE: windows 936

Hi,

Now that's an interesting idea, Martin.  And "+" _is_ legal in charset names, per the following quote from page 4 of RFC 2978:

   Finally, charsets being registered for use with the "text" media type
   MUST have a primary name that conforms to the more restrictive syntax
   of the charset field in MIME encoded-words [RFC-2047, RFC-2184] and
   MIME extended parameter values [RFC-2184].  A combined ABNF
   definition for such names is as follows:

     mime-charset = 1*mime-charset-chars
     mime-charset-chars = ALPHA / DIGIT /
                "!" / "#" / "$" / "%" / "&" /
                "'" / "+" / "-" / "^" / "_" /
                "`" / "{" / "}" / "~"
     ALPHA        = "A".."Z"    ; Case insensitive ASCII Letter
     DIGIT        = "0".."9"    ; Numeric digit
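
As a rough illustration of the ABNF above (a minimal sketch, not part of RFC 2978; the name is_mime_charset is made up for this example), a candidate name such as "Shift_jis+cp932" can be checked against the permitted character set like this:

```python
import re

# Character class taken from the mime-charset ABNF quoted above.
# ALPHA is case-insensitive, so both upper and lower case are accepted.
# Only the character set is checked here; no length limit is assumed.
MIME_CHARSET_RE = re.compile(r"[A-Za-z0-9!#$%&'+\-^_`{}~]+")


def is_mime_charset(name: str) -> bool:
    """Return True if name matches mime-charset = 1*mime-charset-chars."""
    return MIME_CHARSET_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("Shift_jis+cp932", "windows-936", "Shift JIS", ""):
        print(repr(candidate), is_mime_charset(candidate))
```

Under this check "+" passes, while a space or an empty string does not.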


Looking at the latest posted IANA Charset Registry plaintext, there are a few uses of "+" (for "+euro") in aliases (but never base names), but it's pretty rare.  See:

  http://www.iana.org/assignments/character-sets

Cheers,
- Ira

Ira McDonald (Musician / Software Architect)
Chair - Linux Foundation Open Printing WG
Blue Roof Music / High North Inc
PO Box 221  Grand Marais, MI  49839
phone: +1-906-494-2434
email: imcdonald@sharplabs.com

-----Original Message-----
From: Martin Duerst [mailto:duerst@it.aoyama.ac.jp]
Sent: Monday, May 21, 2007 9:58 PM
To: Erik van der Poel; Shawn Steele
Cc: ietf-charsets@mail.apps.ietf.org
Subject: Re: windows 936


Dealing with minor differences and variants as notes is definitely one possibility. However, I think sooner rather than later, we should look at a more systematic way of indicating variants and extensions.

Here is an extremely rough strawman:

a) Identify a character that's okay in charset tags but rarely
   used (e.g. '+', don't even know whether that's okay)
b) Use this character to separate base tag and variants, e.g.
   base tag: Shift_jis
   tag with variant: Shift_jis+cp932

Shift_jis would only indicate that this is some kind of shift_jis.
Applications that don't care too much about variants would just use this. Shift_jis+cp932 indicates the variant with the Microsoft additions. Applications on the receiving end not interested in variants would have to cut off trailing '+' and what's after.
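
To make the strawman concrete, here is a minimal sketch of how a receiver might split such a tag. It assumes '+' is the separator and that everything after the first '+' is the variant; the function name split_charset_tag is hypothetical, and nothing like this is specified anywhere.

```python
from typing import Optional, Tuple


def split_charset_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split a tag like 'Shift_jis+cp932' into (base, variant).

    The variant is None when the tag contains no '+'.
    """
    base, sep, variant = tag.partition("+")
    return base, (variant if sep else None)


if __name__ == "__main__":
    # A receiver not interested in variants keeps only the base tag.
    base, variant = split_charset_tag("Shift_jis+cp932")
    print(base)     # Shift_jis
    print(variant)  # cp932
```

One consequence of this rule: any existing name that already contains '+' would also be cut at that point, so such names would need separate handling.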


The above proposal isn't without problems, but addresses the second most fundamental problem in the current scheme.

(The first most fundamental problem is that stuff is often tagged wrongly. But that's a much harder problem than the variants.)

Regards,    Martin.

At 10:51 07/05/22, Erik van der Poel wrote:
>Most of the Windows code pages are "supersets" of other standard sets.
>But rather than adding new charset names for these supersets, it might 
>be better to add comments to the existing registrations to point out 
>the relationships between the various sets.
>
>For example, the windows-936 registration might refer to the gb2312 
>one, the windows-31j registration might refer to Windows Code Page 932 
>and the Shift_JIS registration, the EUC-KR registration might refer to 
>CP 949 and the Big5 registration to CP 950. All as informative 
>references, rather than normative, I think.
>
>This promotes interoperability while avoiding the addition of more 
>names and "virtual" aliases.
>
>Erik
>
>On 5/21/07, Shawn Steele <Shawn.Steele@microsoft.com> wrote:
>>
>>
>>
>>
>> I am looking at the registrations for the remaining 4 "system" code pages:
>> 932, 936, 949 & 950.  This seems complicated since IE uses other 
>> names for them.
>>
>>
>>
>> For example, for 936, IE recognizes Chinese, CN-GB, csGB2312, 
>> csGB231280, csISO58GB231280, GB2312, GB2312-80, GB231280, GBK, 
>> GB_2312_80, iso-ir-58, and, of course, it's known to the system as 936.
>>
>>
>>
>> Our APIs report this code page as being "gb2312"
>>
>>
>>
>> There is an existing registration for GBK, aliases of CP936, MS936 
>> and windows-936, but not of the gb2312 name.  The existing 
>> registration points to broken links at Microsoft and IBM.  This 
>> should probably be updated to point to:
>>
>>
>>
>> http://www.microsoft.com/globaldev/reference/dbcs/936.mspx
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
>> and
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit936.txt
>>
>>
>>
>> I am a bit uncertain that GBK == 936, although this is what the 
>> existing registration implies.
>>
>>
>>
>> The alternative solution would seem to be to register a new charset 
>> as "windows-936" with the same additional aliases as the GBK 
>> registration and point to the above tables.  This would then also 
>> lead to the question of whether GBK and gb2312 should be listed as 
>> aliases for any such windows-936 code page although the 
>> interpretation of those aliases could differ for other systems.
>>
>>
>>
>> My goal is to clarify the Microsoft system code page mappings such as 
>> for 932, 936, 949 & 950, and I'd appreciate any suggestions about how 
>> to best do that :)
>>
>>
>>
>> Thanks,
>>
>>
>>
>> - Shawn
>>
>>
>>
>> Shawn Steele
>>
>> SDE
>>
>> Windows International
>>
>> Microsoft
>>
>>


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

