
Indicating charset variants (was: RE: windows 936)



Hello Ira, others,

This is a long overdue reply.

At 02:07 07/05/23, McDonald, Ira wrote:
>Hi,
>
>Now that's an interesting idea, Martin.

Thanks!

>And "+" _is_
>legal in charset names, per the following quote from
>page 4 of RFC 2978:
>
>   Finally, charsets being registered for use with the "text" media type
>   MUST have a primary name that conforms to the more restrictive syntax
>   of the charset field in MIME encoded-words [RFC-2047, RFC-2184] and
>   MIME extended parameter values [RFC-2184].  A combined ABNF
>   definition for such names is as follows:
>
>     mime-charset = 1*mime-charset-chars
>     mime-charset-chars = ALPHA / DIGIT /
>                "!" / "#" / "$" / "%" / "&" /
>                "'" / "+" / "-" / "^" / "_" /
>                "`" / "{" / "}" / "~"
>     ALPHA        = "A".."Z"    ; Case insensitive ASCII Letter
>     DIGIT        = "0".."9"    ; Numeric digit

Yes, but that may not be good enough. XML spoils things.
The relevant production in the XML Recommendation doesn't
allow '+'. From http://www.w3.org/TR/REC-xml/#charencoding:

[80]    EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[81]    EncName      ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
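
To make the mismatch concrete, here is a minimal sketch in Python that
checks a candidate name against both grammars. The two regular
expressions are transcribed from the RFC 2978 ABNF quoted above and from
production [81]. The function names and the sample tags (including the
"--" form) are assumptions made for illustration only, not registered
names.

```python
import re

# mime-charset from RFC 2978: one or more of the listed characters.
MIME_CHARSET = re.compile(r"[A-Za-z0-9!#$%&'+\-^_`{}~]+")

# EncName from XML production [81]: a letter, then letters, digits,
# '.', '_' or '-'.
XML_ENCNAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_mime_charset(name: str) -> bool:
    """True if the whole name matches the RFC 2978 mime-charset syntax."""
    return MIME_CHARSET.fullmatch(name) is not None


def is_xml_encname(name: str) -> bool:
    """True if the whole name matches the XML EncName production."""
    return XML_ENCNAME.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical tags, used only to compare the two grammars.
    for tag in ("Shift_JIS", "Shift_JIS+cp932", "Shift_JIS--cp932"):
        print(tag, is_mime_charset(tag), is_xml_encname(tag))
```

Note that the two grammars differ in both directions: '+' is allowed
only by mime-charset, while '.' is allowed only by EncName.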

That leaves three ways ahead:
- Ignore XML. I don't think we want to go there.
- Try to change XML. A few years ago, that would have been
  easy with an erratum, but I don't think this will be met
  with cheers these days.
- Choose a separator different from '+'. After quite a bit of
  thinking, I have reached the conclusion that the obvious
  thing to do would be to use something like '--'.
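
As a rough sketch of how a receiver might handle the third option, the
Python below splits a tag into base and variant at the first "--". The
separator, the function name, and the sample tags are all assumptions
for illustration; nothing here is registered or agreed. It also assumes
that no existing base name contains "--" itself, which would need to be
checked against the registry.

```python
from typing import Optional, Tuple

# Assumed separator between base tag and variant (strawman only).
SEPARATOR = "--"


def split_charset_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split a tag into (base, variant); variant is None if absent."""
    base, sep, variant = tag.partition(SEPARATOR)
    return base, (variant if sep else None)


if __name__ == "__main__":
    # An application that does not care about variants keeps only the base.
    for tag in ("Shift_JIS", "Shift_JIS--cp932", "ISO-8859-1"):
        base, variant = split_charset_tag(tag)
        print(tag, "->", base, variant)
```

A single hyphen, as in ISO-8859-1, is left alone because only the
doubled form is treated as a separator.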

What does everybody think?

Regards,    Martin.



>Looking at the latest posted IANA Charset Registry
>plaintext, there are a few uses of "+" (for "+euro")
>in aliases (but never base names), but it's pretty 
>rare.  See:
>
>  http://www.iana.org/assignments/character-sets
>
>Cheers,
>- Ira
>
>Ira McDonald (Musician / Software Architect)
>Chair - Linux Foundation Open Printing WG
>Blue Roof Music / High North Inc
>PO Box 221  Grand Marais, MI  49839
>phone: +1-906-494-2434
>email: imcdonald@sharplabs.com
>
>-----Original Message-----
>From: Martin Duerst [mailto:duerst@it.aoyama.ac.jp]
>Sent: Monday, May 21, 2007 9:58 PM
>To: Erik van der Poel; Shawn Steele
>Cc: ietf-charsets@mail.apps.ietf.org
>Subject: Re: windows 936
>
>
>Dealing with minor differences and variants as notes is definitely
>one possibility. However, I think sooner rather than later, we
>should look at a more systematic way of indicating variants
>and extensions.
>
>Here is an extremely rough strawman:
>
>a) Identify a character that's okay in charset tags but rarely
>   used (e.g. '+', don't even know whether that's okay)
>b) Use this character to separate base tag and variants, e.g.
>   base tag: Shift_jis
>   tag with variant: Shift_jis+cp932
>
>Shift_jis would only indicate that this is some kind of shift_jis.
>Applications that don't care too much about variants would just
>use this. Shift_jis+cp932 indicates the variant with the Microsoft
>additions. Applications on the receiving end not interested in
>variants would have to cut off trailing '+' and what's after.
>
>The above proposal isn't without problems, but addresses the
>second most fundamental problem in the current scheme.
>
>(The first most fundamental problem is that stuff is often
>tagged wrongly. But that's a much harder problem than the variants.)
>
>Regards,    Martin.
>
>At 10:51 07/05/22, Erik van der Poel wrote:
>>Most of the Windows code pages are "supersets" of other standard sets.
>>But rather than adding new charset names for these supersets, it might
>>be better to add comments to the existing registrations to point out
>>the relationships between the various sets.
>>
>>For example, the windows-936 registration might refer to the gb2312
>>one, the windows-31j registration might refer to Windows Code Page 932
>>and the Shift_JIS registration, the EUC-KR registration might refer to
>>CP 949 and the Big5 registration to CP 950. All as informative
>>references, rather than normative, I think.
>>
>>This promotes interoperability while avoiding the addition of more
>>names and "virtual" aliases.
>>
>>Erik
>>
>>On 5/21/07, Shawn Steele <Shawn.Steele@microsoft.com> wrote:
>>>
>>>
>>>
>>>
>>> I am looking at the registrations for the remaining 4 "system" code pages:
>>> 932, 936, 949 & 950.  This seems complicated since IE uses other names for
>>> them.
>>>
>>>
>>>
>>> For example, for 936, IE recognizes Chinese, CN-GB, csGB2312, csGB231280,
>>> csISO58GB231280, GB2312, GB2312-80, GB231280, GBK, GB_2312_80, iso-ir-58,
>>> and, of course, it's known to the system as 936.
>>>
>>>
>>>
>>> Our APIs report this code page as being "gb2312"
>>>
>>>
>>>
>>> There is an existing registration for GBK, aliases of CP936, MS936 and
>>> windows-936, but not of the gb2312 name.  The existing registration points
>>> to broken links at Microsoft and IBM.  This should probably be updated to
>>> point to:
>>>
>>>
>>>
>>> http://www.microsoft.com/globaldev/reference/dbcs/936.mspx
>>>
>>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
>>> and
>>>
>>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit936.txt
>>>
>>>
>>>
>>> I am a bit uncertain that GBK == 936, although this is what the existing
>>> registration implies.
>>>
>>>
>>>
>>> The alternative solution would seem to be to register a new charset as
>>> "windows-936" with the same additional aliases as the GBK registration and
>>> point to the above tables.  This would then also lead to the question of
>>> whether GBK and gb2312 should be listed as aliases for any such windows-936
>>> code page although the interpretation of those aliases could differ for
>>> other systems.
>>>
>>>
>>>
>>> My goal is to clarify the Microsoft system code page mappings such as for
>>> 932, 936, 949 & 950, and I'd appreciate any suggestions about how to best do
>>> that :-)
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> - Shawn
>>>
>>>
>>>
>>> Shawn Steele
>>>
>>> SDE
>>>
>>> Windows International
>>>
>>> Microsoft
>>>
>>>
>
>
>#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>#-#-#  http://www.sw.it.aoyama.ac.jp     mailto:duerst@it.aoyama.ac.jp   
>
>


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst@it.aoyama.ac.jp