
Re: Indicating charset variants (was: RE: windows 936)

To: Martin Duerst <duerst@it.aoyama.ac.jp>
Subject: Re: Indicating charset variants (was: RE: windows 936)
From: Erik van der Poel <erikv@google.com>
Date: Fri, 21 Sep 2007 06:52:49 -0700
Cc: McDonald Ira <imcdonald@sharplabs.com>,Shawn Steele <Shawn.Steele@microsoft.com>, ietf-charsets@mail.apps.ietf.org
In-reply-to: <6.0.0.20.2.20070523120348.06144de0@localhost>
List-Id: <ietf-charsets.mail.apps.ietf.org>

I don't think it's such a good idea. The Web has come a long way in
terms of labelling charsets. In the early days, very few people
bothered to insert the HTML <meta> with charset, and even fewer people
inserted the HTTP charset. Nowadays, around 74% of the documents in
Google's index have the meta charset.

The commonly used characters are currently being conveyed correctly
from human to human by using the common charset names on the wire.
If/when you start to introduce charset variant names that are not
understood by the clients, even the commonly used characters cannot be
viewed, let alone the rare characters supposedly enabled by these
variant names.

Of course, if we get all the clients to upgrade first, we won't have
this problem. But are these minor variants worth all that trouble?
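
To make the concern concrete, here is a minimal Python sketch, under stated assumptions: KNOWN_CHARSETS and resolve_charset are hypothetical names, and the '+' and '--' separators are the ones floated in this thread, not anything registered. A client that only does exact-name lookup (separators=()) gets nothing back for a variant label, while an upgraded client can fall back to the base name.

```python
# Hypothetical set of charset labels an existing client already understands.
KNOWN_CHARSETS = {"shift_jis", "gb2312", "euc-kr", "big5", "utf-8"}


def resolve_charset(label, separators=("+", "--")):
    """Return the known charset for a label, or None if it is not understood.

    With separators=() this models a client that only does exact lookup.
    With the default separators it models an upgraded client that drops
    a trailing variant part and retries with the base name.
    """
    name = label.strip().lower()
    if name in KNOWN_CHARSETS:
        return name
    for sep in separators:
        base, found, _variant = name.partition(sep)
        # Only accept the base if the separator was actually present.
        if found and base in KNOWN_CHARSETS:
            return base
    return None


if __name__ == "__main__":
    for label in ("Shift_JIS", "Shift_jis+cp932", "Shift_jis--cp932"):
        print(label, resolve_charset(label, separators=()), resolve_charset(label))
```

With exact lookup only, the variant labels resolve to None, so even the commonly used characters cannot be decoded; that is the failure mode described above.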

Erik

On 9/21/07, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> Hello Ira, others,
>
> This is a long overdue reply.
>
> At 02:07 07/05/23, McDonald, Ira wrote:
> >Hi,
> >
> >Now that's an interesting idea, Martin.
>
> Thanks!
>
> >And "+" _is_
> >legal in charset names, per the following quote from
> >page 4 of RFC 2978:
> >
> >   Finally, charsets being registered for use with the "text" media type
> >   MUST have a primary name that conforms to the more restrictive syntax
> >   of the charset field in MIME encoded-words [RFC-2047, RFC-2184] and
> >   MIME extended parameter values [RFC-2184].  A combined ABNF
> >   definition for such names is as follows:
> >
> >     mime-charset = 1*mime-charset-chars
> >     mime-charset-chars = ALPHA / DIGIT /
> >                "!" / "#" / "$" / "%" / "&" /
> >                "'" / "+" / "-" / "^" / "_" /
> >                "`" / "{" / "}" / "~"
> >     ALPHA        = "A".."Z"    ; Case insensitive ASCII Letter
> >     DIGIT        = "0".."9"    ; Numeric digit
>
> Yes, but that may not be good enough. XML spoils things.
> The relevant production in the XML Recommendation doesn't
> allow '+'. From http://www.w3.org/TR/REC-xml/#charencoding:
>
> [80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
> [81] EncName      ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
>
> Now there would be three ways ahead:
> - Ignore XML. I don't think we want to go there.
> - Try to change XML. A few years ago, that would have been
>   easy with an erratum, but I don't think this will be met
>   with cheers these days.
> - Choose a separator different from '+'. After quite a bit of
>   thinking, I have reached the conclusion that the obvious
>   thing to do would be to use something like '--'.
>
> What does everybody think?
>
> Regards,    Martin.
>
>
>
> >Looking at the latest posted IANA Charset Registry
> >plaintext, there are a few uses of "+" (for "+euro")
> >in aliases (but never base names), but it's pretty
> >rare.  See:
> >
> >  http://www.iana.org/assignments/character-sets
> >
> >Cheers,
> >- Ira
> >
> >Ira McDonald (Musician / Software Architect)
> >Chair - Linux Foundation Open Printing WG
> >Blue Roof Music / High North Inc
> >PO Box 221  Grand Marais, MI  49839
> >phone: +1-906-494-2434
> >email: imcdonald@sharplabs.com
> >
> >-----Original Message-----
> >From: Martin Duerst [mailto:duerst@it.aoyama.ac.jp]
> >Sent: Monday, May 21, 2007 9:58 PM
> >To: Erik van der Poel; Shawn Steele
> >Cc: ietf-charsets@mail.apps.ietf.org
> >Subject: Re: windows 936
> >
> >
> >Dealing with minor differences and variants as notes is definitely
> >one possibility. However, I think sooner rather than later, we
> >should look at a more systematic way of indicating variants
> >and extensions.
> >
> >Here is an extremely rough strawman:
> >
> >a) Identify a character that's okay in charset tags but rarely
> >   used (e.g. '+', don't even know whether that's okay)
> >b) Use this character to separate base tag and variants, e.g.
> >   base tag: Shift_jis
> >   tag with variant: Shift_jis+cp932
> >
> >Shift_jis would only indicate that this is some kind of shift_jis.
> >Applications that don't care too much about variants would just
> >use this. Shift_jis+cp932 indicates the variant with the Microsoft
> >additions. Applications on the receiving end not interested in
> >variants would have to cut off trailing '+' and what's after.
> >
> >The above proposal isn't without problems, but addresses the
> >second most fundamental problem in the current scheme.
> >
> >(The most fundamental problem is that stuff is often
> >tagged wrongly. But that's a much harder problem than the variants.)
> >
> >Regards,    Martin.
> >
> >At 10:51 07/05/22, Erik van der Poel wrote:
> >>Most of the Windows code pages are "supersets" of other standard sets.
> >>But rather than adding new charset names for these supersets, it might
> >>be better to add comments to the existing registrations to point out
> >>the relationships between the various sets.
> >>
> >>For example, the windows-936 registration might refer to the gb2312
> >>one, the windows-31j registration might refer to Windows Code Page 932
> >>and the Shift_JIS registration, the EUC-KR registration might refer to
> >>CP 949 and the Big5 registration to CP 950. All as informative
> >>references, rather than normative, I think.
> >>
> >>This promotes interoperability while avoiding the addition of more
> >>names and "virtual" aliases.
> >>
> >>Erik
> >>
> >>On 5/21/07, Shawn Steele <Shawn.Steele@microsoft.com> wrote:
> >>>
> >>>
> >>>
> >>>
> >>> I am looking at the registrations for the remaining 4 "system" code pages:
> >>> 932, 936, 949 & 950.  This seems complicated since IE uses other names for
> >>> them.
> >>>
> >>>
> >>>
> >>> For example, for 936, IE recognizes Chinese, CN-GB, csGB2312, csGB231280,
> >>> csISO58GB231280, GB2312, GB2312-80, GB231280, GBK, GB_2312_80, iso-ir-58,
> >>> and, of course, it's known to the system as 936.
> >>>
> >>>
> >>>
> >>> Our APIs report this code page as being "gb2312"
> >>>
> >>>
> >>>
> >>> There is an existing registration for GBK, aliases of CP936, MS936 and
> >>> windows-936, but not of the gb2312 name.  The existing registration points
> >>> to broken links at Microsoft and IBM.  This should probably be updated to
> >>> point to:
> >>>
> >>>
> >>>
> >>> http://www.microsoft.com/globaldev/reference/dbcs/936.mspx
> >>>
> >>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
> >>> and
> >>>
> >>>
> >>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit936.txt
> >>>
> >>>
> >>>
> >>> I am a bit uncertain that GBK == 936, although this is what the existing
> >>> registration implies.
> >>>
> >>>
> >>>
> >>> The alternative solution would seem to be to register a new charset as
> >>> "windows-936" with the same additional aliases as the GBK registration and
> >>> point to the above tables.  This would then also lead to the question of
> >>> whether GBK and gb2312 should be listed as aliases for any such windows-936
> >>> code page although the interpretation of those aliases could differ for
> >>> other systems.
> >>>
> >>>
> >>>
> >>> My goal is to clarify the Microsoft system code page mappings such as for
> >>> 932, 936, 949 & 950, and I'd appreciate any suggestions about how to best do
> >>> that :-)
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>
> >>>
> >>> - Shawn
> >>>
> >>>
> >>>
> >>> Shawn Steele
> >>>
> >>> SDE
> >>>
> >>> Windows International
> >>>
> >>> Microsoft
> >>>
> >>>
> >
> >
> >#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> >#-#-#  http://www.sw.it.aoyama.ac.jp     mailto:duerst@it.aoyama.ac.jp
> >
> >
> >
>
>
> #-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> #-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst@it.aoyama.ac.jp
>
>
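
The two grammars quoted in the thread (the mime-charset production from RFC 2978 and the EncName production from the XML Recommendation) can be compared mechanically. The following is a minimal Python sketch; the function names and the sample tags are illustrative assumptions, and the regular expressions are transcribed by hand from the productions as quoted.

```python
import re

# mime-charset from RFC 2978: one or more of ALPHA / DIGIT /
# ! # $ % & ' + - ^ _ ` { } ~
MIME_CHARSET = re.compile(r"[A-Za-z0-9!#$%&'+\-^_`{}~]+")

# EncName from XML production [81]: a letter, then letters, digits,
# '.', '_' or '-'.
XML_ENCNAME = re.compile(r"[A-Za-z][A-Za-z0-9._\-]*")


def is_mime_charset(name):
    """True if name matches the RFC 2978 mime-charset syntax."""
    return MIME_CHARSET.fullmatch(name) is not None


def is_xml_encname(name):
    """True if name matches the XML EncName syntax."""
    return XML_ENCNAME.fullmatch(name) is not None


def usable_in_both(name):
    """True if name can be used both as a MIME charset and in an XML declaration."""
    return is_mime_charset(name) and is_xml_encname(name)


if __name__ == "__main__":
    for tag in ("Shift_jis", "Shift_jis+cp932", "Shift_jis--cp932"):
        print(tag, is_mime_charset(tag), is_xml_encname(tag))
```

Under these two productions, a tag using '+' as the separator passes the MIME check but fails the XML one, while a tag using '--' passes both, which is the reasoning behind the '--' suggestion.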
