Re: Registration of new charset CP51932



Hi,

Still missing in this registration is one charset alias of the form
"csXxxx", where "Xxxx" is usually the primary name of the charset,
e.g., "csCP51932" in this case.

Section 2.3 on page 4 of IANA Charset Registration Procedures
(RFC 2978 / BCP 19) says:

   "All charsets MUST be assigned a name that provides a display string
   for the associated "MIBenum" value defined below.  These "MIBenum"
   values are defined by and used in the Printer MIB [RFC-1759].  Such
   names MUST begin with the letters "cs" and MUST contain no more than
   40 characters (including the "cs" prefix) chosen from from the
   printable subset of US-ASCII.  Only one name beginning with "cs" may
   be assigned to a single charset.  If no name of this form is
   explicitly defined IANA will assign an alias consisting of "cs"
   prepended to the primary charset name."

In Printer MIB v2 (RFC 3805), these "csXxxx" aliases were moved out
of the Printer MIB and into the IANA Charset MIB (RFC 3808).
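
As a rough illustration of the default rule quoted above, the sketch below
derives a "cs" alias from a primary charset name and checks the 40-character
and printable US-ASCII limits. The helper name default_cs_alias is
hypothetical, and treating the space character as disallowed is an assumption
of this sketch. It only prepends "cs" as the quoted text describes and does
not attempt any other normalization.

```python
import string

# Printable US-ASCII without whitespace. Excluding the space character
# is an assumption of this sketch, not something RFC 2978 spells out.
PRINTABLE = set(string.ascii_letters + string.digits + string.punctuation)


def default_cs_alias(primary_name: str) -> str:
    """Hypothetical helper: prepend "cs" to the primary charset name."""
    alias = "cs" + primary_name
    if len(alias) > 40:
        raise ValueError("alias exceeds 40 characters including the 'cs' prefix")
    if not set(alias) <= PRINTABLE:
        raise ValueError("alias contains characters outside printable US-ASCII")
    return alias


print(default_cs_alias("CP51932"))
```

For the charset under discussion this yields "csCP51932", the alias suggested
at the top of this message.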

Cheers,
- Ira (editor of IANA Charset MIB, RFC 3808).

Ira McDonald (Musician / Software Architect)
Chair - Linux Foundation Open Printing WG
Co-Chair - TCG Hardcopy WG
IETF Designated Expert - IPP & Printer MIB
Blue Roof Music/High North Inc
http://sites.google.com/site/blueroofmusic
http://sites.google.com/site/highnorthinc
mailto:blueroofmusic@gmail.com
winter:
  579 Park Place  Saline, MI  48176
  734-944-0094
summer:
  PO Box 221  Grand Marais, MI 49839
  906-494-2434



On Mon, Apr 5, 2010 at 6:15 AM, NARUSE Yui <naruse@airemix.jp> wrote:
> Thank you for comment,
>
> (2010/04/05 1:11), Ned Freed wrote:
>>
>> (In addition to the specific comments I've made below, I think a general
>> discussion of how to handle this whole CP932 mess is probably in order.)
>
> There are two strategies: override or option.
> HTML5 uses overrides. Browsers must handle existing documents that are
> mistakenly labeled as Shift_JIS or EUC-JP; in that situation, overriding
> is the right approach.
>
> By contrast, converters, databases, and programming languages have added
> the new charset as an option, and their users can choose between them.
>
> IANA charsets serve as the back end for libraries, so they should choose
> the option solution.
>
>>> Yes, CP51932 is suitable for use with subtypes of the "text"
>>> Content-Type. Note that CP51932 is a multi-octet charset.
>>> Care should be taken to choose an appropriate Content-Transfer-Encoding.
>>
>> While this is a generally true statement, it doesn't follow from the
>> charset being multi-octet. There are multi-octet charsets like
>> iso-2022-jp that are 7bit and hence require no special encoding at all,
>> others like utf-8 are 8bit, and still others like utf-16 require binary.
>>
>> I suggest changing this to "Since CP51932 is an 8bit charset, care should
>> be taken to choose an appropriate Content-Transfer-Encoding."
>
> I changed it.
>
>>> Published specification(s):
>>
>>> Uses ISO 2022 rules to select:
>>
>>> code set 0: US-ASCII (a single 7-bit byte set)
>>> * 0x5C is U+005C : REVERSE SOLIDUS (YEN SIGN)
>>
>> This makes no sense to me. A Reverse Solidus isn't a Yen sign.
>
> In JIS X 0201 (the Japanese version of ISO 646), 0x5C is YEN SIGN,
> and the 7bit area of Shift_JIS is JIS X 0201.
> But in the context of conversion to Unicode, for example when converting
> source code written in C or BASIC, mapping 0x5C to U+00A5 breaks its
> meaning as the escape character.
> So Microsoft treats 0x5C in Windows Codepage 932 as REVERSE SOLIDUS,
> but its glyph is YEN SIGN (bundled fonts such as MS Gothic are hacked so
> that the glyph for U+005C is YEN SIGN).
>
> "When is a backslash not a backslash?"
> http://blogs.msdn.com/michkap/archive/2005/09/17/469941.aspx
>
> I mentioned it in this section, but it belongs in Additional
> Information.
>
>>> * 0x7E is U+007E : TILDE
>>
>> AFAIK 0x7E is a tilde in US-ASCII, so why are you calling this out?
>
> JIS X 0201's 0x7E is OVERLINE, so I made the distinction explicit.
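
To make the 0x5C and 0x7E point concrete, here is a minimal sketch that
decodes the same 7-bit octets under the two policies described above: a
JIS X 0201 style mapping (0x5C is YEN SIGN, 0x7E is OVERLINE) and the
Windows Codepage 932 style mapping (both octets keep their US-ASCII
meaning). The helper decode_7bit and the two table names are hypothetical
and exist only for this illustration; they are not taken from any real
converter.

```python
# Only the two octets discussed in the thread differ between the tables;
# every other octet below 0x80 maps to the code point of the same value.
JIS_X_0201_OVERRIDES = {0x5C: "\u00A5", 0x7E: "\u203E"}  # YEN SIGN, OVERLINE
CP932_OVERRIDES = {}  # 0x5C stays U+005C, 0x7E stays U+007E


def decode_7bit(data: bytes, overrides: dict) -> str:
    """Hypothetical helper: decode octets below 0x80 with the given overrides."""
    out = []
    for octet in data:
        if octet >= 0x80:
            raise ValueError("this sketch only handles 7-bit octets")
        out.append(overrides.get(octet, chr(octet)))
    return "".join(out)


# A line of C source containing a backslash escape (octet 0x5C).
source = b'printf("ok\\n");'
print(decode_7bit(source, CP932_OVERRIDES))
print(decode_7bit(source, JIS_X_0201_OVERRIDES))
```

Under the second mapping the escape sequence no longer contains U+005C, so a
tool that processes the converted text would no longer see an escape
character there.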
>
>>> code set 1: Microsoft Standard Character Set (a double 8-bit byte set)
>>
>> Which Microsoft standard character set? I assume it's CP932, but this
>> needs to be stated explicitly.
>>
>>> * JIS X 0208-1983
>>> * NEC special characters (Row 13)
>>> * NEC selection of IBM extensions (Rows 89 to 92)
>>
>>> code set 2: Halfwidth Katakana (a single 7-bit byte set)
>>> JIS X 0201-1976
>>> requiring SS2 as the character prefix
>>
>> You're approaching this backwards, which is unfortunately a pretty common
>> problem with these ISO 2022-based charsets, including some existing
>> registrations.
>>
>> Charsets are mappings from octets to characters and should be specified
>> as such. What you're doing here is using the ISO approach, which defines
>> things in terms of one or more coded character sets and then describes how
>> to encode them using a character encoding scheme.
>>
>> I suggest that this be flipped around, to say something like:
>>
>> Octets with the high bit clear specify single US-ASCII characters, while
>> octets with the high bit set encode characters from the Microsoft Coded
>> Character Set 932 by combining the bits from the two octets ...
>>
>> The problem with the ISO 2022 approach is that once you say you're using
>> ISO 2022 you then have to profile what parts of ISO 2022 are allowed. ISO
>> 2022 in full generality is extremely complex and essentially nobody
>> supports all of it. (And EUC charsets are some of the most limited
>> profiles of all - they assume fixed bindings of coded character sets to
>> C0-1/G0-3 and only use the "shift next character" control sequences.)
>>
>> I suggest you check out the specification of iso-2022-jp or any of the
>> other iso-2022-* variants in order to see how to write this sort of
>> description.
>
> I changed it to:
>   Octets with the high bit clear specify single US-ASCII characters, while
>   octets with the high bit set encode characters from the Windows Codepage
>   932 by combining the bits from the two octets, except when the first
>   octet is 0x8E, which represents Halfwidth Katakana.
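
The octet-level description above can be sketched as a small segmenter. This
is only an illustration under stated assumptions: the helper name split_units
and the unit labels are hypothetical, every unit that starts with a high-bit
octet (including the 0x8E case) is assumed to be two octets long, and no
range checking or Unicode mapping is attempted.

```python
def split_units(data: bytes) -> list:
    """Hypothetical helper: split octets into (kind, octets) units."""
    units = []
    i = 0
    while i < len(data):
        octet = data[i]
        if octet < 0x80:
            # High bit clear: a single US-ASCII character.
            units.append(("ascii", data[i:i + 1]))
            i += 1
            continue
        # High bit set: assume a two-octet unit (see the lead-in).
        if i + 1 >= len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        kind = "halfwidth-katakana" if octet == 0x8E else "double-byte"
        units.append((kind, data[i:i + 2]))
        i += 2
    return units


# One ASCII octet, one double-byte unit, one 0x8E-prefixed unit.
sample = b"A" + b"\xA4\xA2" + b"\x8E\xB1"
print(split_units(sample))
```

A real decoder would additionally validate the trailing octet of each unit
and look the unit up in the Windows Codepage 932 mapping.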
>
>>> Additional information:
>>
>>> This is a request for a new registration of this charset.
>>
>>> CP51932 is real implementation of EUC-JP mostly used by Web Browsers.
>>
>> I think what you're trying to say is that these browsers interpret EUC-JP
>> (which is already registered and differs in some details) as the charset
>> you describe here rather than what's actually registered under the name
>> EUC-JP. If so, you need to make that much clearer.
>
> I changed the description to:
>   Web browsers are the typical users of CP51932. When web browsers load
>   a page which is declared or auto-detected as "EUC-JP", they don't
>   interpret it as the true EUC-JP registered in IANA Character Sets but as
>   CP51932. When they post form data as "EUC-JP", the data is also
>   encoded as CP51932.
>
>> I also think some mention needs to be made of JIS X 0212 here and the
>> apparent lack of a binding of it to the G3 range (which is present in
>> EUC-JP). While I have no problem with dropping JIS X 0212 support -
>> support for which is sporadic at best - the rationale for not having it
>> needs to be called out.
>
> I added this description:
>   CP51932 is a variant of EUC-JP (as Windows-31J is of Shift_JIS).
>   This charset differs from EUC-JP in that:
>     * CP51932 doesn't support JIS X 0212
>     * CP51932 supports characters extended by Windows Codepage 932
>     * The Unicode mappings of some characters are different
>
>
>>> Internet Explorer gives a reference implementation.
>>> Firefox, Safari, Opera, and Google Chrome also support this.
>>> They refer to this charset by the name "EUC-JP".
>>> http://coq.no/character-tables/mime/euc/en
>>
>> I'm not sure references to incorrect definitions of other charsets are
>> appropriate or useful.
>
> I removed it.
>
>
> The registration now reads as follows:
> ----------
> Charset name: CP51932
>
> Charset aliases: (none)
>
> Suitability for use in MIME text:
>
>   Yes, CP51932 is suitable for use with subtypes of the "text"
>   Content-Type. Since CP51932 is an 8bit charset, care should be
>   taken to choose an appropriate Content-Transfer-Encoding.
>
> Published specification(s):
>
>   Octets with the high bit clear specify single US-ASCII characters, while
>   octets with the high bit set encode characters from the Windows Codepage
>   932 by combining the bits from the two octets, except when the first
>   octet is 0x8E, which represents Halfwidth Katakana.
>
>   For the meaning and Unicode mapping of each character, refer to
>   Windows Codepage 932.
>   http://msdn.microsoft.com/en-us/goglobal/cc305152.aspx
>
> ISO 10646 equivalency table:
>
>   http://cpansearch.perl.org/src/NARUSE/Encode-EUCJPMS-0.07/ucm/cp51932.ucm
>
> Additional information:
>
>   This is a request for a new registration of this charset.
>
>   CP51932 is a variant of EUC-JP (as Windows-31J is of Shift_JIS).
>   This charset differs from EUC-JP in that:
>     * CP51932 doesn't support JIS X 0212
>     * CP51932 supports characters extended by Windows Codepage 932
>     * The Unicode mappings of some characters are different
>
>   Web browsers are the typical users of CP51932. When web browsers load
>   a page which is declared or auto-detected as "EUC-JP", they don't
>   interpret it as the true EUC-JP registered in IANA Character Sets but as
>   CP51932. When they post form data as "EUC-JP", the data is also
>   encoded as CP51932.
>
>   The name "CP51932" is in use in the following applications:
>     * Citrus iconv (used by NetBSD and DragonFly)
>     * patched GNU libiconv in FreeBSD ports
>     * Mojikan http://www.mirai-ii.co.jp/moji/mojikan/
>     * nkf 2.0.5
>     * PHP 5.2.1
>     * Ruby 1.9.1
>     * Encode-EUCJPMS-0.06
>
>   Moreover, applications which use MLang.DLL or .NET Framework for
>   converting "EUC-JP" implicitly use this charset.
>
>   So this charset is widely used, but doesn't have its own name.
>   The intended use of this name is to override the implementation of
>   EUC-JP, or for charset conversion.
>   http://wiki.whatwg.org/wiki/Web_Encodings
>   http://www.w3.org/Bugs/Public/show_bug.cgi?id=7444
>
>   The name is not "Windows-51932" because some of the applications which
>   accept the name "CP51932" don't support the name "Windows-51932".
>
>   CP51932 is intended for importing legacy data.
>   UTF-8 is preferred to CP51932 for new systems.
>
>   Related references are:
>
>     "Remarks" of "GetEncodings Method" of "System.Text"
>     http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx
>
>     "UnicodeによるJIS X0213実装入門―情報システムの新たな日本語処理環境"
>     日経BPソフトプレス, ISBN 978-4891006082, 2008, p. 17-18, 20, 120-158
>
>     CP51932 - Legacy Encoding Project
>     http://legacy-encoding.sourceforge.jp/wiki/index.php?cp51932
>
>   This charset is also known as Windows Codepage 51932.
>
> Person & email address to contact for further information:
>
>   NARUSE, Yui
>   Email: naruse@airemix.jp
>
> Intended usage: LIMITED USE
>
> --
> NARUSE, Yui  <naruse@airemix.jp>
>