
Re: Registration of new charset [ISO-2022-JP-2004]

To: Koichi Yasuoka <[email protected]>
Subject: Re: Registration of new charset [ISO-2022-JP-2004]
From: Martin Duerst <[email protected]>
Date: Thu, 28 Sep 2006 16:06:56 +0900
Cc: [email protected]
In-reply-to: <[email protected]>
List-Id: <[email protected]>
List-Owner: <mailto:[email protected]>
List-Subscribe: <mailto:[email protected]?subject=subscribe%20ietf-charsets>
List-Unsubscribe: <mailto:[email protected]?subject=unsubscribe%20ietf-charsets>
Message-hash: BE66C3B8AC57B95603D64F205BA424EB
Original-recipient: rfc822;[email protected]
References: <"Martin Duerst's message of <6.0.0.20.2.20060927185100.09a62c50"@localhost><[email protected]>
Spam-test: False ; 0.0 / 4.5 ; UNPARSEABLE_RELAY

At 00:28 06/09/28, Koichi Yasuoka wrote:
>Dear Martin,
>
>Thank you for your reply about the registration of ISO-2022-JP-2004.
>However, I almost give up the registration...

Please don't.

I didn't think this would happen so quickly,
and I was somehow expecting a public announcement, but as
of a few hours ago, IANA has listed the new reviewers on
their page (see http://www.iana.org/numbers.html#C).
To spare people the work of checking it, the relevant line
reads:
Character Sets    RFC2978     Expert Review (Primary Expert Ned Freed
                                   and Secondary Expert Martin Duerst)

Ned and I still need to figure out some details of how we split up
our work, but we definitely hope that we can move things forward more
quickly than in the (recent) past.

>>Is the preference
>>for "B" due to tradition? Or because on average, it leads to
>>shorter encodings?
>
>"B" is shorter on average than "Q", especially encoding Japanese
>names in From, To, and Cc fields.  For example, my name in Japanese
>is encoded as:
>
>=?ISO-2022-JP-2004?B?GyRCMEIyLDknMGwbKEI=?=
>=?ISO-2022-JP-2004?Q?=1B=24B0B2=2C9=270l=1B=28B?=
>
>So as ISO-2022-JP.  Please try other Japanese names.

Ok, I see. Probably best to say so in the registration.
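
For the record, the size difference in the example above can be
reproduced with a few lines of Python. This is only a sketch: the
helper names b_word and q_word are made up for this note, and the "Q"
encoder assumes the restrictive character set that RFC 2047 allows
inside a phrase (letters, digits, and "!*+-/"), which happens to
reproduce the quoted example byte for byte.

```python
import base64

# Bytes of the ISO-2022-JP-2004 form of the name quoted above:
# ESC $ B, eight bytes for four two-byte characters, then ESC ( B.
RAW = bytes.fromhex("1B24423042322C3927306C1B2842")

# Bytes that a "Q" encoded-word may carry unescaped when it appears
# inside a phrase such as a display name (assumed rule, see lead-in).
Q_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!*+-/"
)


def b_word(raw: bytes, charset: str) -> str:
    """Build a "B" (base64) encoded-word for the given raw bytes."""
    payload = base64.b64encode(raw).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def q_word(raw: bytes, charset: str) -> str:
    """Build a "Q" encoded-word for the given raw bytes."""
    parts = []
    for byte in raw:
        if byte in Q_SAFE:
            parts.append(chr(byte))
        elif byte == 0x20:
            parts.append("_")          # space is written as underscore
        else:
            parts.append(f"={byte:02X}")  # everything else is hex-escaped
    return f"=?{charset}?Q?{''.join(parts)}?="


b = b_word(RAW, "ISO-2022-JP-2004")
q = q_word(RAW, "ISO-2022-JP-2004")
print(len(b), b)
print(len(q), q)
```

For this name the "B" form is 43 characters and the "Q" form is 49,
mostly because each escape sequence contains bytes that "Q" has to
write as three characters each.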

>>Also, I think that any mention of "extension of ISO-2022-JP"
>>without explanations is a bit problematic, because it might
>>give the impression that implementations accepting iso-2022-jp
>>also will somehow work for this new encoding. In my understanding,
>>because new escape sequences are used, it is extremely difficult
>>to predict what might happen in such a case.
>
>I understand that ISO-2022-JP texts with "ESC $ B" and
>"ESC ( B" can be accepted by ISO-2022-JP-2004 decoder.
>It is problematic when "ESC $ @" or "ESC ( J" is used but
>they are very rare now.

That's not the case I was worrying about. I think it will
be very rare for some piece of software to only support
iso-2022-jp-2004, but not iso-2022-jp. The reverse is
much more likely, and feeding iso-2022-jp-2004 data
to an iso-2022-jp decoder is where things are extremely
difficult to predict.
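
To make the concern a bit more concrete, here is a rough Python sketch
that lists the escape sequences in a byte stream that an
ISO-2022-JP (RFC 1468) decoder has no definition for. The function
name unexpected_escapes and the second sample are invented for
illustration. The sketch only detects such sequences; what a
particular iso-2022-jp decoder does when it meets one is exactly the
part that cannot be predicted.

```python
import re

# The four designations defined for ISO-2022-JP in RFC 1468.
ISO_2022_JP_ESCAPES = {
    b"\x1b(B",  # ASCII
    b"\x1b(J",  # JIS X 0201-Roman
    b"\x1b$@",  # JIS C 6226-1978
    b"\x1b$B",  # JIS X 0208-1983
}

# Generic ISO/IEC 2022 escape sequence: ESC, any number of
# intermediate bytes (0x20-0x2F), then one final byte (0x30-0x7E).
ESCAPE = re.compile(rb"\x1b[\x20-\x2f]*[\x30-\x7e]")


def unexpected_escapes(data: bytes) -> list:
    """Return escape sequences in data that RFC 1468 does not define."""
    return [
        match.group()
        for match in ESCAPE.finditer(data)
        if match.group() not in ISO_2022_JP_ESCAPES
    ]


# The name from the earlier example: only ESC $ B and ESC ( B.
plain = b"\x1b$B0B2,9'0l\x1b(B"

# Invented sample: ESC $ ( Q designates JIS X 0213:2004 plane 1,
# followed by one two-byte character and a return to ASCII.
extended = b"\x1b$(Q!!\x1b(B"

print(unexpected_escapes(plain))
print(unexpected_escapes(extended))
```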

>>So I think both a more detailed description and a pointer to
>>machine-readable data would be highly appreciated by anybody who
>>wants to implement this.
>
>Yes, I agree that the machine-readable table for the
>conversion between ISO-2022-JP-2004 and ISO/IEC 10646 is
>highly appreciated, but I didn't get it when I proposed
>the registration.  Now I know "ISO-2022-JP-2004 vs Unicode
>mapping table" at http://x0213.org/codetable/iso-2022-jp-2004-std.txt

This looks great. Please include this pointer in the
registration. If necessary, you can mention that it is
not normative/official.
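
As an illustration of why machine-readable data helps implementers,
here is a minimal Python sketch of a line parser for such a table.
The line layout is an assumption on my side and should be checked
against the actual file: a code such as 3-2E22, whitespace, then
U+XXXX (or U+XXXX+YYYY for a sequence of code points), with "#"
starting a comment. The function name parse_mapping_line and the
sample lines are hypothetical.

```python
def parse_mapping_line(line: str):
    """Return (code, text) for a mapping line, or None if there is none.

    Assumed layout: "<code> U+XXXX[+YYYY...] # comment".
    """
    body = line.split("#", 1)[0].strip()   # drop any trailing comment
    if not body:
        return None                        # blank or comment-only line
    fields = body.split()
    if len(fields) < 2 or not fields[1].startswith("U+"):
        return None                        # no Unicode mapping given
    code_points = [int(h, 16) for h in fields[1][2:].split("+")]
    return fields[0], "".join(chr(cp) for cp in code_points)


# Sample lines written for this sketch, not copied from the table.
samples = [
    "## comment line",
    "3-2E22\tU+2000B\t# single code point outside the BMP",
    "X-0000\tU+3051+309A\t# placeholder code, two-code-point sequence",
]

for sample in samples:
    print(parse_mapping_line(sample))
```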

>>The 2000 version of JIS X 0213 also contains ISO-2022-JP-3.
>>What's the reason for leaving that out of the registration?
>>Is the reason that there were changes in the Unicode mappings
>>of JIS X 0213?
>
>I thought I would register them one by one.  First ISO-2022-JP-2004,
>second EUC-JIS-2004, third Shift_JIS-2004, fourth ISO-2022-JP-3,
>fifth EUC-JISX0213, sixth and the last Shift_JISX0213.  The latter
>three encodings' "Intended usage" should be "OBSOLETE".

Okay, that makes sense. But probably, it's better to speed
things up a bit by doing EUC-JIS-2004 and Shift_JIS-2004
together, and later doing all the obsolete ones together.

By the way, you said
   Intended usage: COMMON
in your initial registration form.

Regarding this, I find the following in http://www.ietf.org/rfc/rfc2978.txt:
   A charset should therefore be registered ONLY if it adds significant
   functionality that is valuable to a large community, OR if it
   documents existing practice in a large community.  Note that charsets
   registered for the second reason should be explicitly marked as being
   of limited or specialized use and should only be used in Internet
   messages with prior bilateral agreement.
It sounds to me as if
   Intended usage: LIMITED
might fit better. But I'm not totally familiar with the
usage patterns.

>>Are there any changes in Unicode/ISO 10646 mappings between
>>2003 and 2004? If yes, what?
>
>Do you mean "between 2000 and 2004"?

No, I explicitly wanted to ask for changes from 2003 to 2004.
The reason why I asked was that if the 2003 version introduces
the labels ISO-2022-JP-2003 and ISO-2022-JP-3-2003 (which you
propose to use as aliases), and there were no changes between
2003 and 2004, it is difficult to explain why to also introduce
the label ISO-2022-JP-2004.

In general, if a standard is updated or republished, there
are either changes that warrant a different label, in which
case the old labels should not be used as aliases, or there
are no relevant changes, and in this case, introducing
a new label is a bad idea.

>If so, I say yes.  UCS for
>2-93-27 was changed from 9B1D into 9B1C (well, the codepoint has
>much complicated history).

This looks like a near miss, based on two glyph shapes that look
very similar, with components that are at least occasionally
used interchangeably.
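
For readers less used to the notation: I am reading 2-93-27 as
plane 2, row 93, cell 27. In a 7-bit ISO 2022 encoding each of row
and cell is turned into a byte by adding 0x20. A small Python sketch
of that arithmetic follows; the function names are made up for this
note.

```python
def kuten_to_bytes(row: int, cell: int) -> bytes:
    """Convert a row/cell pair (1..94 each) to its two 7-bit bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in the range 1..94")
    return bytes([0x20 + row, 0x20 + cell])


def bytes_to_kuten(pair: bytes) -> tuple:
    """Inverse of kuten_to_bytes: two 7-bit bytes back to (row, cell)."""
    if len(pair) != 2 or not all(0x21 <= b <= 0x7E for b in pair):
        raise ValueError("expected two bytes in the range 0x21..0x7E")
    return pair[0] - 0x20, pair[1] - 0x20


code = kuten_to_bytes(93, 27)
print(code.hex().upper())      # 7D3B
print(bytes_to_kuten(code))    # (93, 27)
```

Since the character is in plane 2, these two bytes would appear after
the ESC $ ( P designation in ISO-2022-JP-2004.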

As far as I'm aware, there were much more drastic changes between
2000 and 2004. As an example, the 2000 version gives (31D3)
as the Unicode/ISO 10646 codepoint for a hiragana "ke" with a small ring.
This has been corrected to a composition of U+3051 and U+309A
in the 2004 version. There are five such hiragana examples and
nine katakana examples. There are also some Latin and Greek characters
with similar changes.
Also, the circled numbers from 21 to 50 have different Unicode/ISO
10646 mappings in the 2000 version and in the 2004 version.
And then there are quite a number of Kanji (I haven't counted them)
that contain some mappings in the 2000 version that had to be
fixed. As an example, the character numbered as 3-2E22  in
http://x0213.org/codetable/iso-2022-jp-2004-std.txt has a
code-point (AAA2) in the 2000 version, but the actual character
in Unicode is at U+2000B, and the 2004 version corrects this.
Basically, all Unicode code points that the 2000 version put into
parentheses are subject to change (and most of them actually
changed).
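
The two kinds of correction mentioned above can be checked with
Python's standard unicodedata module. This is just a sketch to show
why neither case fits a single four-digit value; the variable names
are mine.

```python
import unicodedata

# Hiragana "ke" with a small ring: a base letter plus a combining mark.
ke_with_ring = "\u3051\u309A"
for ch in ke_with_ring:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Unicode has no precomposed character for this pair, so NFC
# normalization leaves it as two code points.
nfc = unicodedata.normalize("NFC", ke_with_ring)
print(len(nfc))

# The Kanji example: U+2000B lies outside the Basic Multilingual
# Plane, so it needs a surrogate pair (four bytes) in UTF-16.
kanji = "\U0002000B"
print(f"U+{ord(kanji):04X}", len(kanji.encode("utf-16-be")))
```

So a mapping table for the 2004 version has to allow both sequences
of code points and code points above U+FFFF.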

So I agree with your assessment that any labels that refer to
the 2000 version should be classified as "Obsolete".

>Furthermore, the 2004 version of
>JIS X 0213 includes ten more characters than the 2000 version.

Were these new characters introduced in 2003 or in 2004?

Regards,     Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:[email protected]

Follow-Ups:
- Re: Registration of new charset [ISO-2022-JP-2004]
  - From: Erik van der Poel <[email protected]>
