[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: Fwd: Last Call: UTF-16, an encoding of ISO 10646 to



Resend.

To: Harald@Alvestrand.no
Subject: RE: Fwd: Last Call: UTF-16, an encoding of ISO 10646 to
  Proposed
Cc: ietf_charsets@iana.org, kenw@birdie.sybase.com, mark.davis@us.ibm.com

Harald,

> 
> >Plane 2: SIP  ~47000 Han characters (Vertical Extension B) under ballot
> >               ~18500 still assignable code points
> 
> [haring off on an unrelated topic]
> one thing I'm not connected enough to have found out:
> Is this large set *new* characters, or are parts of the set used for 
> rolling back the unifications that created so much tension in and around 
> the IRG, for people who want that?
> 

Well, it is a large set of *old* characters. ;-)

These do not roll back the unifications, but instead constitute the
third tier of really, really, obscure Chinese characters, after
Vertical Extension A (the 6582 that just got added to the BMP for
Unicode 3.0).

The IRG is still observing the unification rules for this batch of
47,000. The largest part of these consist of all the other characters
from the big source dictionaries (like the KangXi dictionary) that
were not included in the first two sets. Most of them are truly obsolete
characters with one or a few citations in old Chinese classics, or they
constitute mistakes and variants that got into print (partly because
of the long history of wood-block printing in China) and from there
into the dictionary compendia. Many of the dictionary entries for these
consist of one-liners: XX See YY.  or XX Variant of YY.  or XX Mistake
for YY. etc.

However, included in this batch are also a number of modern-use
(and important) characters from other sources that didn't make it
into the first two batches -- including a group of characters on
the Hong Kong government's required list, additional corporate and
national characters from Japan, China, Vietnam, etc. The total number
of these is relatively small, but they are much more important for
general implementation, since they have usage outside of the text
entry of ancient Chinese (and other Han) classics.

The issue of "compatibility characters" has raised its ugly head
with the pending publication of JIS X 0213. A number of variants
that are already unified in Unicode are separately encoded in JIS X 0213
as required Japanese national variants (to match some legal requirements
for appearance of characters). This rather limited set was, in fact,
originally proposed by the UTC to be included among the compatibility
Han characters, but the issue was dropped during the development of
the URO. But the set has come back, and UTC has accepted them as a
group of compatibility characters (approximately 56 of them, if I
remember correctly). Taiwan and China are likely to come in with
an additional collection of compatibility characters to help in the
crossmapping of CNS and GB standards. But all participants in this
want to keep the repertoire of these small and manageable -- and all
will be clearly marked as compatibility variants of specified
unified characters. Nobody involved is, at this point, interested
in starting a wholesale rollback of unification. Note that these
compatibility variants are for the explicit variants and duplicates
in existing Asian standards -- we are *not* talking here about
coming up with a full set of encodings for "Chinese" Hanzi characters,
and another full set of encodings for "Japanese" Kanji characters,
and another full set of encodings for "Korean" Hanja characters.
Everybody knows that that way lies madness.

--Ken