Some thoughts ..




Martin Duerst commented ... "other thoughts / opinions are welcome ... "

Here are some points I would like to see addressed ..
============
One of the things that needs to be nailed down is what this IANA charset
registry is about...

a. The definition of the charset being registered:
If we take the definition to be something like:
"For each hex value in the charset, the character assigned to it --
identified by character name (pre UCS - Unicode/10646) or by U+xxxxxx
(post UCS)."
then the resource provided in support of the registration should at least
support the above definition.

Discussion about mappings between charsets is another topic ... if you
think about it. There are different criteria one could apply during mapping
between charsets --- UCS can be viewed as another charset from this
perspective.

For historical reasons, some characters in the non-UCS sets have multiple
names ... for example: overbar vs. macron, or apostrophe vs. single quote,
in earlier Latin-1 charset standards.  In practice these had to be dealt
with using additional aliasing / equivalencing information, which could not
easily be documented with just one U+xxxxxx value assignment.

For example, see IBM CDRA's chapter on difference management:
http://www-304.ibm.com/jct03002c/software/globalization/cdra/chapter6.jsp
for one perspective.
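
To make the multiple-naming point concrete, here is a minimal sketch of what
a registration table with aliasing information might look like. The Entry
type, the TABLE contents and lookup_by_name are my own hypothetical
illustration, not a proposed registry format; the two rows simply use the
Latin-1 examples mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One row of a hypothetical charset definition table."""
    ucs: int             # the single U+xxxx assignment
    name: str            # primary character name
    aliases: tuple = ()  # historical alternative names (assumed, illustrative)


# Hypothetical two-row excerpt keyed by the hex value in the charset.
TABLE = {
    0x27: Entry(0x0027, "APOSTROPHE", ("SINGLE QUOTE",)),
    0xAF: Entry(0x00AF, "MACRON", ("OVERBAR",)),
}


def lookup_by_name(name):
    """Return every code value whose primary name or alias matches."""
    wanted = name.upper()
    return sorted(
        code
        for code, entry in TABLE.items()
        if wanted == entry.name or wanted in entry.aliases
    )
```

A single U+xxxx column cannot carry the alias information by itself; in this
sketch it lives in a separate field, so a lookup for "overbar" and a lookup
for "macron" both land on the same code value.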

b. Registration of mappings between charsets ... i.e. from one set of code
points to another set of code points ... is an art of its own, and I am not
sure the charset registry is meant to include such mappings.  Also, if one
uses UCS as an intermediary for going from one charset to another instead
of a direct mapping, one can arrive at different results depending on the
criterion used for the difference management.
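
As a rough sketch of how the two routes can diverge: the tables below are
made-up toy data (not taken from any real code page), where the source
charset has a left single quotation mark that the target charset lacks. The
UCS route is assumed to use an 'exact match or substitute' criterion, and
the direct table is assumed to use a hand-tuned 'best fit' criterion.

```python
SUB = 0x1A  # substitute control used for unmappable characters

# Hypothetical toy tables, for illustration only.
SRC_TO_UCS = {0x41: 0x0041, 0xA1: 0x2018}  # source charset -> UCS
UCS_TO_DST = {0x0041: 0x41, 0x0027: 0x27}  # UCS -> target, exact matches only
DIRECT = {0x41: 0x41, 0xA1: 0x27}          # direct table with a best-fit entry


def via_ucs(data: bytes) -> bytes:
    """Convert through UCS; anything without an exact match becomes SUB."""
    return bytes(
        UCS_TO_DST.get(SRC_TO_UCS.get(b, 0xFFFD), SUB) for b in data
    )


def direct(data: bytes) -> bytes:
    """Convert with the direct table, which applies its own criterion."""
    return bytes(DIRECT.get(b, SUB) for b in data)


source_text = bytes([0x41, 0xA1])
print(via_ucs(source_text))  # the quotation mark is replaced by SUB
print(direct(source_text))   # the quotation mark is folded to an apostrophe
```

Neither result is 'wrong'; they simply reflect different difference
management criteria applied to the same two charsets.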

If we restrict our definition to the first (a. above), then the fallback
mappings from UCS back to the charset being defined should enter the
picture only to address the possible multiple naming choices in the
(especially older) non-UCS charsets.

The IANA Charset registry first has to nail down what is being registered.
Earlier registrations were restricted to the one-way mapping from the
charset's code points to character names, while later registrations are
required to be associated with the corresponding U+xxxxxx values.

c. Control character sets are equally variable ... in the sense that
multiple assignments can be made in ISO 2022 based sets by invoking
different C-sets as needed.  There are defaults, such as those specified in
ISO 6429 (the same as the ASCII C0 set, plus a default C1 set), that apply
when no such C-set invocation is present.  With the so-called 'ASCII
extended' sets, as in the PC DOS or Windows charsets, we get dual use of the
C0 zone for graphics and controls, use of the C1 zone for graphics etc.  We
also run into multiple usages, such as DOS using C0 x1A for end of file
instead of as a SUB, and x7F becoming a SUB instead of being a 'DEL', or
something like that.  We also run into CRLF vs. LF vs. NEL etc. for line
endings, and into dual use of one hex value for YEN / WON and Backslash
(Syntactic vs. Linguistic use), since many have hardcoded that hex value
for file separators and the like, and so on.  What you will encounter in
the myriad mapping tables that exist, even between two defined charsets,
are the variations made to accommodate such customizations ..
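
One way to picture these customizations is as small override tables layered
on top of a base definition. The sketch below is purely illustrative: the
variant names ("dos-text", "yen-at-5c") and the descriptive strings are my
own assumptions, not registered labels or official character names.

```python
# Base reading of three byte values (ASCII-style interpretation).
BASE = {0x1A: "SUBSTITUTE", 0x5C: "REVERSE SOLIDUS", 0x7F: "DELETE"}

# Hypothetical variants: each one overrides only the bytes it reads differently.
VARIANTS = {
    "dos-text": {0x1A: "END OF FILE MARKER"},
    "yen-at-5c": {0x5C: "YEN SIGN"},
}


def interpret(byte, variant=None):
    """Return how a byte value is read under the base table or a variant."""
    overrides = VARIANTS.get(variant, {})
    return overrides.get(byte, BASE.get(byte, "UNDEFINED"))


def differing_bytes(variant):
    """List the byte values a variant reads differently from the base."""
    return sorted(
        b for b, name in VARIANTS[variant].items() if BASE.get(b) != name
    )
```

Every such variant differs from the base in only a handful of positions,
which is exactly why so many near-identical mapping tables exist for what
is nominally the same pair of charsets.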

d. Another thing that has to be nailed down is:
      When one uses charset=label, what are its semantics?  Is it that the
hex values for the characters (graphic or control ??) used in the text of
this document are per the definition of 'label' in the IANA registry ...
      or that I have used a converter named 'label' (to or from UCS) to
generate the data in this document?
      (And if I have converted directly from a Latin-1 EBCDIC page to the
Latin-1 8859-1 page, without UCS as an intermediary, how would I label the
data?)

      If it is the former, then the 'label space' can be managed ..  if it
is the latter, 'all the vagaries / variants of conversions' need to be
accommodated in the 'labeling space'.
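
A back-of-the-envelope sketch of why the second reading inflates the label
space: the list of conversion criteria and the "charset+variant" label
syntax below are hypothetical, invented only to show the growth.

```python
from itertools import product

CHARSETS = ["ISO-8859-1", "IBM037"]  # two charset labels, as an example
# Hypothetical conversion criteria a converter might implement.
CONVERSION_VARIANTS = ["round-trip", "best-fit", "substitute"]


def labels_as_definitions():
    """Reading 1: a label names the code table the bytes conform to."""
    return list(CHARSETS)


def labels_as_converters():
    """Reading 2: a label names the converter that produced the bytes."""
    return [f"{cs}+{v}" for cs, v in product(CHARSETS, CONVERSION_VARIANTS)]


print(len(labels_as_definitions()))  # one label per charset
print(len(labels_as_converters()))   # one label per charset per criterion
```

Under the first reading the number of labels grows with the number of
charsets; under the second it grows with charsets times conversion
variants, and the variants are open-ended.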

Jumping directly into what the resource format defining a charset should
contain, without nailing down some of the above points, is in my view what
is giving rise to many of the points raised recently on this list about
fallbacks, best fits etc.

Best regards, Uma





V.S. UMAmaheswaran, Ph.D.
Globalization Centre of Competency, IBM Toronto Lab
A2/SZ8, 8200 Warden Avenue, Markham, ON, Canada, L6G1C7; +1 905 413 3474;
Fax:905 413 4682; TieLine 969; email: umavs@ca.ibm.com