
ietf-charsets archives [was: Re: Unicode progress]



I just finished reading the ietf-charsets archives.
It looks like [meta?-]debate was still raging furiously as of a month ago
on the ietf-charsets list over goals and encoding.
Let me provide a novice's summary, for the possible benefit of wnils people.
I apologize in advance for any inaccuracies.

---
Everybody wants to work together to define a super character set to handle 
all the world's languages.

It seems there are two major families of super character sets,
ISO 2022 (for example, used by the X consortium and the MULE text editor), and 
ISO 10646 (for example, used more or less by Plan 9 and the SAM text editor).

MIME chose to specify the character set used in the header of each item,
but this approach is not viewed as promising for the future.
RFC 1345 is ISO 10646 based, and defines a representation of ISO 10646 
using 'mnemonic' sequences.

ISO 2022 is seen as complex and too stateful to be promising for the future,
although it has seen real world use.
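
To make "stateful" concrete, here is a minimal sketch using ISO-2022-JP,
one profile of ISO 2022. The choice of ISO-2022-JP and of Python's
"iso2022_jp" codec is my assumption for illustration; neither was
discussed on the list. The point is only that escape sequences switch the
active character set, so the meaning of a byte depends on what came before it.

```python
# Illustration only: ISO-2022-JP stands in for the ISO 2022 family because
# Python ships a codec for it. It is not taken from the list discussion.
text = "a\u3042b"  # Latin 'a', HIRAGANA LETTER A, Latin 'b'

encoded = text.encode("iso2022_jp")

ESC = b"\x1b"
# Escape sequences switch the decoder between ASCII and JIS X 0208.
has_escapes = ESC in encoded
# Every byte stays in the 7-bit range; only the decoder's current state
# says whether a byte is an ASCII character or half of a two-byte code.
all_seven_bit = all(byte < 0x80 for byte in encoded)

round_trip = encoded.decode("iso2022_jp")

print(encoded)
print(has_escapes, all_seven_bit, round_trip == text)
```

A reader that starts in the middle of such a stream cannot tell which
character set is active, which is one reason the scheme is seen as awkward.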

ISO 10646 is seen as the main development path.  It has several variants
and several possible encoding schemes.  Unification of Han characters from
different languages was tried in the Unicode variant, and met with strenuous
objection, although Westerners don't understand quite what the issue is.
So Unicode is not the answer, although something close to Unicode will be.
UCS-2, UCS-3, and UCS-4 seem to be ISO 10646 related character sets of roughly
2^16, 2^24, and 2^32 codes each.  UTF-2 is a proposed encoding for all of these.

The final solution will not be based on 16-bit wide characters externally,
but will rather be >16 bits internally, and use variable length representation
externally, partly for compatibility with ASCII, partly to represent common 
symbols with short codes, even for far east languages.  It will allow 
intermixing of dozens of languages within the same paragraph.
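
As a rough illustration of what a variable-length external representation
looks like, the sketch below uses UTF-8, the present-day descendant of the
UTF-2 proposal, as a stand-in. That substitution is my assumption; the
proposals on the list differ in detail. The sample characters are arbitrary.

```python
# Illustration only: UTF-8 stands in for the variable-length encodings
# proposed on the list; the sample characters are arbitrary choices.
samples = {
    "ASCII letter": "A",
    "accented Latin letter": "\u00e9",
    "Han character": "\u4e2d",
}

# Number of bytes each character occupies in the external representation.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# ASCII text is unchanged byte for byte, which is the compatibility property.
ascii_unchanged = "plain ASCII".encode("utf-8") == b"plain ASCII"

# Several scripts can sit in one string with no mode switching.
mixed = "Hello \u4e2d\u6587 \u05e9\u05dc\u05d5\u05dd"
mixed_round_trip = mixed.encode("utf-8").decode("utf-8")

for name, n in lengths.items():
    print(name, n)
print(ascii_unchanged, mixed_round_trip == mixed)
```

Internally each character is still a single integer code; only the
external byte stream varies in length.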

The issues of bidirectionality, comparison, and equality testing were
mentioned very briefly; they may have been discussed offline at a BOF.
Perhaps someone familiar with these issues could post a word or two about
them.

A list of eight or ten important properties of the final encoding solution
was posted and more or less agreed to.
Several proposals for the encoding were posted, with no clear winner.
Most try to be compatible with the UTF-2 encoding.

For several months, people argued and had trouble communicating, although
they seemed in basic agreement about goals.  The list has been silent
for about a month until the Unicode cross-post from ietf-wnils.
---
- Dan Kegel