
Don't we need a standard way to represent language in Unicode?



[ietf-charsets' charter is to decide how best to represent text on
 the Internet, now that ASCII is no longer enough for most Internet users.
 Most members of the list seem happy with Unicode.  Mr. Ohta is violently
 opposed, and has proposed extending Unicode with several bits *per character*
 to indicate language.  When that was shot down, he proposed an extension
 of ISO 2022 instead which completely ignores Unicode.  I think something
 midway between Mr. Ohta's two proposals might make more sense.  -dan]
 
I am concerned that Japan may ignore Unicode [see the archives
of INSOFT-L referred to in my last message] because it fails to address
an important need from their point of view: encoding language.

A mixed Korean/Japanese/Chinese document in *plain* Unicode CANNOT be
displayed in a palatable way: Han unification assigns a single code point
to ideographs whose customary glyph shapes differ among the three
languages, so a renderer has no way to know which national convention
to follow.  This renders plain Unicode unacceptable for transmitting
this type of document over the Internet.
Worse, there is no standard way of marking up a Unicode document to
indicate language, so even adorned Unicode cannot be used interoperably for
this kind of document on the net.

Of course, we could wait for UCS-4 to solve this problem, but it isn't
anywhere near ready, won't be for many years, and IMHO is overkill for
the problem at hand.

A quick and dirty way to address the problem would be to define a set of
control codes as an extension of Unicode to indicate language, in much
the same way as ISO 2022 defines control codes to switch character sets.
Display applications which do not support different fonts for
different languages can simply ignore the codes.
Applications which deal with non-Han languages need not bother with
the codes, as plain Unicode is sufficient for those languages.
The codes should cause little overhead, as most documents do
not change language very frequently, and they can in any case be omitted
when not needed.

Unless something like this is done in a way that gains at least 
grudging acceptance in Japan, we may not end up with a truly interoperable
method of representing text on the Internet!

Folks, do you want Unicode to be the universal way to represent text, as I do?
Do you agree that there is a serious disconnect with Japan on the
usability of Unicode for mixed C/J/K text?
Isn't a standard way of layering language encoding on Unicode desirable?
Or am I way out in left field here?
- Dan Kegel (dank@alumni.caltech.edu)