
Registering GBK and GB18030 in the IANA charset registry



Hello all,

I hereby propose the inclusion of GBK and GB18030 charsets in the IANA
charset registry.

(Hope you don't mind all the CCs.  I think it would be nice if all the
GB18030 experts can comment and contribute to this registration
as a community effort.  :-)

GB2312 (1980) has been superseded by GBK (circa 1993?) and GB18030 (2000).
GBK has been widely used by mainland Chinese for a very long time, and
GB18030, which supersedes GBK, is a mandatory standard in Mainland China
as of August 30, 2001.

GBK extends GB2312 to include the CJK compatibility area defined in
Unicode 2.1.  GBK quickly became very popular in China. All major
GNU/Linux and UNIX platforms (Red Flag, XteamLinux, Turbolinux,
BluePoint, COSIX, etc.), as well as Microsoft Windows, have supported
GBK for years.  It is equivalent to codepage 936 in Windows.
Many web pages already use GBK encoding.  For example, the character
"Rong" in Premier Zhu Rongji's name is missing from GB2312 and can be
displayed only in GBK.
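
To make the "Rong" example concrete, here is a minimal sketch in Python.
It assumes that Python's built-in "gb2312", "gbk" and "gb18030" codecs
are faithful to the respective standards; the helper name can_encode is
made up for this illustration.

```python
# U+9555 is the "Rong" of Zhu Rongji; it is written as an escape so
# that this snippet itself stays pure ASCII.
RONG = "\u9555"


def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Check the same character against the three charsets discussed here.
in_gb2312 = can_encode(RONG, "gb2312")
in_gbk = can_encode(RONG, "gbk")
in_gb18030 = can_encode(RONG, "gb18030")
print(in_gb2312, in_gbk, in_gb18030)
```

A page that contains this character therefore cannot honestly be
labelled as GB2312, which is one reason a registered name for GBK
matters.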

GB18030 further extends GBK.  It covers 1-byte, 2-byte and 4-byte
codepoints while maintaining full backward compatibility with GB2312
and GBK.  It specifies a roundtrip conversion to and from
Unicode/ISO-10646-1, and the 4-byte portion of GB18030 is calculated
algorithmically to map to corresponding codepoints in
Unicode/ISO-10646-1.  Thus, this will be the first Chinese national
standard that covers all ethnic languages (Chinese, Tibetan, Mongolian,
etc.) used in China.
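
As a rough illustration of the algorithmic 4-byte mapping, the sketch
below handles only the supplementary planes (U+10000 to U+10FFFF),
where the mapping is a plain linear offset.  It assumes the byte ranges
0x81..0xFE for the first and third bytes and 0x30..0x39 for the second
and fourth, and that 90 30 81 30 corresponds to U+10000.  The 4-byte
codes for BMP characters additionally need the range data from the
mapping table, which is not reproduced here.  All function names are
invented for this example.

```python
def four_byte_linear(b1, b2, b3, b4):
    """Linear index of a 4-byte GB18030 sequence.

    Assumes b1 and b3 are in 0x81..0xFE and b2 and b4 in 0x30..0x39.
    """
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# Assumption: 90 30 81 30 is the 4-byte code for U+10000.
SUPPLEMENTARY_BASE = four_byte_linear(0x90, 0x30, 0x81, 0x30)


def encode_supplementary(code_point):
    """Map a supplementary-plane code point to its 4 GB18030 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are handled here")
    index = SUPPLEMENTARY_BASE + (code_point - 0x10000)
    # Peel off the four mixed-radix digits, last byte first.
    index, r4 = divmod(index, 10)
    index, r3 = divmod(index, 126)
    r1, r2 = divmod(index, 10)
    return bytes([0x81 + r1, 0x30 + r2, 0x81 + r3, 0x30 + r4])


def decode_supplementary(data):
    """Inverse of encode_supplementary for a 4-byte sequence."""
    b1, b2, b3, b4 = data
    return 0x10000 + four_byte_linear(b1, b2, b3, b4) - SUPPLEMENTARY_BASE


# Cross-check against Python's own gb18030 codec.
for cp in (0x10000, 0x20000, 0x10FFFF):
    assert encode_supplementary(cp) == chr(cp).encode("gb18030")
    assert decode_supplementary(encode_supplementary(cp)) == cp
```

The same codec also shows the three code lengths side by side: "A"
encodes to one byte, a common Han character to two, and a Tibetan
letter such as U+0F40 to four.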

On behalf of fellow Chinese, I would really love to see GBK and GB18030
recognized as official charsets by IANA.

Indeed, zh_CN.GBK and zh_CN.GB18030 have been supported in glibc
and GNU iconv for quite some time.  Patches to add GB18030 support
exist for XFree86 4.1.x and Qt-2.3.x.  Also, Mozilla, Netscape and
MSIE already recognize and support both GBK and GB18030.
A test page (courtesy of James Su) is at:

	http://www.turbolinux.com.cn/~suzhe/

This page (Content-Type: text/html; charset=gb18030) can be displayed
in full under Turbolinux 7.0 and XteamLinux 4.0.  After installing the
GB18030 upgrade by Microsoft, the page also displays correctly under
Windows NT/2000/XP, albeit with some fonts missing as the font provided
by Microsoft isn't as complete.  ;-)

Note that even in e-mail and webpages where GBK is used, Microsoft still
calls it "charset=gb2312".  It is a misnomer, but perhaps a compromise
to maintain backward compatibility, and perhaps because GBK isn't yet in
IANA.  There are also some people who may have used "x-gbk" and "x-x-gbk",
but of course, that is also non-standard.  It would be best if we
standardized this to "GBK" once and for all.  After all, GBK is a Chinese
national specification used by millions if not billions of people.
It is not some private vendor implementation, so the use
of "x-" is inappropriate.  :-)
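
Until a single name is registered, receiving software has to cope with
all of these labels.  The following is only a sketch of one way to do
that: the alias table LABEL_TO_CODEC and the function names are
hypothetical, and treating "gb2312" as GBK is a deliberate choice that
relies on GBK being a superset of GB2312.

```python
from email.message import Message

# Hypothetical alias table: every label mentioned above is mapped to a
# codec name that Python knows.  "gb2312" is decoded as GBK on purpose,
# because mislabelled GBK text would otherwise fail to decode.
LABEL_TO_CODEC = {
    "gb2312": "gbk",
    "x-gbk": "gbk",
    "x-x-gbk": "gbk",
    "gbk": "gbk",
    "gb18030": "gb18030",
}


def charset_label(content_type):
    """Extract the lower-cased charset parameter from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body, content_type):
    """Decode raw bytes according to the (possibly non-standard) label."""
    label = charset_label(content_type)
    # Unknown labels are passed through unchanged; no label means ASCII.
    codec = LABEL_TO_CODEC.get(label, label or "ascii")
    return body.decode(codec)


# A GBK-only character sent under the misleading "gb2312" label.
raw = "\u9555".encode("gbk")
text = decode_body(raw, "text/html; charset=gb2312")
```

With a registered "GBK" name, senders could label such text correctly
and a table like this would shrink to a list of legacy aliases.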

I am very glad that BIG5-HKSCS has been registered.  It would be
wonderful if we could get GBK and GB18030 registered too.  :-)

My question: How should we proceed?

The Chinese government has published a printed standard for GB18030.
It is in Chinese, and is unfortunately not available on-line.
However, an unofficial yet authoritative GB18030 Summary written by
Dirk Meyer at Adobe Systems is at:

  ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf

The complete GB18030 <-> Unicode conversion data (mappings and ranges)
are defined here:

  http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/xml/gb-18030-2000.xml

Markus Scherer at IBM has also written some excellent documentation:

  http://oss.software.ibm.com/cvs/icu/~checkout~/charset/source/gb18030/gb18030.html

Fellow Chinese developers like James Su, Wang Shouhua, Wu Jian, Leon
Zhang, etc. have also posted some GB18030 papers in Chinese on the Internet.

I am not sure how to contact the official Chinese standards committee
that defined the GB18030 standard, but I am sure some of you may know.
:-)

I just found a copy of the Big5-HKSCS registration on-line.  I guess we
can use that as a template, and follow RFC 2278 to write a formal
application for GBK and GB18030 (in ASCII) and submit it.  BTW, that
registration is at:

   http://lists.w3.org/Archives/Public/ietf-charsets/2000OctDec/att-0000/01-Submisson_to_IANA.txt

   (Yes, there is a small typo: Submisson instead of Submission.  :-)

Any comments, suggestions and guidance are welcome!  :-)

Best Regards,

Anthony Fok

-- 
Anthony Fok Tung-Ling
ThizLinux Laboratory   <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org>       http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp!           http://www.olvc.ab.ca/