[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

ECMA-cyrillic alias iso-ir-111 sore



Charset gurus,

There is a long-standing problem with one IANA-registered Latin/Cyrillic
charset known as ISO-IR-111 or ECMA-Cyrillic. This charset is the European
variant of the popular Soviet standard KOI-8. KOI-8 is a popular 8-bit Latin/
Cyrillic charset whose low half is US-ASCII or ISO_646.irv:1983 (depending on
your preference of the dollar sign or the international currency symbol at code
point 44 octal), whereas the high half has Russian letters at code points 300
through 377 octal in the so-called KOI correspondence order (an order such that
if bit 7 of KOI-8 Russian text is stripped a case-reversed English
transliteration is produced). The general term "KOI-8" (not registered with
IANA) means the above, but says nothing about code points 200 through 277
octal. Systems based on ISO standards generally interpret octets 200-237 octal
as C1 high controls per ISO 6429. Octets 240-277 octal are supposed to be
graphic (GR) characters in the ISO world, but the general term "KOI-8" leaves
them undefined.

The charset registered in the ISO International Register under No. 111
(ECMA-Cyrillic) was ECMA's version of KOI-8 with Russian letter IO and
Belorussian, Ukrainian, Serbocroatian, and Macedonian characters assigned to
code points 240-277 octal which are left undefined by the general term "KOI-8".
A scanned image of the official (paper) registration document defining
ISO-IR-111 can be found in:

http://www.itscj.ipsj.or.jp/ISO-IR/111.pdf

Examination of the above document reveals that the charset registered in ISO-IR
under No. 111 is indeed as described above, a KOI-8 variant with the Russian
letters in the KOI correspondence order. The problem is that the current IANA
character-sets document lists RFC 1345 as the primary reference for this
charset, and the description of this charset in RFC 1345 is seriously in error.
RFC 1345 lists the upper characters of ISO-IR-111 in a completely wrong order,
effectively defining a totally different charset (a mix between ISO_8859-5:1988
and windows-1251 no less!).

Since the only Internet document describing charset ISO-IR-111 is the erroneous
RFC 1345 and since while acknowledging the ISO-IR registry as the original
source the IANA character-sets document still lists RFC 1345 as reference with
no warning about it being in error, it is certain that of the people
implementing Internet charset handling software few have had any reason to look
at the ISO-IR registration document and most have instead logically assumed
that RFC 1345 had the correct definition of ISO-IR-111. As a result it is
certain that a great quantity of Internet software in use today interprets
charset names "ECMA-Cyrillic" and "ISO-IR-111" as meaning the mix of
ISO_8859-5:1988 and windows-1251 defined in RFC 1345 rather than the charset
registered in ISO-IR under No. 111.

This situation creates a problems for people wishing to use the charset
registered in ISO-IR under No. 111 on the Internet. While ISO_8859-5:1988 is
the current international standard (the current Russian Federation GOST
standard is similar) and places the Russian letters in their native alphabetic
order, the older KOI-8 standard is still popular in many environments. The
people's love of KOI-8 no matter what the current standards say is the reason
why most of the Internet today uses KOI8-R charset (RFC 1489) for Russian text.

However, KOI8-R has a feature making it unsuitable for some environments.
Specifically, KOI8-R defines code points 200-237 octal as graphic characters,
and such use of these code points cannot be correctly handled by terminal
equipment (e.g. DEC VT300 terminal series) and text processing software (e.g.
the terminal drivers and text editors in some versions of UNIX) designed for
the ISO world in which these code points are ISO 6429 control characters.
People using such equipment and software and wishing to use KOI-8 must use a
version of KOI-8 other than KOI8-R. Such people naturally want a charset with
code points 0-177 octal matching US-ASCII or ISO_646.irv:1983, code points
200-237 octal being C1 controls of ISO 6429, and 300-377 octal being Russian
letters in KOI correspondence order.

What should be at code points 240-277 octal? In practice people who just want
KOI-8 don't really care, but since it usually feels better to assign a rarely
used code point to something rather than leave it completely undefined, since a
handy assignment of these code points exists in ISO-IR-111, and since those
extra characters may come useful to some people, ISO-IR-111 is naturally the
KOI-8 variant of choice for the people in circumstances described above.

This is the motivation behind the desire to use ISO-IR-111 instead of
ISO_8859-5:1988 or KOI8-R. However, in applications involving Internet
protocols the problem arises of how to label the use of this charset given the
current confused status of its IANA registration. To mend this problem I ask
IANA to take the following corrective actions:

1. Amend the character-sets document to not list RFC 1345 as a reference for
charset ECMA-cyrillic alias iso-ir-111. List the ISO-IR registry as the only
reference and add a note indicating that RFC 1345 is in error.

2. Register KOI8-E as an alias for charset ECMA-cyrillic alias iso-ir-111. The
reason for doing so is that so many people have been misled for so long into
thinking that ECMA-Cyrillic aka ISO-IR-111 is the mix of ISO_8859-5:1988 and
windows-1251 defined in RFC 1345 rather than the KOI-8 variant designed by ECMA
and defined in the ISO-IR registration document that the people wishing to use
the latter charset naturally want a different name for it. I believe that it is
best for the name to explicitly contain "KOI8" or "KOI-8" in it, and KOI8-E
(for ECMA, European, or extended) is the name used by Roman Czyborra in his
superb Cyrillic Charset Soup page:

http://czyborra.com/charsets/cyrillic.html

Thanks for reading and TIA for acting,

MS