
Re: ECMA-cyrillic alias iso-ir-111 sore



Martin Duerst <duerst@w3.org> wrote:

> It would help everybody if you resent your mail with the proposals
> at the beginning of the mail, and the justification afterwards
> (i.e. US English order :-).

OK, here are my two proposals again:

1. Amend the character-sets document to not list RFC 1345 as a reference for
charset ECMA-cyrillic alias iso-ir-111. List the ISO-IR registry as the only
reference and add a note indicating that RFC 1345 is in error.

2. Register KOI8-E as an alias for charset ECMA-cyrillic alias iso-ir-111.

The reasons are in my original post and more below.

> Please note that this clearly says "Right hand part of the Cyrillic
> Alphabet". While this is really strange (the Cyrillic alphabet doesn't
> have hands), it intends to say that it defines only the right part
> (i.e. hex 0x80-0xFF) of some actual encoding.

Code points 0x00-0x7F (or 0-177 octal) coincide with US-ASCII. The ISO 2022
model defines ALL charsets by halves.

> RFC 1345 contains many other cases where only part of an actual encoding
> is identified.

I think you've missed my point. The discrepancy I'm talking about is not
whether the low US-ASCII half is spelled out or silently implied. It's the
meat, the right Cyrillic part that is listed completely incorrectly in RFC
1345. The actual charset registered with ISO-IR under No. 111 has lowercase
Cyrillic letters in ranges 240-257 and 300-337 octal (0xA0-0xAF and 0xC0-0xDF)
and uppercase ones in 260-277 and 340-377 octal (0xB0-0xBF and 0xE0-0xFF); RFC
1345 lists them the other way around. The actual charset has Russian letters in
KOI correspondence order; RFC 1345 lists them in alphabetical order. The actual
charset has the Balkan DJE and GJE before the Russian IO; RFC 1345 lists them
the other way around. This is the problem I'm
talking about. I don't see any problem with the "part of an actual encoding"
issue: it makes absolutely no difference whether the low US-ASCII half is
spelled out ad nauseam for every 8-bit charset or simply referenced once as the
default.
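
To make the case discrepancy concrete, here is a minimal sketch in Python. It
encodes only the octal ranges quoted in the paragraph above; the function names
are made up for illustration, and it assumes every position in a range behaves
alike (a few positions in the real table may hold non-letter characters). It
does not model the ordering differences (KOI vs. alphabetical, DJE/GJE vs. IO).

```python
# Octal ranges exactly as quoted in this message (inclusive bounds).
# Assumption: every position in a range is treated alike, although a few
# positions in the real table may hold non-letter characters.
ISO_IR_111_LOWER = [(0o240, 0o257), (0o300, 0o337)]
ISO_IR_111_UPPER = [(0o260, 0o277), (0o340, 0o377)]


def _in_ranges(byte, ranges):
    """True if byte falls inside any of the inclusive (lo, hi) ranges."""
    return any(lo <= byte <= hi for lo, hi in ranges)


def case_per_iso_ir_111(byte):
    """Classify a byte value by the ISO-IR 111 layout described above."""
    if byte < 0o200:
        return "ascii"  # low half coincides with US-ASCII
    if _in_ranges(byte, ISO_IR_111_LOWER):
        return "lower"
    if _in_ranges(byte, ISO_IR_111_UPPER):
        return "upper"
    return "other"  # 0o200-0o237, not discussed in this message


def case_per_rfc_1345(byte):
    """Same ranges with the cases swapped, as RFC 1345 is said to list them."""
    swapped = {"lower": "upper", "upper": "lower"}
    result = case_per_iso_ir_111(byte)
    return swapped.get(result, result)


if __name__ == "__main__":
    for byte in (0o241, 0o261, 0o301, 0o341):
        print(f"{byte:#o} ({byte:#04x}): "
              f"ISO-IR 111 says {case_per_iso_ir_111(byte)}, "
              f"RFC 1345 says {case_per_rfc_1345(byte)}")
```

Any byte in the upper half that one function calls "lower" the other calls
"upper", which is exactly why text exchanged between the two kinds of
implementation comes out with its case inverted.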

> It is unclear what these registrations (with labels mostly
> of the form ISO-IR-foo) are actually standing for.

No, except for the completely busted 111 I'd say the rest are perfectly fine
and clear.

(Actually there is one more black eye in the Cyrillic charset arena, but there
 it was the [inter]national standards bodies themselves that goofed, not the
 Internet folks. The charset registered with ISO-IR under No. 153 is labeled
 GOST_19768-74 but is actually GOST_19768-87. 19768-74 was the original KOI-8
 standard. But here I'm not blaming IANA or Keld Simonsen or IETF or whomever,
 as it was either GOST or ISO clerks that goofed here: the 153 registration
 document says GOST 19768-74 on it, even though it clearly defines -87 and not
 -74. Moan.)

> It is difficult to assert 'great quantity'.

OK, the "great quantity" was a logical guess on my part. But I just did a tiny
bit of actual research:

> What would be helpful is to
> have at least one example each of:
> - Software implementing ISO-IR-111 according to the official document

GNU recode 3.5.

> - Software implementing ISO-IR-111 according to RFC 1345

GNU recode 3.4.

> But just defining another alias doesn't solve the problem of differing
> implementations.

Well, if the new alias is published simultaneously with the note in the
official character-sets document explaining what the correct charset really is,
all implementations knowing the new alias would necessarily be new ones that
implement the charset correctly. Old implementations would not recognize the
new name at all. (If someone takes the trouble of adding the new name to old
software s/he will necessarily notice the correction in the charset definition
and hopefully not produce a program that interprets the new name as meaning the
bogus RFC 1345 definition.)
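
The argument above can be sketched as a pair of alias tables. This is only an
illustration under stated assumptions: the two dictionaries and the table
identifiers in them are hypothetical, standing in for an old implementation
built from RFC 1345 and a new one built from the ISO-IR document.

```python
# Hypothetical alias tables; the table identifiers are made up for
# illustration and do not refer to any real software.
OLD_IMPLEMENTATION = {
    "ecma-cyrillic": "table-per-rfc-1345",
    "iso-ir-111": "table-per-rfc-1345",
}
NEW_IMPLEMENTATION = {
    "ecma-cyrillic": "table-per-iso-ir",
    "iso-ir-111": "table-per-iso-ir",
    "koi8-e": "table-per-iso-ir",  # the proposed new alias
}


def resolve(label, aliases):
    """Return the table id for a charset label, or None if it is unknown.

    Charset labels are matched case-insensitively.
    """
    return aliases.get(label.strip().lower())


if __name__ == "__main__":
    for label in ("ECMA-cyrillic", "KOI8-E"):
        print(label,
              "old:", resolve(label, OLD_IMPLEMENTATION),
              "new:", resolve(label, NEW_IMPLEMENTATION))
```

Under these assumptions the old names remain ambiguous between the two tables,
but the new name either resolves to the correct table or is not recognized at
all, so it can never silently select the bogus definition.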

But perhaps an even more important reason for registering the name KOI8-E as an
alias for ECMA-Cyrillic is that it's much more descriptive. Assume for the
moment that in a given system the recognition of charsets is left up to the
human user, as with a user manually looking at Content-Type: headers in a
non-MIME mailer. (Or the software implements the charset incorrectly per RFC
1345 and is forced into manual mode by an alias it doesn't recognize.) When
seen by a human user familiar with charset basics but not with the full ugly
story, the name "ECMA-Cyrillic" produces a knee-jerk reaction "what's that?",
while the knee-jerk reaction to "KOI8-E" would be "ahh, it's another KOI-8
variant". See the difference? I would much rather get the latter reaction. With
that reaction the silly mistake of RFC 1345 would probably never have happened
in the first place; it certainly resulted from the former reaction.

> If we want to clear up things completely, a new registration
> would be much better.

It would be fine with me, but what about IANA? The charset registration
procedure does not invent new charsets, it merely catalogs ones invented by
others. So however it's registered with IANA, the actual charset (the normative
reference) is the ISO-IR document. We have an IANA registration for this
charset. A troubled one, but existing nonetheless. How can you have two
independent IANA registrations for one actual charset (one normative
reference)? Or actually you can, and it's called an alias. That's what I was
getting at.

MS