
Re: Proposal for additional Aliases to IANA registry of character sets



For better or worse, the IANA registry is used as a central repository of names for character set mappings. In particular, the XML Standard (http://www.w3.org/TR/REC-xml) is driving the registration of many encodings:

4.3.3 Character Encoding in Entities
...

It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an "x-" prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).
...

The IANA registry is thus serving the very important function of cross-correlating the different terms for charsets used in a great many different functions. On the principle of lenient acceptance, additional aliases should be allowed. Of course, the recommended names should be strongly preferred, in whatever is output.
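
As a rough illustration of lenient acceptance combined with strict output, here is a minimal Python sketch. The table is a small illustrative subset of aliases for ISO-8859-1, not the full registry, and the names `REGISTRY`, `build_index`, and `resolve` are hypothetical helpers, not part of any standard library.

```python
# Illustrative subset only; the authoritative list is the IANA registry.
# Keys are the preferred names, values are aliases accepted on input.
REGISTRY = {
    "ISO-8859-1": ["ISO_8859-1", "latin1", "l1", "IBM819", "CP819"],
    "UTF-8": [],
}


def build_index(registry):
    """Map every lowercased name and alias to its preferred name."""
    index = {}
    for preferred, aliases in registry.items():
        for name in [preferred, *aliases]:
            key = name.lower()
            # A name may belong to one charset only.
            if index.get(key, preferred) != preferred:
                raise ValueError(
                    f"{name!r} is already registered for {index[key]!r}"
                )
            index[key] = preferred
    return index


INDEX = build_index(REGISTRY)


def resolve(label):
    """Return the preferred name for a label, or None if it is unknown.

    Matching is case-insensitive, as the XML recommendation asks.
    """
    return INDEX.get(label.strip().lower())
```

A caller would accept any alias on input (`resolve("LATIN1")`, `resolve("cp819")`) but write only the returned preferred name on output. An unregistered label yields `None`, which corresponds to treating the name as unknown.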

Mark
___
mark.davis@us.ibm.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

ned.freed@mrochek.com
2002.08.06 09:52

To: Uma Umamaheswaran/Toronto/IBM@IBMCA
cc: Chris.Newman@Sun.COM, ietf-charsets@iana.org
Subject: Re: Proposal for additional Aliases to IANA registry of character sets


> Chris:

> As far as I know, the IANA registered names are also used for INTRANET
> using IETF protocols.

That's perhaps true but beside the point. We're dealing with a parameter
namespace for _Internet_ protocols here.

> The IBM corporate registration (started long before the IETF age) has
> included in its numbering scheme coded character sets from many different
> sources -- including several of the ISO 7-bit and 8-bit standards, the many
> ISO-2022-based national standards, several sets defined by non-IBM vendors,
> etc. These numbers become aliases of the form IBM-nnnnn.

> If we state that "the primary name assigned in the IANA registry is
> the name that is 'Strongly Encouraged to be used' for OPEN interchange
> (when the charset is not constrained in any manner)", then use of strings
> such as ISO-8859-1 can be promoted widely. In these cases, the ALIASes are
> meant to be used for "limited use contexts". With the printer MIB
> numbers, even in the IETF open context there will be multiple 'names'.

The problem is that the present registry doesn't support making such
distinctions. The current intent is that the primary name should be used but
all aliases should be recognized by all implementations.

As such, adding new aliases to an existing and widely used charset means
updating a very large installed base. Products that support on-the-fly updating
of charset tables are the exception, not the rule.

This makes such changes potentially VERY disruptive.

Nor would making the distinction you propose go far enough IMO. To be truly
effective there would need to be a way of listing a set of aliases for a given
charset that cannot conflict with other names and aliases yet MUST NOT be used
on the Internet.
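
To make that idea concrete, here is a hedged Python sketch of a registry with such a reserved class of aliases. Nothing like this exists in the present registry: the `CharsetRegistry` class, its `restricted` parameter, and the label `IBM-00819` (written in the IBM-nnnnn form mentioned above) are all hypothetical, and raising an error for reserved names is just one possible policy.

```python
class CharsetRegistry:
    """Hypothetical registry that separates Internet aliases from reserved ones."""

    def __init__(self):
        # lowercased name -> (preferred name, allowed on the Internet?)
        self._names = {}

    def register(self, preferred, aliases=(), restricted=()):
        entries = [(preferred, True)]
        entries += [(alias, True) for alias in aliases]
        entries += [(name, False) for name in restricted]
        # Check every name first so a conflict leaves the registry unchanged.
        for name, _ in entries:
            owner = self._names.get(name.lower())
            if owner is not None and owner[0] != preferred:
                raise ValueError(f"{name!r} conflicts with {owner[0]!r}")
        for name, internet_ok in entries:
            self._names[name.lower()] = (preferred, internet_ok)

    def resolve(self, label, internet=True):
        """Return the preferred name, or None if the label is unknown."""
        entry = self._names.get(label.lower())
        if entry is None:
            return None
        preferred, internet_ok = entry
        if internet and not internet_ok:
            raise ValueError(f"{label!r} is reserved for limited-use contexts")
        return preferred


registry = CharsetRegistry()
registry.register("ISO-8859-1", aliases=["latin1"], restricted=["IBM-00819"])
```

Reserved names still occupy the shared namespace, so they cannot collide with any other name or alias, but only a caller that passes `internet=False` gets them resolved.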

> Unfortunately any ALIAS has a tendency to leak and they have to be
> equivalenced by implementations expecting to respect the aliases. Short of
> BANNING aliases this cannot be avoided.

By acting to add such aliases to the general list we are basically saying
that implementations done in good faith in accordance with the standards
are at fault, whereas implementations that violated the standards are not.

I strongly object. This is no way to run a railroad.

Now, I wouldn't be happy but I'd perhaps reach a different conclusion given
evidence of widespread use of an unregistered alias for a charset on the
Internet. But you yourself have stated that the issue here is use in limited
contexts.

> Also, as Markus has stated, singling out the 8859-1 related IBMxxxx is not
> justified, nor is deleting the existing aliases for many of the ISO
> standards possible without impacting many existing implementations.

You have it exactly backwards. Adding additional aliases to commonly used
charsets is an act that singles out compliant implementations. Refusing
to add them singles out noncompliant implementations only. And absent
any indication such implementations are commonplace, I have absolutely
no problem with that.

Ned

