
Re: Volunteer needed to serve as IANA charset reviewer



On Wed September 6 2006 18:58, Ned Freed wrote:
> > I concur with the need to maintain the current charset registry to
> > support legacy apps that use it.
> 
> > And I think Ned would be an excellent choice for reviewer, though it
> > wouldn't bother me if he could have the assistance of people with
> > specialized expertise in Asian writing schemes.
> 
> Any such assistance would be hugely welcome. As an aside, it would also be nice
> if more people would post comments to the list...

OK.  I concur with most of what has already been said by others, specifically
that if a charset (i.e. something meeting the definition of charset) is in
use, it ought to be registered; using the registry as a way to force some
agenda is a very bad idea.  Also that Ned would be an excellent choice for
reviewer, and I would add that I fully support his stated plan to overhaul
the existing registry, which has long been in need of such an overhaul (e.g.
the registration procedure has long said that "ASCII" is disallowed, yet it
is in fact registered as an alias).

A few differences of opinion:
Keith Moore wrote:
> > But I do think that use of
> > multiple CESs in a new protocol should require substantial
> > justification, and that UTF-8 should be presumed to be the CES of
> > choice for any new protocol that requires ASCII compatibility for its
> > character representation.

There may well be application areas in which new protocols cannot fully
support Unicode, which underlies utf-8, because of character set size,
huge tables needed for normalization, etc. (see sections 3.1 (paying particular
attention to "memory-starved microprocessors") and 3.4 of RFC 1958).  Not all
protocols need to fully support utf-8 directly; the highly successful mail
system, for example, supports only a subset of ANSI X3.4 in message header
fields, yet it allows pass-through of utf-8 and other charsets via RFC 2047
mechanisms as amended by RFC 2231 and errata.
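
To make the pass-through concrete, here is a minimal sketch using Python's
standard email.header module.  The helper names and the sample subject text
are made up for illustration; the point is only that the octets placed in the
header field stay within ASCII while the text is labelled as utf-8 and can be
recovered by the receiver.

```python
from email.header import Header, decode_header


def encode_header_value(text: str) -> str:
    """Wrap text as RFC 2047 encoded-word(s) labelled with charset utf-8.

    Hypothetical helper name; it only delegates to email.header.Header.
    """
    return Header(text, charset="utf-8").encode()


def decode_header_value(value: str) -> str:
    """Decode each part of a header value using the charset it is labelled with."""
    parts = []
    for raw, charset in decode_header(value):
        if isinstance(raw, bytes):
            # Unlabelled byte parts are plain ASCII header text.
            parts.append(raw.decode(charset or "ascii"))
        else:
            # decode_header returns str when nothing was encoded.
            parts.append(raw)
    return "".join(parts)


subject = "Grüße aus Köln"           # sample text, not from any real message
wire = encode_header_value(subject)  # what actually appears in the header field
print(wire)                          # ASCII only, of the form =?utf-8?...?=
print(decode_header_value(wire))     # the original text again
```

The header field itself never carries raw utf-8 octets here, which is the
distinction between "able to use" and "directly supports".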

Ted Hardie wrote:
> > This question is motivated, not by a strong love for Unicode,
> > but by the observation that RFC 2277 requires it and that the
> > IETF is shifting toward it in a number of areas.

To be precise, RFC 2277 says:
"   Protocols MUST be able to use the UTF-8 charset, which consists of
   the ISO 10646 coded character set combined with the UTF-8 character
   encoding scheme, as defined in [10646] Annex R (published in
   Amendment 2), for all text.

   Protocols MAY specify, in addition, how to use other charsets or
   other character encoding schemes for ISO 10646, such as UTF-16, but
   lack of an ability to use UTF-8 is a violation of this policy; such a
   violation would need a variance procedure ([BCP9] section 9) with
   clear and solid justification in the protocol specification document
   before being entered into or advanced upon the standards track.

   For existing protocols or protocols that move data from existing
   datastores, support of other charsets, or even using a default other
   than UTF-8, may be a requirement. This is acceptable, but UTF-8
   support MUST be possible.

   When using other charsets than UTF-8, these MUST be registered in the
   IANA charset registry, if necessary by registering them when the
   protocol is published.
"
Several points:
1. "MUST be able to use" is a bit different from "requires" (see the above
   example of the mail system, which is able to use utf-8 by the mechanisms
   noted, but which does not require and in fact cannot directly accommodate
   raw utf-8).
2. The explicitly stated policy of allowing alternative charsets is important.
3. Most important, note that 2277 explicitly requires registration.

> I have no problem with UTF-16 or
> UTF-32 if there is a compelling reason to allow them,

Well, neither of them (nor their "BE" and "LE" variants) is suitable for use
with MIME text types, which precludes their use in a number of important
applications.  And one thing the charset registry sorely needs is a more
explicit indication of which charsets are/are not suitable for such use
(heck, some registrations have lacked the required statement of
[un]suitability, so even groping through all of the registrations is of
no use (and don't get me started on RFC 1345 issues)).
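
One commonly cited reason for the unsuitability, offered here as an
illustration rather than a complete account, is that RFC 2046 section 4.1.1
requires line breaks in MIME text types to appear as the octet sequence CRLF.
The sketch below, with a made-up helper name and sample text, shows that utf-8
preserves that octet sequence while UTF-16 does not, and that a single UTF-16
code unit can even produce those two octets by accident.

```python
CRLF = b"\r\n"


def contains_crlf_octets(octets: bytes) -> bool:
    """Return True if the raw octet sequence 0x0D 0x0A occurs anywhere."""
    return CRLF in octets


text = "line one\r\nline two"  # sample text with one line break

utf8_octets = text.encode("utf-8")
utf16_octets = text.encode("utf-16-be")

# utf-8 keeps CR and LF as single octets, so the line break is visible
# to software that scans the octet stream for CRLF.
print(contains_crlf_octets(utf8_octets))

# UTF-16 represents CR and LF as two octets each (00 0D 00 0A), so the
# CRLF octet pair never appears, and NUL octets are scattered throughout.
print(contains_crlf_octets(utf16_octets))
print(b"\x00" in utf16_octets)

# Conversely, the single code unit U+0D0A serializes to the octets 0D 0A,
# which looks like a line break to octet-oriented software but is not one.
print("\u0d0a".encode("utf-16-be") == CRLF)
```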