
Fwd: Volunteer needed to serve as IANA charset reviewer



Forwarding a contribution from Mark Davis.

--Ken

------------- Begin Forwarded Message -------------

From: Mark Davis <mark.davis@icu-project.org>
Date: Sep 6, 2006 4:44 PM
Subject: Re: Volunteer needed to serve as IANA charset reviewer
...

If the registry provided an unambiguous, stable definition of each charset
identifier in terms of an explicit, available mapping to Unicode/10646
(whether the UTF-8 form of Unicode or the UTF-32 code points -- that is just
a difference in format, not content), it would indeed be useful. However, I
suspect quite strongly that it is a futile task. There are a number of
problems with the current registry.

1. Poor registrations (minor)
There are some registered charset names that are not syntactically compliant
with the spec.
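
As an aside, checking a name against the registration syntax is mechanical. The sketch below encodes my reading of the mime-charset grammar from RFC 2978 (1-40 characters from a restricted set); verify the character set and length limit against the RFC before relying on it:

```python
import re

# mime-charset syntax, as I read RFC 2978 (verify against the spec):
# 1-40 characters drawn from ALPHA / DIGIT / - ! # $ % & ' + _ ` { } ~ ^
MIME_CHARSET = re.compile(r"^[A-Za-z0-9!#$%&'+_`{}~^-]{1,40}$")

def is_valid_charset_name(name: str) -> bool:
    """True if `name` matches the mime-charset production."""
    return MIME_CHARSET.match(name) is not None
```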

2. Incomplete (more important)
There are many charsets (such as some Windows charsets) that are not in the
registry, but that are in *far* more widespread use than the majority of the
charsets in the registry. Attempted registrations have just been left
hanging; cf. http://mail.apps.ietf.org/ietf/charsets/msg01510.html

3. Ill-defined registrations (crucial)
  a) There are registered names that have useless (inaccessible or unstable)
references; there is no practical way to figure out what the charset
definition is.
  b) There are other registrations that are defined by reference to an
available chart, but when you actually test what the vendor's APIs map to,
they actually *use* a different definition: for example, the chart may say
that 0x80 is undefined, but actually map it to U+0080.
  c) The RFC itself does not settle important issues of identity among
charsets. If a new mapping is added to a charset converter, is that a
different charset (and thus needs a different registration) or not? Does
that go for any superset? etc. We've raised these issues before, but with no
resolution (or even an attempt at one); cf.
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html
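
Point (b) is easy to reproduce. The sketch below uses Python's bundled codecs as the "vendor API" (an illustrative stand-in, not the specific vendors discussed above): the same byte can be undefined under one decoder and mapped under another, even for labels often treated as interchangeable:

```python
def decode_or_none(raw: bytes, codec: str):
    """Return the decoded string, or None where the codec leaves the byte undefined."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None

# 0x80: a chart may call it undefined, yet real decoders map it --
# and map it differently depending on which decoder you ask.
assert decode_or_none(b"\x80", "latin-1") == "\u0080"  # C1 control U+0080
assert decode_or_none(b"\x80", "cp1252") == "\u20ac"   # euro sign
# 0x81: Python's cp1252 table treats it as undefined, while other
# implementations of "the same" charset pass it through as U+0081.
assert decode_or_none(b"\x81", "cp1252") is None
```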

As a product of the above problems, the actual results obtained by using the
IANA charset names on any given platform* may vary wildly. For example,
among the IANA-registry-named charsets, there were over a million mapping
differences in total between Sun's and IBM's Java.

* "platform" speaking broadly -- the results may vary by OS (Mac vs Windows
vs Linux...), by programming language (Java), by version of programming
language runtime (IBM vs Sun's Java), or even by product (database version).
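
Counting such mapping differences is straightforward for single-byte charsets (multi-byte charsets, as in the Java comparison above, need a bigger probe space, but the idea is the same). A minimal sketch, using two of Python's codecs as the two "platforms":

```python
def single_byte_table(codec: str) -> list:
    """Decode each byte 0x00-0xFF; None marks bytes the codec leaves undefined."""
    table = []
    for byte in range(256):
        try:
            table.append(bytes([byte]).decode(codec))
        except UnicodeDecodeError:
            table.append(None)
    return table

def count_differences(a: str, b: str) -> int:
    """Number of byte values where two single-byte codecs disagree."""
    ta, tb = single_byte_table(a), single_byte_table(b)
    return sum(x != y for x, y in zip(ta, tb))

# Two "Latin-1-ish" codecs already disagree on all 32 bytes of 0x80-0x9F:
print(count_differences("latin-1", "cp1252"))  # -> 32
```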

In ICU, for example, our requirement was to be able to reproduce the actual,
observable character conversions in effect on any platform. With that
goal, we basically had to give up trying to use the IANA registry at all. We
compose mappings by scraping: calling the APIs on those platforms to do
conversions, collecting the results, and providing a different internal
identifier for any differing mapping. We then have a separate name mapping
that goes from each platform's name for each charset (the name according to
that platform) to the unique identifier. Cf.
http://icu.sourceforge.net/charts/charset/.

And based on work here at Google, it is pretty clear that -- at least in
terms of web pages -- little reliance can be placed on the charset
information. As imprecise as heuristic charset detection is, it is more
accurate than relying on the charset tags in the HTML meta element (and what
is in the HTML meta element is more accurate than what is communicated by
the HTTP protocol).
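
One crude building block of such detection, to make the point concrete (real detectors use statistical models over byte n-grams; this only shows why guessing can beat trusting a label): well-formed UTF-8 rarely occurs by accident, so a simple validity check is already informative.

```python
def looks_like_utf8(data: bytes) -> bool:
    """A crude detection heuristic: does the byte stream parse as UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The same text, encoded two ways; only one survives the UTF-8 check,
# regardless of what any charset label claims:
assert looks_like_utf8("déjà vu".encode("utf-8"))
assert not looks_like_utf8("déjà vu".encode("cp1252"))
```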

So while I applaud your goal, I suspect that it would be a huge
amount of effort for very little return.

Mark


------------- End Forwarded Message -------------

Here's what I wanted to say; but I'm being blocked -- could you forward it for me? Mark

---------- Forwarded message ----------
From: Mark Davis <mark.davis@icu-project.org>
Date: Sep 6, 2006 4:44 PM
Subject: Re: Volunteer needed to serve as IANA charset reviewer
To: Ned Freed <ned.freed@mrochek.com>
Cc: John C Klensin <john-ietf@jck.com>, Ted Hardie <hardie@qualcomm.com>, discuss@apps.ietf.org, ietf-charsets@iana.org

> I agree that we've reached a point where "use UTF-8" is what we need to be
> pushing for in new protocol development. (Note that I said UTF-8 and not
> Unicode - given the existence of gb18030 [*] I don't regard a recommendation of
> "use Unicode" as even close to sufficient. The last thing we want is to see the
> development of specialized Unicode CESes for Korean, Japanese, Arabic, Hebrew,
> Thai, and who knows what else.) And if the reason there are new charset
> registrations was because of the perceived need to have new charsets for use in
> new protocols, I would be in total agreement that a change in focus for charset
> registration is in order.
>
> But that's not why we're seeing new registrations. The new registrations we're
> seeing are of legacy charsets used in legacy applications and protocols that
> for whatever reason never got registered previously. Given that these things
> are in use in various nooks and crannies around the world, it is critically
> important that when they are used they are labelled accurately and
> consistently.
>
> The plain fact of the matter is that we have done a miserable job of producing
> an accurate and useful charset registry, and considerable work needs to be done
> both to register various missing charsets as well as to clean up the existing
> registry, which contains many errors. I've seen no interest whatsoever in
> registering new charsets for new protocols, so to my mind pushing back on, say,
> the recent registration of iso-8859-11, is an overreaction to a non-problem.
> [**]
>
> > This question is motivated, not by a strong love for Unicode,
> > but by the observation that RFC 2277 requires it and that the
> > IETF is shifting toward it in a number of areas.   More options
> > and possibilities for local codings that are not generally known
> > and supported do not help with interoperability; perhaps it is
> > time to start pushing back.
>
> Well, I have to say that to the extent we've pushed back on registrations, what
> we've ended up with is an ad hoc mess of unregistered usage. I am therefore quite
> skeptical of any belief that pushing back on registrations is a useful tactic.
>
> > And that, of course, would dramatically change the work of the
> > charset reviewer by reducing the volume but increasing the
> > amount of evaluation to be done.
>
> Even if we closed the registry completely there is still a bunch of work to do
> in terms of registry cleanup.
>
> Now, having said all this, I'm willing to take on the role of charset reviewer,
> but with the understanding that one of the things I will do is conduct a
> complete overhaul of the existing registry. [***] Such a substantive change will
> of course require some degree of oversight, which in turn means I'd like to see
> some commitment from the IESG of support for the effort.
>
> As for qualifications, I did write the charset registration specification, and
> I also wrote and continue to maintain a fairly full-featured charset conversion
> library. I can provide more detail if anyone cares.
>
>                                 Ned
>
> [*] - For those not fully up to speed on this stuff, gb18030 can be seen as an
> encoding of Unicode that is backwards compatible with the previous simplified
> Chinese charsets gb2312 and gbk.
>
> [**] - The less recent attempt to register ISO-2022-JP-2004 is a more
> interesting case. I believe this one needed to be pushed on, but not
> because of potential use in new applications or protocols.
>
> [***] - I have the advantage of being close enough to IANA that I can drive
> over there and have F2F meetings should the need arise - and I suspect
> it will.
>