Problems (and non-problems) in charset registry (was: Re: Volunteer needed to serve as IANA charset reviewer)
Hello Mark, others,
I think it's good to have such a collection of problems in the registry.
But I think it's also fair to say that not everything Mark lists
as a problem actually is one.
At 08:44 06/09/07, Mark Davis wrote:
>If the registry provided an unambiguous, stable definition of each charset identifier in terms of an explicit, available mapping to Unicode/10646 (whether the UTF-8 form of Unicode or the UTF-32 code points -- that is just a difference in format, not content), it would indeed be useful. However, I suspect quite strongly that it is a futile task. There are a number of problems with the current registry.
I think the request for an explicit, fixed mapping is a good one,
but in some cases it would come at a high cost. A typical example
is Shift_JIS: We know there are many variants, on the one hand due
to additions made by vendors (or even private parties), on the
other hand due to changes in the underlying standard (which go
back to 1983).
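To make this concrete, here is a minimal sketch (in Python, whose
codec names 'shift_jis' and 'cp932' stand in for two of the
variants; any pair of variant implementations would do): the same
byte sequence decodes differently, or not at all, depending on
which variant the label is taken to mean.

  data = b"\x87\x40"  # NEC row-13 extension area, circled digit one
  print(data.decode("cp932"))   # '\u2460', the Microsoft variant
  try:
      data.decode("shift_jis")  # plain JIS X 0208-based variant
  except UnicodeDecodeError:
      print("not defined in plain Shift_JIS")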
For an example like Shift_JIS, the question becomes whether
we want to use a single label, or whether we want to carefully
label each variant.
Benefits of a single label:
- Better chance that the recipient knows the label and can do
something with it.
- Better chance to teach people about different encodings
(it's possible to teach people the difference between Shift_JIS,
EUC-JP, and UTF-8, but it would be close to impossible to
teach them about all the various variants)
- No 'overlabeling', i.e. being more precise than necessary (in
the huge majority of cases, the actual data shows no difference).
- Usually enough for visual decoding (reading of emails and
Web pages by humans)
- Not influenced by issues outside actual character encoding
(e.g. error treatment of APIs)
Benefits of detailed labeling:
- Accurate data transmission for all data, even fringe cases
- True round-trips for a wider range of scenarios
- May be better suited for machine-to-machine processing
>2. Incomplete (more important)
>There are many charsets (such as some windows charsets) that are not in the registry, but that are in *far* more widespread use than the majority of the charsets in the registry. Attempted registrations have just been left hanging, cf. http://mail.apps.ietf.org/ietf/charsets/msg01510.html
Some of this is due to the original rather strict management of the registry.
Some of it is due to the current backlog. A lot of this is due to the fact
that nobody cares enough to work through the registration process; many
people think that 'the IETF' or 'the IANA' will just do it. The solution
is easy: Don't complain, register.
>3. Ill-defined registrations (crucial)
> a) There are registered names that have useless (inaccessible or unstable) references; there is no practical way to figure out what the charset definition is.
This is possible. Some of these registrations are probably
irrelevant; for others, it's pretty clear what the charset means
in general, even though there might be implementation differences
for some code points.
> b) There are other registrations that are defined by reference to an available chart, but when you actually test what the vendor's APIs map to, they actually *use* a different definition: for example, the chart may say that 0x80 is undefined, but actually map it to U+0080.
It's clear that for really faulty charts, the vendors should be
blamed, and not the registry.
However, the difference between the published chart and the
behavior of the actual API may be due to the fact that 0x80 is
indeed not part of the encoding as formally defined, and is mapped
to U+0080 only as part of error treatment. For most applications
(though not necessarily all), it would be a mistake to include
error processing in the formal definition of an encoding.
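A small sketch of what I mean (Python; the byte 0x80 is undefined
in Python's 'shift_jis' codec): the very same decoder produces
three different results for the same input, purely depending on
the caller's error policy, so none of these results can sensibly
be part of the encoding's definition.

  raw = b"\x80abc"
  print(raw.decode("shift_jis", errors="replace"))  # '\ufffdabc'
  print(raw.decode("shift_jis", errors="ignore"))   # 'abc'
  try:
      raw.decode("shift_jis", errors="strict")
  except UnicodeDecodeError as e:
      print("strict:", e)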
> c) The RFC itself does not settle important issues of identity among charsets. If a new mapping is added to a charset converter, is that a different charset (and thus needs a different registration) or not? Does that go for any superset? etc. We've raised these issues before, but with no resolution (or even attempt at one). Cf. http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icuhtml/design/charset_questions.html
It seems that what you would want, for your purposes, is to use a new
label if a new character gets added to a legacy encoding, but not use
a new label e.g. for UTF-8 each time a character gets added.
So things would be somewhat case-by-case.
>As a product of the above problems, the actual results obtained by using the iana charset names on any given platform* may vary wildly. For example, among the iana-registry-named charsets, there were over a million different mapping differences between Sun's and IBM's Java, total.
It would be better to express these numbers as percentages: in the
experiment you made, how many code points were mapped, and for how
many did you get differences?
Even better would be to express these numbers as percentages of
actual average data. My guess is that this would be much lower
than the percentage of code points.
This is in no way to belittle the problems, just to make sure we
put them in proportion.
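Just to sketch what such a percentage could look like (Python; I
use the 'shift_jis' and 'cp932' codecs as stand-ins for two vendor
converters, and a rough approximation of the two-byte lead/trail
ranges; the real experiment would of course use the Java
converters in question):

  def decode(codec, raw):
      try:
          return raw.decode(codec)
      except UnicodeDecodeError:
          return None

  total = differ = 0
  for lead in range(0x81, 0x100):
      for trail in range(0x40, 0xFD):
          raw = bytes([lead, trail])
          a = decode("shift_jis", raw)
          b = decode("cp932", raw)
          if a is None and b is None:
              continue  # undefined in both: in neither mapping
          total += 1
          if a != b:
              differ += 1
  print("%d/%d = %.1f%% differ"
        % (differ, total, 100.0 * differ / total))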
>In ICU, for example, our requirement was to be able to reproduce the actual, observable, character conversions in effect on any platform. With that goal, we basically had to give up trying to use the IANA registry at all.
This is understandable. The start of the IANA registry was MIME, i.e.
email. The goal was to be able to (ultimately visually) decode the
characters at the other end.
>We compose mappings by scraping: calling the APIs on those platforms to do conversions and collecting the results, and providing a different internal identifier for any differing mapping. We then have a separate name mapping that goes from each platform's name (the name according to that platform) for each charset to the unique identifier. Cf. http://icu.sourceforge.net/charts/charset/.
This is very thorough, and may look foolproof, but isn't.
One issue already mentioned is error behavior. If a first API maps
an undefined character to some code point, and a second API maps
it to another code point (or sequence), they just made different
decisions about error behavior (e.g. mapping unknown code points
to '?' or to one of the substitution characters, or dropping
them, ...). This would be particularly prominent when converting
from Unicode to a legacy encoding, because in that case there are
tons of code points that can't be converted. But this most
probably should not be part of the definition of a 'charset'.
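For illustration (Python; the Euro sign is not in JIS X 0208, so
it cannot be encoded in Shift_JIS): with a substituting error
policy, an unmappable character comes out looking exactly like a
genuinely mapped one.

  print("\u20ac".encode("shift_jis", errors="replace"))  # b'?'
  print("?".encode("shift_jis"))                         # also b'?'
  # A scraper that doesn't separate error fallback from real
  # mappings would record U+20AC -> 0x3F as if it were part of
  # the charset.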
Also, there are cases where there are no differences in
transcoding, but font differences. Examples are the treatment of
the backslash character on MS Windows systems (shown as a Yen
symbol because most (Unicode!) fonts on Japanese Windows systems
have it that way), or certain cases where the traditional and the
(Japanese) simplified variant of a character got exchanged when
moving from the 1978 to the 1983 version of the standard.
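A quick check of the backslash case (Python; implementations that
follow JIS X 0201 literally and map 0x5C to U+00A5 instead would
differ, which is itself another variant issue) shows that the
transcoding is identical; the Yen appearance is purely a font
matter:

  print(b"\x5c".decode("shift_jis"))  # '\\' (U+005C)
  print(b"\x5c".decode("cp932"))      # '\\' too: same transcoding
  # That Japanese Windows fonts draw U+005C as a Yen sign is
  # invisible to any converter-based comparison.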
>And based on work here at Google, it is pretty clear that -- at least in terms of web pages -- little reliance can be placed on the charset information.
Yes, but this is a problem on a different level than the above.
Above, you are speaking about small variant differences.
Here, you are speaking about completely mislabeled pages.
The problems with small variants don't make the labels in
the registry unsuitable for labeling Web pages. Most Web
pages don't contain characters where the minor version
differences matter, because private use and corporate
characters don't work well on the Web, and because some
of the transcoding differences are between e.g. full-width
and half-width variants, which are mostly irrelevant for
viewing and in particular for search.
>As imprecise as heuristic charset detection is, it is more accurate than relying on the charset tags in the html meta element (and what is in the html meta element is more accurate than what is communicated by the http protocol).
This sounds reasonable.
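Heuristic detection is also readily available in practice; a
minimal sketch (Python, using the third-party 'chardet' package as
one example of such a detector):

  import chardet

  raw = "日本語のテキスト".encode("shift_jis")
  print(chardet.detect(raw))
  # e.g. {'encoding': 'SHIFT_JIS', 'confidence': 0.99,
  #       'language': 'Japanese'}
  # -- a guess with a confidence value, not a guarantee.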
>So while I applaud your goal, I would suspect that it would be a huge amount of effort for very little return.
There are other protocols, in particular email. My understanding
is that for email, the situation is quite a bit better: people use
a dedicated tool (their MUA) to write emails, emails rarely get
transcoded on the character level, and there is no server
involved. For Web pages, in contrast, users use whatever they can
get their hands on, the server can mess things up (or help, in
some cases), and pages may get transcoded.
A reasonable conclusion from the above is that no one size fits
all. Many applications may be very well served with the
granularity of labels we have now; some others may need more
granularity. We would either have to decide to stay with the
current granularity, or to move to a system with multiple levels
of granularity.
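One way such a multi-level system could look (a sketch in Python;
the labels and groupings are purely illustrative, not a proposal
for concrete registry entries): each fine-grained variant label
carries a coarse family label that applications not interested in
variants can fall back to.

  VARIANT_TO_FAMILY = {
      "shift_jis":      "shift_jis",  # plain JIS X 0208-based
      "windows-31j":    "shift_jis",  # Microsoft extensions
      "shift_jis-2004": "shift_jis",  # JIS X 0213-based
      "utf-8":          "utf-8",
  }

  def family(label):
      # Coarse label for applications ignoring variant details.
      return VARIANT_TO_FAMILY.get(label.lower(), label.lower())

  print(family("Windows-31J"))  # 'shift_jis'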
Regards, Martin.
#-#-# Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp