
Re: Problems (and non-problems) in charset registry



Martin Duerst wrote:

> I think it's also fair to say that what Mark lists as
> problems may not in all cases actually be problems.

Anything he said can be done with CharMapML.  I can't judge
your Shift_JIS example, but for windows-1252 it's pretty
obvious:

Probably the owners want to be free to do whatever pleases
them with the remaining five unassigned code points if
necessary.  CharMapML can express that with its versions.

And "we" know that it uses 1:1 mappings in practice for these
five code points; CharMapML could express that as fallback
mappings.

If the most recent mapping can use the preferred MIME name
it's good enough.  Purists worried about persistence will
please convert it to one of the less obscure UTFs, and store
the document in that form.

> It's clear that for really faulty charts, the vendors should
> be blamed, and not the registry.

Nothing's wrong with the charts; there are two legitimate
aspects.  One is "we'll continue to call it windows-1252 even
if we assign one of the five code points" (if that happens
anytime soon, before windows-1252 is history like all legacy
charsets).

Until then the other aspect is "just use u+0081 for 0x81",
because it's KISS.  If you insist, throw an error instead; the
user can then decide whether the document is completely
mislabelled.
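
A rough Python sketch of those two options, only to illustrate
the idea.  The function name and the `strict` flag are made up
for this example, and it leans on Python's own "cp1252" codec
for the assigned code points:

```python
# The five bytes left unassigned in windows-1252.
UNASSIGNED_1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def decode_windows_1252(data: bytes, strict: bool = False) -> str:
    """Decode windows-1252, handling the five unassigned bytes explicitly.

    strict=False: map each unassigned byte 1:1 (0x81 -> U+0081, ...).
    strict=True:  raise ValueError so the caller can decide what to do.
    """
    out = []
    for offset, byte in enumerate(data):
        if byte in UNASSIGNED_1252:
            if strict:
                raise ValueError(
                    f"unassigned byte 0x{byte:02X} at offset {offset}"
                )
            out.append(chr(byte))  # 1:1 fallback mapping
        else:
            # Assigned code points: use the stock codec table.
            out.append(bytes([byte]).decode("cp1252"))
    return "".join(out)
```

With the default, b"\x80\x81" comes out as U+20AC followed by
U+0081; with strict=True the same input raises an error.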

> For most applications (not for all necessarily), it would
> be a mistake to include error processing in the formal
> definition of an encoding.

If I saw a perfectly legal u+0080 in Latin-1 I'd guess that
this must be an error; probably the document is windows-1252.
Today "claims to be Latin-1" is as convincing as "claims to
be ASCII" when RFC 1341 was written.  Nothing to do for the
registry, it offers windows-1252 for those who want to get it
right.
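
That guess is easy to write down.  A sketch in Python, where
the function name and the set of accepted labels are only
assumptions for the example, and the rule is nothing more than
"C1 bytes in something labelled Latin-1 are suspicious":

```python
LATIN1_LABELS = {"iso-8859-1", "latin-1", "latin1"}


def guess_label(data: bytes, declared: str = "ISO-8859-1") -> str:
    """Return a better guess for the charset label of `data`.

    Bytes 0x80..0x9F are legal C1 controls in Latin-1, but they
    are rare in real text, so their presence hints at windows-1252.
    """
    if declared.lower() in LATIN1_LABELS:
        if any(0x80 <= b <= 0x9F for b in data):
            return "windows-1252"
    return declared
```

It's a heuristic for the receiving application, not something
the registry would have to define.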

That may be a good reason why registering 8859-11 but not
windows-874 isn't ideal: we're not interested in folks using
8859-11 if what they really mean is windows-874.

> It seems that what you would want, for your purposes, is to
> use a new label if a new character gets added to a legacy
> encoding

That's a very clean solution with its own drawbacks: my box
insists on saying 1004 for windows-1252, my browser claims
erroneously that this is Latin-1, and my box says 850 when it
means 858.  That's my local business; I know where to fix it
before it hits others.

> not use a new label e.g. for UTF-8 each time a character gets
> added.  So things would be somewhat case-by-case.

Yes.  For BOCU-1, SCSU, and the two UTF-*BE we don't need a
mapping.  We also don't "need" it for UTF-1, UTF-8, and the
two UTF-*LE, but it's possible, e.g. about 1100 (long) lines
for UTF-16LE, or 3200 folded lines in a CharMapML mapping.

I like its <range> element - it took me some time to
understand it, but UTF-8 in 48 folded lines is really nice.

 [0x1A]
> This would be particularly prominent when converting from
> Unicode to a legacy encoding, because in this case there are
> tons of codepoints that can't be converted. But this most
> probably should not be part of the definition of a 'charset'.

CharMapML offers to specify a legacy SUB.  Applications can
offer to use another legacy character like 0x7F or '?', or
throw an error.  Or <shudder> silently drop it </shudder> -
but that's IMO on the wrong side of the border to "broken".
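
For illustration, a minimal Python sketch of the "substitute or
throw" choice.  The function name and the `sub` parameter are
invented here, and encoding character by character assumes a
stateless legacy charset:

```python
from typing import Optional


def encode_legacy(text: str, charset: str,
                  sub: Optional[bytes] = b"\x1a") -> bytes:
    """Encode `text`, replacing unmappable characters with `sub`.

    sub=b"\\x1a" uses the legacy SUB, sub=b"?" a visible marker,
    and sub=None re-raises the error instead of substituting.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            if sub is None:
                raise
            out += sub  # never silently dropped
    return bytes(out)
```

For example, encoding "a", u+20AC, "b" to Latin-1 gives
0x61 0x1A 0x62 by default, and 0x61 0x3F 0x62 with sub=b"?".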

It's mildly interesting to minimize the reported errors, but
the implementation details are irrelevant for the registry.
I'd like to get "official" mappings for most registered
charsets, and "pull the *.ucm mappings out of ICU, check that
they're okay, and host them at IANA" could make sense.  The
registry format could stay mostly as is, adding URLs of
"official" mappings to checked entries.

Maybe join some entries, csUnicode + UTF-16, csUCS4 + UTF-32,
the works.  Or explain what the difference is supposed to be.

Frank