
Re: Best fit



It is quite clear that an implementor must take non-round-trippable
mappings from the charset to 10646 into account, in order to
interoperate with other implementations. It may not be as clear that
IANA should take these into account, but it would be in the spirit of
the IETF, since that organization is quite concerned about
interoperability.
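
To make the "to 10646" case concrete, here is a minimal sketch in Python. It
uses only the two Big5 byte sequences quoted further down in this thread. The
table names and the decode() helper are made up for illustration, and the
decoder assumes the input consists solely of two-byte sequences, which real
Big5 text does not.

```python
# Hypothetical "to 10646" tables. Only the two Big5 sequences cited in this
# thread are included; a real table has thousands of entries.
ROUND_TRIP_ONLY = {b"\xa4\x51": "\u5341"}
WITH_NON_ROUND_TRIP = {b"\xa4\x51": "\u5341", b"\xa2\xcc": "\u5341"}


def decode(data, table, replacement="?"):
    """Decode two-byte sequences with `table`; unmapped pairs become `replacement`."""
    out = []
    for i in range(0, len(data), 2):
        out.append(table.get(data[i:i + 2], replacement))
    return "".join(out)


# A page that uses both encodings of the same character.
page = b"\xa4\x51\xa2\xcc"

garbled = decode(page, ROUND_TRIP_ONLY)      # second pair falls back to '?'
intact = decode(page, WITH_NON_ROUND_TRIP)   # both pairs decode to U+5341
```

The decoder that only knows the round-trippable entry emits a '?' for the
second pair, which is the garbling described below.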

Now, in the other direction, i.e. from 10646 to the charset, re:
"u2w.icu( x ) != u2w.bestfit( x )", it may not be so easy to come up
with scenarios where an implementor would have to mimic another
implementation. One contrived scenario might be a kind of gateway that
converts from a Unicode-based encoding to windows-*, using
Microsoft-chosen best fit mappings. If another implementor wanted to
replace that gateway, they would have to know the best fit mappings.
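
A rough sketch of that gateway scenario in Python follows. The EXACT and
BEST_FIT entries are assumptions chosen for illustration, not copied from any
vendor file, and both function names are hypothetical. The 0x1A substitution
byte is the one mentioned for ICU elsewhere in this thread.

```python
# Exact (round-trippable) entries shared by both converters; a tiny
# illustrative subset of a windows-1252-style table.
EXACT = {"A": 0x41, "\u00e9": 0xE9}

# Hypothetical best-fit entry (A WITH MACRON -> 'A'), assumed for illustration.
BEST_FIT = {"\u0100": 0x41}

SUBSTITUTE = 0x1A  # substitution byte used when nothing maps


def u2w_bestfit(text):
    """Encode using exact entries first, then best-fit, then substitution."""
    return bytes(EXACT.get(ch, BEST_FIT.get(ch, SUBSTITUTE)) for ch in text)


def u2w_substitute(text):
    """Encode using exact entries only; everything else is substituted."""
    return bytes(EXACT.get(ch, SUBSTITUTE) for ch in text)


# The two converters agree on exactly-mapped text and differ on the rest.
same = u2w_bestfit("A\u00e9") == u2w_substitute("A\u00e9")
differ = u2w_bestfit("\u0100") != u2w_substitute("\u0100")
```

A drop-in replacement for the first converter would need the best-fit table
to produce byte-identical output; for round-trippable input the two already
agree.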

If no one can come up with realistic scenarios requiring the knowledge
of best fit mappings in the "from 10646" direction, and if there is
consensus that the non-round-trippable mappings in the other direction
(to 10646) are important, perhaps it is time to discuss updating RFC
2978 (or adding another erratum).

http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=2978

Erik

On 10/23/06, Kent Karlsson <kent.karlsson14@comhem.se> wrote:
>
> http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/glibc-BIG5-2.3.3.ucm?revision=1.1&view=markup
>
> gives 10 similar cases. They are listed as "|3 for
> the best reverse fallback Unicode scaler [sic] value".
>
> I think such cases may well be included in IANA charset
> registry (referred) mapping tables, as they represent a
> character "more equivalent than canonically equivalent"
> to the character it is mapped to.
>
> They are not fallbacks in the sense I referred to previously;
> the latter would be "|1 for the best fallback codepage byte
> sequence", with a large question mark for the "best" part.
> (None are given in the glibc-BIG5-2.3.3.ucm file.)
>
> (I haven't scanned the 1003 other ucm files in
> http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/.)
>
>         /kent k
>
> > -----Original Message-----
> > From: Erik van der Poel [mailto:erikv@google.com]
> > Sent: Monday, October 23, 2006 4:27 PM
> > To: Martin Duerst
> > Cc: Mark Davis; Kent Karlsson; Frank Ellermann;
> > ietf-charsets@mail.apps.ietf.org
> > Subject: Re: Best fit
> >
> >
> > There are Web pages out there that use \xA2\xCC in Big5, and there is
> > at least one implementation out there that does not include this in
> > its Unicode mapping table. So you end up with garbled text, e.g. a '?'
> > question mark or missing glyph symbol, looking out of place in the
> > middle of Chinese text. If both mappings had been specified in the
> > table at the time that the implementation was created, then this
> > problem would not have occurred. Of course, it would have been better
> > if only one of the Big5 encodings of that character were in use, but
> > this is, in fact, not the case.
> >
> > Erik
> >
> > On 10/23/06, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> > > At 04:58 06/10/23, Erik van der Poel wrote:
> > > >I have come across an interoperability problem
> > >
> > > Can you better explain what exactly the interoperability
> > > problem is/how it will be solved by including non-round-trip
> > > mappings?
> > >
> > > Regards,     Martin.
> > >
> > > >where one
> > > > >implementation supports two mappings to a particular 10646 codepoint
> > > > >and another implementation only supports one of those mappings, in the
> > > > >Big5 charset:
> > > >
> > > >\xA4\x51 <-> U+5341 (reverse mapping from MS best fit table)
> > > >\xA2\xCC -> U+5341
> > > >
> > > > >(\x introduces a Big5 byte in hex, U+ introduces a 10646 codepoint in hex)
> > > >
> > >
> > > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
> > > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
> > > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT
> > > > >http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
> > > >
> > > >So I don't think it will be sufficient to include only the
> > > > >round-trippable mappings. Now, if we include non-round-trip mappings,
> > > > >we will probably have to indicate which mapping to use when converting
> > > > >in the other direction (from 10646). This can be done in at least 2
> > > > >different ways: mark one of the mappings in the "to 10646" table as
> > > > >the one to use in the other direction, or provide a full "from 10646"
> > > > >table, with or without best fit mappings depending on the outcome of
> > > > >this discussion.
> > > >
> > > >Erik
> > > >
> > > >On 10/22/06, Mark Davis <mark.davis@icu-project.org> wrote:
> > > > >> I agree. I think it would be far more straightforward and well-defined
> > > > >> if all non-roundtrip mappings were excluded from the registrations.
> > > >>
> > > >> Mark
> > > >>
> > > >>
> > > >> On 10/22/06, Erik van der Poel <erikv@google.com> wrote:
> > > > >> > I have to admit that Kent does make an important point here. The
> > > > >> > example that really drives that point home is
> > > > >> > Windows-1252-johndoesbetterfit. The best fit tables provided by
> > > > >> > Microsoft are their own choices for mappings from the very large
> > > > >> > Unicode set to smaller sets. Other implementors could and do come up
> > > > >> > with other choices, depending on their particular product, target
> > > > >> > market and current compatibility considerations.
> > > >> >
> > > > >> > The most important mapping, in my view, is the one from the charset to
> > > > >> > Unicode/10646. RFC 2978 is actually a little bit inconsistent here, in
> > > > >> > that it mentions mappings to 10646 twice, and to/from 10646 only once.
> > > > >> > Just look for "10646" and you will see what I mean.
> > > >> >
> > > > >> > I believe my attempt to assist in the windows-1252 registration update
> > > > >> > has revealed a lack of consensus (albeit among a very small number of
> > > > >> > participants) regarding the "best fit" mappings. I wonder if we should
> > > > >> > even restrict the normative/recommended 10646 mappings to the "to
> > > > >> > 10646" mappings, making any supplied "from 10646" mappings either
> > > > >> > purely informative or maybe even unrecommended, since they appear to
> > > > >> > be controversial.
> > > >> >
> > > >> > Erik
> > > >> >
> > > >> > On 10/22/06, Kent Karlsson <kent.karlsson14@comhem.se> wrote:
> > > >> > >
> > > >> > > Frank Ellermann wrote:
> > > > >> > > > > ICU may have chosen 0x1A, but that was their own decision. There is
> > > > >> > > > > no interoperability problem here
> > > > >> > > >
> > > > >> > > > An u2w.icu( x ) != u2w.bestfit( x ) effect could be ugly.  For some
> > > >> > >
> > > > >> > > As I said, the fallbacks do not belong in the registration. It should be
> > > > >> > > perfectly ok to use other fallbacks. E.g. generating higher level markup,
> > > > >> > > be it character escapes or more [like <sup>...</sup> for instance, or
> > > > >> > > <span class="red">...</span>], or some "this-is-even-better-fit".
> > > >> > >
> > > > >> > > The fallbacks ("bestfit") of the "bestfit" file should *NOT* be part of
> > > > >> > > the IANA charset registration!
> > > >> > >
> > > > >> > > > code pages like <http://purl.net/net/cp/858> ICU tries hard to list
> > > > >> > > > an "official" substitution character, in that case 0x7F, not 0x1A.
> > > >> > >
> > > > >> > > As I mentioned, the ICU API allows the programmer quite a lot of control
> > > > >> > > on how to handle conversion errors. One can set it up to automatically
> > > > >> > > generate XML-ish or Java-ish escapes (which I prefer, even if not
> > > > >> > > targeting XML or Java), or to use another "error" character (I would
> > > > >> > > *never* choose '?' for that). One can set up one's own callback function
> > > > >> > > for conversion errors.
> > > >> > >
> > > > >> > > > > Should we strip the best fit mappings from the table and post it
> > > > >> > > > > somewhere?
> > > >> > >
> > > >> > > There's one already.
> > > >> > >
> > > > >> > > > They're fine, but could be improved by adding a hint how they were
> > > > >> > > > determined, and who could fix them if needed.
> > > >> > >
> > > > >> > > The "bestfit" one should NOT be used for the registration. It could be
> > > > >> > > seen as making any "better" converters (e.g. generating XML escapes)
> > > > >> > > "non-conforming" (each requiring a different charset registration;
> > > > >> > > 'Windows-1252-XMLescapes',
> > > > >> > > 'Windows-1252-XMLescapes-boldnredCSS',
> > > > >> > > 'Windows-1252-XMLescapes-boldnredCSS-butSUPforsuperscripts',
> > > > >> > > 'Windows-1252-johndoesbetterfit', ...). I hope you don't want that.
> > > >> > >
> > > >> > >                 /kent k
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > >
> > >
> > > #-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> > > > #-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
> > >
> > >
> >
>
>