Re: Best fit



I have come across an interoperability problem where one
implementation supports two mappings to a particular 10646 codepoint
and another implementation only supports one of those mappings, in the
Big5 charset:

\xA4\x51 <-> U+5341 (reverse mapping from MS best fit table)
\xA2\xCC -> U+5341

(\x introduces a Big5 byte in hex, U+ introduces a 10646 codepoint in hex)

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

So I don't think it will be sufficient to include only the
round-trippable mappings. Now, if we include non-round-trip mappings,
we will probably have to indicate which mapping to use when converting
in the other direction (from 10646). This can be done in at least 2
different ways: mark one of the mappings in the "to 10646" table as
the one to use in the other direction, or provide a full "from 10646"
table, with or without best fit mappings depending on the outcome of
this discussion.
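
To make the two options concrete, here is a minimal Python sketch that uses
only the two Big5 entries quoted above. The table layout, the boolean
"preferred" flag and the function names are hypothetical, chosen for
illustration; they are not taken from any registration or vendor table.

```python
# Option 1: a "to 10646" table in which one entry per codepoint is
# marked as the one to use in the other direction (hypothetical layout).
# Key: Big5 bytes. Value: (10646 codepoint, use_for_reverse).
TO_10646 = {
    b"\xA4\x51": (0x5341, True),   # round-trip mapping
    b"\xA2\xCC": (0x5341, False),  # one-way mapping, "to 10646" only
}


def decode(big5_bytes):
    """Map one Big5 byte sequence to its 10646 codepoint."""
    return TO_10646[big5_bytes][0]


def build_from_10646(table):
    """Derive the reverse table from the entries marked as preferred."""
    reverse = {}
    for big5_bytes, (codepoint, preferred) in table.items():
        if not preferred:
            continue
        if codepoint in reverse:
            # Two preferred entries for one codepoint would be ambiguous.
            raise ValueError("more than one preferred mapping for U+%04X" % codepoint)
        reverse[codepoint] = big5_bytes
    return reverse


FROM_10646 = build_from_10646(TO_10646)

# Option 2: a full "from 10646" table supplied separately (hypothetical).
FROM_10646_EXPLICIT = {
    0x5341: b"\xA4\x51",
}
```

With either option, \xA2\xCC still converts to U+5341, but U+5341 always
converts back to \xA4\x51, so \xA2\xCC does not survive a round trip.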

Erik

On 10/22/06, Mark Davis <mark.davis@icu-project.org> wrote:
> I agree. I think it would be far more straightforward and well-defined if all
> non-roundtrip mappings were excluded from the registrations.
>
> Mark
>
>
> On 10/22/06, Erik van der Poel <erikv@google.com> wrote:
> > I have to admit that Kent does make an important point here. The
> > example that really drives that point home is
> > Windows-1252-johndoesbetterfit. The best fit tables provided by
> > Microsoft are their own choices for mappings from the very large
> > Unicode set to smaller sets. Other implementors could and do come up
> > with other choices, depending on their particular product, target
> > market and current compatibility considerations.
> >
> > The most important mapping, in my view, is the one from the charset to
> > Unicode/10646. RFC 2978 is actually a little bit inconsistent here, in
> > that it mentions mappings to 10646 twice, and to/from 10646 only once.
> > Just look for "10646" and you will see what I mean.
> >
> > I believe my attempt to assist in the windows-1252 registration update
> > has revealed a lack of consensus (albeit among a very small number of
> > participants) regarding the "best fit" mappings. I wonder if we should
> > even restrict the normative/recommended 10646 mappings to the "to
> > 10646" mappings, making any supplied "from10646" mappings either
> > purely informative or maybe even unrecommended, since they appear to
> > be controversial.
> >
> > Erik
> >
> > On 10/22/06, Kent Karlsson <kent.karlsson14@comhem.se> wrote:
> > >
> > > Frank Ellermann wrote:
> > > > > ICU may have chosen 0x1A, but that was their own decision. There is
> > > > > no interoperability problem here
> > > >
> > > > An u2w.icu( x ) != u2w.bestfit( x ) effect could be ugly.  For some
> > >
> > > As I said, the fallbacks do not belong in the registration. It should be
> > > perfectly ok to use other fallbacks. E.g. generating higher level markup,
> > > be it character escapes or more [like <sup>...</sup> for instance, or
> > > <span class="red">...</span>], or some "this-is-even-better-fit".
> > >
> > > The fallbacks ("bestfit") of the "bestfit" file should *NOT* be part of
> > > the IANA charset registration!
> > >
> > > > code pages like < http://purl.net/net/cp/858> ICU tries hard to list
> > > > an "official" substitution character, in that case 0x7F, not 0x1A.
> > >
> > > As I mentioned, the ICU API allows the programmer quite a lot of control
> > > over how to handle conversion errors. One can set it up to automatically
> > > generate XML-ish or Java-ish escapes (which I prefer, even if not
> > > targeting XML or Java), or to use another "error" character (I would
> > > *never* choose '?' for that). One can set up one's own callback function
> > > for conversion errors.
> > >
> > > > > Should we strip the best fit mappings from the table and post it
> > > > > somewhere?
> > >
> > > There's one already.
> > >
> > > > They're fine, but could be improved by adding a hint how they were
> > > > determined, and who could fix them if needed.
> > >
> > > The "bestfit" one should NOT be used for the registration. It could be
> > > seen as making any "better" converters (e.g. generating XML escapes)
> > > "non-conforming" (each requiring a different charset registration;
> > > 'Windows-1252-XMLescapes',
> > > 'Windows-1252-XMLescapes-boldnredCSS',
> > > 'Windows-1252-XMLescapes-boldnredCSS-butSUPforsuperscripts',
> > > 'Windows-1252-johndoesbetterfit', ...). I hope you don't want that.
> > >
> > >                 /kent k
> > >
> > >
> >
>
>