Re: Best fit
There are Web pages that use \xA2\xCC in Big5, and at least one
implementation does not include this byte sequence in its Unicode
mapping table. So you end up with garbled text, e.g. a '?' or a
missing-glyph symbol, looking out of place in the middle of Chinese
text. If both mappings had been specified in the table when that
implementation was created, this problem would not have occurred. Of
course, it would have been better if only one of the Big5 encodings of
that character were in use, but that is not the case.
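
To make the failure concrete, here is a minimal Python sketch. The table
names and the decode/encode helpers are hypothetical, made up for this
illustration; only the two byte sequences and U+5341 come from the
mapping files cited below. It assumes the input consists purely of
two-byte Big5 sequences, which real Big5 text does not guarantee.

```python
# Hypothetical excerpts of Big5 "to 10646" tables. Only the two mappings
# discussed in this thread are included; real tables are far larger.
ROUND_TRIP_ONLY = {b"\xa4\x51": "\u5341"}
WITH_ONE_WAY = {b"\xa4\x51": "\u5341", b"\xa2\xcc": "\u5341"}

# The single entry marked for use in the "from 10646" direction.
PREFERRED_REVERSE = {"\u5341": b"\xa4\x51"}


def decode(data: bytes, table: dict) -> str:
    """Decode two-byte sequences, substituting '?' for unmapped pairs.

    Simplification: assumes every character is exactly two bytes.
    """
    out = []
    for i in range(0, len(data), 2):
        out.append(table.get(data[i:i + 2], "?"))
    return "".join(out)


def encode(text: str) -> bytes:
    """Encode using only the mapping marked for the reverse direction."""
    return b"".join(PREFERRED_REVERSE[ch] for ch in text)


page = b"\xa2\xcc"
print(decode(page, ROUND_TRIP_ONLY))       # prints "?"
print(decode(page, WITH_ONE_WAY) == "\u5341")  # prints True
print(encode(decode(page, WITH_ONE_WAY)))  # prints b'\xa4Q'
```

With the one-way entry present the page decodes correctly, and marking
\xA4\x51 as the reverse mapping keeps the "from 10646" direction
unambiguous.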
Erik
On 10/23/06, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> At 04:58 06/10/23, Erik van der Poel wrote:
> >I have come across an interoperability problem
>
> Can you better explain what exactly the interoperability
> problem is/how it will be solved by including non-round-trip
> mappings?
>
> Regards, Martin.
>
> >where one
> >implementation supports two mappings to a particular 10646 codepoint
> >and another implementation only supports one of those mappings, in the
> >Big5 charset:
> >
> >\xA4\x51 <-> U+5341 (reverse mapping from MS best fit table)
> >\xA2\xCC -> U+5341
> >
> >(\x introduces a Big5 byte in hex, U+ introduces a 10646 codepoint in hex)
> >
> >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
> >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
> >http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT
> >http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
> >
> >So I don't think it will be sufficient to include only the
> >round-trippable mappings. Now, if we include non-round-trip mappings,
> >we will probably have to indicate which mapping to use when converting
> >in the other direction (from 10646). This can be done in at least 2
> >different ways: mark one of the mappings in the "to 10646" table as
> >the one to use in the other direction, or provide a full "from 10646"
> >table, with or without best fit mappings depending on the outcome of
> >this discussion.
> >
> >Erik
> >
> >On 10/22/06, Mark Davis <mark.davis@icu-project.org> wrote:
> >> I agree. I think it would be far more straightforward and well-defined if all
> >> non-roundtrip mappings were excluded from the registrations.
> >>
> >> Mark
> >>
> >>
> >> On 10/22/06, Erik van der Poel <erikv@google.com> wrote:
> >> > I have to admit that Kent does make an important point here. The
> >> > example that really drives that point home is
> >> > Windows-1252-johndoesbetterfit. The best fit tables provided by
> >> > Microsoft are their own choices for mappings from the very large
> >> > Unicode set to smaller sets. Other implementors could and do come up
> >> > with other choices, depending on their particular product, target
> >> > market and current compatibility considerations.
> >> >
> >> > The most important mapping, in my view, is the one from the charset to
> >> > Unicode/10646. RFC 2978 is actually a little bit inconsistent here, in
> >> > that it mentions mappings to 10646 twice, and to/from 10646 only once.
> >> > Just look for "10646" and you will see what I mean.
> >> >
> >> > I believe my attempt to assist in the windows-1252 registration update
> >> > has revealed a lack of consensus (albeit among a very small number of
> >> > participants) regarding the "best fit" mappings. I wonder if we should
> >> > even restrict the normative/recommended 10646 mappings to the "to
> >> > 10646" mappings, making any supplied "from10646" mappings either
> >> > purely informative or maybe even unrecommended, since they appear to
> >> > be controversial.
> >> >
> >> > Erik
> >> >
> >> > On 10/22/06, Kent Karlsson <kent.karlsson14@comhem.se> wrote:
> >> > >
> >> > > Frank Ellermann wrote:
> >> > > > > ICU may have chosen 0x1A, but that was their own decision. There is
> >> > > > > no interoperability problem here
> >> > > >
> >> > > > An u2w.icu( x ) != u2w.bestfit( x ) effect could be ugly. For some
> >> > >
> >> > > As I said, the fallbacks do not belong in the registration. It should be
> >> > > perfectly ok to use other fallbacks. E.g. generating higher level markup,
> >> > > be it character escapes or more [like <sup>...</sup> for instance, or
> >> > > <span class="red">...</span>], or some "this-is-even-better-fit".
> >> > >
> >> > > The fallbacks ("bestfit") of the "bestfit" file should *NOT* be part of
> >> > > the IANA charset registration!
> >> > >
> >> > > > code pages like < http://purl.net/net/cp/858> ICU tries hard to list
> >> > > > an "official" substitution character, in that case 0x7F, not 0x1A.
> >> > >
> >> > > As I mentioned, the ICU API allows the programmer quite a lot of control
> >> > > on how to handle conversion errors. One can set it up to automatically
> >> > > generate XML-ish or Java-ish escapes (which I prefer, even if not
> >> > > targeting XML or Java), or to use another "error" character (I would
> >> > > *never* choose '?' for that). One can set up one's own callback function
> >> > > for conversion errors.
> >> > >
> >> > > > > Should we strip the best fit mappings from the table and post it
> >> > > > > somewhere?
> >> > >
> >> > > There's one already.
> >> > >
> >> > > > They're fine, but could be improved by adding a hint how they were
> >> > > > determined, and who could fix them if needed.
> >> > >
> >> > > The "bestfit" one should NOT be used for the registration. It could be
> >> > > seen as making any "better" converters (e.g. generating XML escapes)
> >> > > "non-conforming" (each requiring a different charset registration;
> >> > > 'Windows-1252-XMLescapes',
> >> > > 'Windows-1252-XMLescapes-boldnredCSS',
> >> > > 'Windows-1252-XMLescapes-boldnredCSS-butSUPforsuperscripts',
> >> > > 'Windows-1252-johndoesbetterfit', ...). I hope you don't want that.
> >> > >
> >> > > /kent k
> >> > >
> >> > >
> >> >
> >>
> >>
>
>
> #-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> #-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
>
>
- Follow-Ups:
- RE: Best fit
- From: Kent Karlsson <kent.karlsson14@comhem.se>