
RE: Best fit




http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/glibc-BIG5-2.3.3.ucm?revision=1.1&view=markup

gives 10 similar cases. They are listed as "|3 for 
the best reverse fallback Unicode scaler [sic] value".

I think such cases may well be included in the mapping tables
referred to by the IANA charset registry, as each represents a
character "more equivalent than canonically equivalent"
to the character it is mapped to.

They are not fallbacks in the sense I referred to previously;
the latter would be "|1 for the best fallback codepage byte
sequence", with a large question mark for the "best" part.
(None are given in the glibc-BIG5-2.3.3.ucm file.)
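
For illustration, here is a minimal Python sketch of how the
precision indicators could be read when loading a .ucm table.
The two sample lines are built from Erik's Big5 example quoted
below, not copied from the glibc-BIG5-2.3.3.ucm file, and
parse_ucm is a hypothetical helper, not part of ICU. It handles
only single-code-point lines and ignores every indicator other
than |0, |1 and |3.

```python
import re

# One mapping line of a .ucm file, e.g. "<U5341> \xA4\x51 |0".
# Simplification: exactly one code point per line.
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|([0-4])"
)


def parse_ucm(lines):
    """Return (to_unicode, from_unicode) dicts built from .ucm mapping lines."""
    to_unicode = {}    # codepage bytes -> Unicode code point
    from_unicode = {}  # Unicode code point -> codepage bytes
    for line in lines:
        m = UCM_LINE.match(line.strip())
        if not m:
            continue  # header, comment, or a line shape this sketch skips
        code_point = int(m.group(1), 16)
        raw = bytes.fromhex(m.group(2).replace("\\x", ""))
        flag = int(m.group(3))
        if flag in (0, 3):   # round trip, or reverse fallback (bytes -> Unicode only)
            to_unicode[raw] = code_point
        if flag in (0, 1):   # round trip, or fallback (Unicode -> bytes only)
            from_unicode[code_point] = raw
    return to_unicode, from_unicode


# Assumed sample lines, modelled on the Big5 case discussed in this thread.
SAMPLE = [
    r"<U5341> \xA4\x51 |0",
    r"<U5341> \xA2\xCC |3",
]

to_u, from_u = parse_ucm(SAMPLE)
```

With these two lines, both \xA4\x51 and \xA2\xCC decode to U+5341,
while U+5341 encodes only to \xA4\x51.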

(I haven't scanned the 1003 other ucm files in
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/.)

	/kent k

> -----Original Message-----
> From: Erik van der Poel [mailto:erikv@google.com] 
> Sent: Monday, October 23, 2006 4:27 PM
> To: Martin Duerst
> Cc: Mark Davis; Kent Karlsson; Frank Ellermann; 
> ietf-charsets@mail.apps.ietf.org
> Subject: Re: Best fit
> 
> 
> There are Web pages out there that use \xA2\xCC in Big5, and there is
> at least one implementation out there that does not include this in
> its Unicode mapping table. So you end up with garbled text, e.g. a '?'
> question mark or missing glyph symbol, looking out of place in the
> middle of Chinese text. If both mappings had been specified in the
> table at the time that the implementation was created, then this
> problem would not have occurred. Of course, it would have been better
> if only one of the Big5 encodings of that character were in use, but
> this is, in fact, not the case.
> 
> Erik
> 
> On 10/23/06, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> > At 04:58 06/10/23, Erik van der Poel wrote:
> > >I have come across an interoperability problem
> >
> > Can you better explain what exactly the interoperability
> > problem is/how it will be solved by including non-round-trip
> > mappings?
> >
> > Regards,     Martin.
> >
> > >where one
> > >implementation supports two mappings to a particular 10646 codepoint
> > >and another implementation only supports one of those mappings, in the
> > >Big5 charset:
> > >
> > >\xA4\x51 <-> U+5341 (reverse mapping from MS best fit table)
> > >\xA2\xCC -> U+5341
> > >
> > >(\x introduces a Big5 byte in hex, U+ introduces a 10646 codepoint in hex)
> > >
> > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
> > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
> > >http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT
> > >http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
> > >
> > >So I don't think it will be sufficient to include only the
> > >round-trippable mappings. Now, if we include non-round-trip mappings,
> > >we will probably have to indicate which mapping to use when converting
> > >in the other direction (from 10646). This can be done in at least 2
> > >different ways: mark one of the mappings in the "to 10646" table as
> > >the one to use in the other direction, or provide a full "from 10646"
> > >table, with or without best fit mappings depending on the outcome of
> > >this discussion.
> > >
> > >Erik
> > >
> > >On 10/22/06, Mark Davis <mark.davis@icu-project.org> wrote:
> > >> I agree I think it would be far more straightforward and well-defined if all
> > >> non-roundtrip mappings were excluded from the registrations.
> > >>
> > >> Mark
> > >>
> > >>
> > >> On 10/22/06, Erik van der Poel <erikv@google.com> wrote:
> > >> > I have to admit that Kent does make an important point here. The
> > >> > example that really drives that point home is
> > >> > Windows-1252-johndoesbetterfit. The best fit tables provided by
> > >> > Microsoft are their own choices for mappings from the very large
> > >> > Unicode set to smaller sets. Other implementors could and do come up
> > >> > with other choices, depending on their particular product, target
> > >> > market and current compatibility considerations.
> > >> >
> > >> > The most important mapping, in my view, is the one from the charset to
> > >> > Unicode/10646. RFC 2978 is actually a little bit inconsistent here, in
> > >> > that it mentions mappings to 10646 twice, and to/from 10646 only once.
> > >> > Just look for "10646" and you will see what I mean.
> > >> >
> > >> > I believe my attempt to assist in the windows-1252 registration update
> > >> > has revealed a lack of consensus (albeit among a very small number of
> > >> > participants) regarding the "best fit" mappings. I wonder if we should
> > >> > even restrict the normative/recommended 10646 mappings to the "to
> > >> > 10646" mappings, making any supplied "from10646" mappings either
> > >> > purely informative or maybe even unrecommended, since they appear to
> > >> > be controversial.
> > >> >
> > >> > Erik
> > >> >
> > >> > On 10/22/06, Kent Karlsson <kent.karlsson14@comhem.se> wrote:
> > >> > >
> > >> > > Frank Ellermann wrote:
> > >> > > > > ICU may have chosen 0x1A, but that was their own decision. There is
> > >> > > > > no interoperability problem here
> > >> > > >
> > >> > > > An u2w.icu( x ) != u2w.bestfit( x ) effect could be ugly.  For some
> > >> > >
> > >> > > As I said, the fallbacks do not belong in the registration. It should be
> > >> > > perfectly ok to use other fallbacks. E.g. generating higher level
> > >> > > markup, be it character escapes or more [like <sup>...</sup> for
> > >> > > instance, or <span class="red">...</span>], or some
> > >> > > "this-is-even-better-fit".
> > >> > >
> > >> > > The fallbacks ("bestfit") of the "bestfit" file should *NOT* be part of
> > >> > > the IANA charset registration!
> > >> > >
> > >> > > > code pages like <http://purl.net/net/cp/858> ICU tries hard to list
> > >> > > > an "official" substitution character, in that case 0x7F, not 0x1A.
> > >> > >
> > >> > > As I mentioned, the ICU API allows the programmer quite a lot of control
> > >> > > over how to handle conversion errors. One can set it up to automatically
> > >> > > generate XML-ish or Java-ish escapes (which I prefer, even if not
> > >> > > targeting XML or Java), or to use another "error" character (I would
> > >> > > *never* choose '?' for that). One can set up one's own callback function
> > >> > > for conversion errors.
> > >> > >
> > >> > > > > Should we strip the best fit mappings from the table and post it
> > >> > > > > somewhere?
> > >> > >
> > >> > > There's one already.
> > >> > >
> > >> > > > They're fine, but could be improved by adding a hint how they were
> > >> > > > determined, and who could fix them if needed.
> > >> > >
> > >> > > The "bestfit" one should NOT be used for the registration. It could be
> > >> > > seen as making any "better" converters (e.g. generating XML escapes)
> > >> > > "non-conforming" (each requiring a different charset registration;
> > >> > > 'Windows-1252-XMLescapes', 'Windows-1252-XMLescapes-boldnredCSS',
> > >> > > 'Windows-1252-XMLescapes-boldnredCSS-butSUPforsuperscripts',
> > >> > > 'Windows-1252-johndoesbetterfit', ...). I hope you don't want that.
> > >> > >
> > >> > >                 /kent k
> > >> > >
> > >> > >
> > >> >
> > >>
> > >>
> >
> >
> > #-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> > #-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
> >
> >
>