Re: Best fit



Hello Erik,

I think I agree with you.

I'm sorry I forgot about the BIG5 anomaly of a few instances of
encoding the same character twice. I think in such a case, both
mappings should be listed, and a comment mentioning the anomaly
should be added. The number of such cases, as far as I know, is
very small, both in terms of affected charsets as well as in
terms of affected characters. In most cases, it is due to an
error when designing the charset; in some cases it is due to
a design guideline that differs from Unicode.

I think this is very different from "best fit" fallback mechanisms,
which are really up to the application to invoke (or avoid).
Such mappings should definitely not be listed in equivalence
tables.
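
As a rough illustration of listing both mappings while keeping the
reverse direction unambiguous, here is a minimal sketch. It uses the
two Big5 byte sequences for U+5341 quoted further down in this thread.
The table layout and the names TO_10646, build_tables, to_10646 and
from_10646 are hypothetical; they are not taken from any registry
format or existing implementation.

```python
# Hypothetical table layout: (Big5 bytes, 10646 code point, round_trip).
# Both entries come from the example quoted later in this thread.
TO_10646 = [
    (b"\xA4\x51", 0x5341, True),   # regular round-trip mapping
    (b"\xA2\xCC", 0x5341, False),  # anomaly: duplicate encoding, decode only
]


def build_tables(entries):
    """Build a decode table with every entry and an encode table
    containing only the entries marked as round-trip."""
    decode = {}
    encode = {}
    for raw, code_point, round_trip in entries:
        decode[raw] = code_point
        if round_trip:
            if code_point in encode:
                raise ValueError("two round-trip mappings for U+%04X" % code_point)
            encode[code_point] = raw
    return decode, encode


DECODE, ENCODE = build_tables(TO_10646)


def to_10646(raw):
    """Convert one Big5 byte sequence to a one-character string."""
    return chr(DECODE[raw])


def from_10646(char):
    """Convert one character back to Big5 using round-trip entries only."""
    return ENCODE[ord(char)]
```

With such a table, both byte sequences decode to the same character,
and the reverse direction always yields the entry marked as round-trip.
No "best fit" entries appear anywhere in it.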

Regards,   Martin.

At 02:05 06/10/24, Erik van der Poel wrote:
>It is quite clear that an implementor must take non-round-trippable
>mappings from the charset to 10646 into account, in order to
>interoperate with other implementations. It may not be as clear that
>IANA should take these into account, but it would be in the spirit of
>the IETF, since that organization is quite concerned about
>interoperability.
>
>Now, in the other direction, i.e. from 10646 to the charset, re:
>"u2w.icu( x ) != u2w.bestfit( x )", it may not be so easy to come up
>with scenarios where an implementor would have to mimic another
>implementation. One contrived scenario might be a kind of gateway that
>converts from a Unicode-based encoding to windows-*, using
>Microsoft-chosen best fit mappings. If another implementor wanted to
>replace that gateway, they would have to know the best fit mappings.
>
>If no one can come up with realistic scenarios requiring the knowledge
>of best fit mappings in the "from 10646" direction, and if there is
>consensus that the non-round-trippable mappings in the other direction
>(to 10646) are important, perhaps it is time to discuss updating RFC
>2978 (or adding another erratum).
>
>http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=2978&
>
>Erik
>
>On 10/23/06, Kent Karlsson <kent.karlsson14@comhem.se> wrote:
>>
>> http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/glibc-BIG5-2.3.3.ucm?revision=1.1&view=markup
>>
>> gives 10 similar cases. They are listed as "|3 for
>> the best reverse fallback Unicode scaler [sic] value".
>>
>> I think such cases may well be included in IANA charset
>> registry (referred) mapping tables, as they represent a
>> character "more equivalent than canonically equivalent"
>> to the character it is mapped to.
>>
>> They are not fallbacks in the sense I referred to previously;
>> the latter would be "|1 for the best fallback codepage byte
>> sequence", with a large question mark for the "best" part.
>> (None are given in the glibc-BIG5-2.3.3.ucm file.)
>>
>> (I haven't scanned the 1003 other ucm files in
>> http://dev.icu-project.org/cgi-bin/viewcvs.cgi/charset/data/ucm/.)
>>
>>         /kent k
>>
>> > -----Original Message-----
>> > From: Erik van der Poel [mailto:erikv@google.com]
>> > Sent: Monday, October 23, 2006 4:27 PM
>> > To: Martin Duerst
>> > Cc: Mark Davis; Kent Karlsson; Frank Ellermann;
>> > ietf-charsets@mail.apps.ietf.org
>> > Subject: Re: Best fit
>> >
>> >
>> > There are Web pages out there that use \xA2\xCC in Big5, and there is
>> > at least one implementation out there that does not include this in
>> > its Unicode mapping table. So you end up with garbled text, e.g. a '?'
>> > question mark or missing glyph symbol, looking out of place in the
>> > middle of Chinese text. If both mappings had been specified in the
>> > table at the time that the implementation was created, then this
>> > problem would not have occurred. Of course, it would have been better
>> > if only one of the Big5 encodings of that character were in use, but
>> > this is, in fact, not the case.
>> >
>> > Erik
>> >
>> > On 10/23/06, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
>> > > At 04:58 06/10/23, Erik van der Poel wrote:
>> > > >I have come across an interoperability problem
>> > >
>> > > Can you better explain what exactly the interoperability
>> > > problem is/how it will be solved by including non-round-trip
>> > > mappings?
>> > >
>> > > Regards,     Martin.
>> > >
>> > > >where one
>> > > >implementation supports two mappings to a particular 10646 codepoint
>> > > >and another implementation only supports one of those mappings, in the
>> > > >Big5 charset:
>> > > >
>> > > >\xA4\x51 <-> U+5341 (reverse mapping from MS best fit table)
>> > > >\xA2\xCC -> U+5341
>> > > >
>> > > >(\x introduces a Big5 byte in hex, U+ introduces a 10646 codepoint in hex)
>> > > >
>> > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
>> > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
>> > > >http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT
>> > > >http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
>> > > >
>> > > >So I don't think it will be sufficient to include only the
>> > > >round-trippable mappings. Now, if we include non-round-trip mappings,
>> > > >we will probably have to indicate which mapping to use when converting
>> > > >in the other direction (from 10646). This can be done in at least 2
>> > > >different ways: mark one of the mappings in the "to 10646" table as
>> > > >the one to use in the other direction, or provide a full "from 10646"
>> > > >table, with or without best fit mappings depending on the outcome of
>> > > >this discussion.
>> > > >



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp