
Re: Best fit

To: Mark Davis <mark.davis@icu-project.org>
Subject: Re: Best fit
From: Erik van der Poel <erikv@google.com>
Date: Sun, 22 Oct 2006 12:58:45 -0700
Cc: Kent Karlsson <kent.karlsson14@comhem.se>, Frank Ellermann <nobody@xyzzy.claranet.de>, ietf-charsets@mail.apps.ietf.org
In-reply-to: <30b660a20610221041g404a396eg5cb1f826a58b0157@mail.gmail.com>
References: <453ACC02.1D95@xyzzy.claranet.de><001c01c6f5c9$fcffec20$6500a8c0@chalmers95a69n><c07a32650610220957s3afc4eccr3e628b2d018e9b00@mail.gmail.com><30b660a20610221041g404a396eg5cb1f826a58b0157@mail.gmail.com>

I have come across an interoperability problem where one
implementation supports two mappings to a particular 10646 codepoint
and another implementation only supports one of those mappings, in the
Big5 charset:

\xA4\x51 <-> U+5341 (reverse mapping from MS best fit table)
\xA2\xCC -> U+5341

(\x introduces a Big5 byte in hex, U+ introduces a 10646 codepoint in hex)
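
To make the problem concrete, here is a minimal sketch of the mismatch. The two table names and the decode() helper are hypothetical, and the tables hold only the two byte sequences listed above rather than any vendor's full mapping:

```python
# Hypothetical "to 10646" tables for two implementations. Only the two
# Big5 byte sequences discussed in this message are included.
IMPL_A_TO_10646 = {
    b"\xA4\x51": 0x5341,  # round-trip mapping
    b"\xA2\xCC": 0x5341,  # second, one-way mapping to the same codepoint
}
IMPL_B_TO_10646 = {
    b"\xA4\x51": 0x5341,  # supports only the round-trip mapping
}


def decode(table, big5_bytes):
    """Return the 10646 codepoint for a byte sequence, or None if unmapped."""
    return table.get(big5_bytes)


# Text containing \xA2\xCC decodes in A but is unmapped in B.
for name, table in (("A", IMPL_A_TO_10646), ("B", IMPL_B_TO_10646)):
    cp = decode(table, b"\xA2\xCC")
    print(name, "unmapped" if cp is None else f"U+{cp:04X}")
```

A registration that listed only the round-trippable mapping would describe implementation B, and data produced for implementation A could fail to decode.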

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CHINTRAD.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

So I don't think it will be sufficient to include only the
round-trippable mappings. Now, if we include non-round-trip mappings,
we will probably have to indicate which mapping to use when converting
in the other direction (from 10646). This can be done in at least 2
different ways: mark one of the mappings in the "to 10646" table as
the one to use in the other direction, or provide a full "from 10646"
table, with or without best fit mappings depending on the outcome of
this discussion.
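
The two options could look roughly like this. This is only a sketch: the table layout, the boolean flag and the derive_from_10646() helper are assumptions made for illustration, not a proposed registration format:

```python
# Option 1 (hypothetical layout): a single "to 10646" table in which a flag
# marks the entry that is also used when converting from 10646.
TO_10646_MARKED = [
    (b"\xA4\x51", 0x5341, True),   # used in both directions
    (b"\xA2\xCC", 0x5341, False),  # used only when converting to 10646
]


def derive_from_10646(marked):
    """Build the "from 10646" table out of the marked entries."""
    from_table = {}
    for big5, codepoint, use_for_reverse in marked:
        if not use_for_reverse:
            continue
        if codepoint in from_table:
            # Two marked entries for one codepoint would be ambiguous.
            raise ValueError(f"U+{codepoint:04X} is marked more than once")
        from_table[codepoint] = big5
    return from_table


# Option 2: two independent tables, one per direction. The "from 10646"
# table is written out in full instead of being derived.
TO_10646 = {big5: codepoint for big5, codepoint, _ in TO_10646_MARKED}
FROM_10646 = {0x5341: b"\xA4\x51"}

# For these two entries both options give the same reverse mapping.
print(derive_from_10646(TO_10646_MARKED) == FROM_10646)
```

With option 1 the reverse table can never disagree with the forward table, but it cannot express best fit mappings. With option 2 it can, which is why the choice depends on the outcome of this discussion.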

Erik

On 10/22/06, Mark Davis <mark.davis@icu-project.org> wrote:
> I agree. I think it would be far more straightforward and well-defined if all
> non-roundtrip mappings were excluded from the registrations.
>
> Mark
>
>
> On 10/22/06, Erik van der Poel <erikv@google.com> wrote:
> > I have to admit that Kent does make an important point here. The
> > example that really drives that point home is
> > Windows-1252-johndoesbetterfit. The best fit tables provided by
> > Microsoft are their own choices for mappings from the very large
> > Unicode set to smaller sets. Other implementors could and do come up
> > with other choices, depending on their particular product, target
> > market and current compatibility considerations.
> >
> > The most important mapping, in my view, is the one from the charset to
> > Unicode/10646. RFC 2978 is actually a little bit inconsistent here, in
> > that it mentions mappings to 10646 twice, and to/from 10646 only once.
> > Just look for "10646" and you will see what I mean.
> >
> > I believe my attempt to assist in the windows-1252 registration update
> > has revealed a lack of consensus (albeit among a very small number of
> > participants) regarding the "best fit" mappings. I wonder if we should
> > even restrict the normative/recommended 10646 mappings to the "to
> > 10646" mappings, making any supplied "from10646" mappings either
> > purely informative or maybe even unrecommended, since they appear to
> > be controversial.
> >
> > Erik
> >
> > On 10/22/06, Kent Karlsson <kent.karlsson14@comhem.se> wrote:
> > >
> > > Frank Ellermann wrote:
> > > > > ICU may have chosen 0x1A, but that was their own decision. There is
> > > > > no interoperability problem here
> > > >
> > > > An u2w.icu( x ) != u2w.bestfit( x ) effect could be ugly.  For some
> > >
> > > As I said, the fallbacks do not belong in the registration. It should be
> > > perfectly ok to use other fallbacks, e.g. generating higher level markup,
> > > be it character escapes or more [like <sup>...</sup> for instance, or
> > > <span class="red">...</span>], or some "this-is-even-better-fit".
> > >
> > > The fallbacks ("bestfit") of the "bestfit" file should *NOT* be part of
> > > the IANA charset registration!
> > >
> > > > code pages like < http://purl.net/net/cp/858> ICU tries hard to list
> > > > an "official" substitution character, in that case 0x7F, not 0x1A.
> > >
> > > As I mentioned, the ICU API allows the programmer quite a lot of control
> > > over how to handle conversion errors. One can set it up to automatically
> > > generate XML-ish or Java-ish escapes (which I prefer, even if not
> > > targeting XML or Java), or to use another "error" character (I would
> > > *never* choose '?' for that). One can set up one's own callback function
> > > for conversion errors.
> > >
> > > > > Should we strip the best fit mappings from the table and post it
> > > > > somewhere?
> > >
> > > There's one already.
> > >
> > > > They're fine, but could be improved by adding a hint how they were
> > > > determined, and who could fix them if needed.
> > >
> > > The "bestfit" one should NOT be used for the registration. It could be
> > > seen as making any "better" converters (e.g. generating XML escapes)
> > > "non-conforming" (each requiring a different charset registration;
> > > 'Windows-1252-XMLescapes', 'Windows-1252-XMLescapes-boldnredCSS',
> > > 'Windows-1252-XMLescapes-boldnredCSS-butSUPforsuperscripts',
> > > 'Windows-1252-johndoesbetterfit', ...). I hope you don't want that.
> > >
> > >                 /kent k
> > >
> > >
> >
>
>

