Re: Best fit
I have to admit that Kent does make an important point here. The
example that really drives that point home is
Windows-1252-johndoesbetterfit. The best fit tables provided by
Microsoft are their own choices for mappings from the very large
Unicode set to smaller sets. Other implementors could and do come up
with other choices, depending on their particular product, target
market and current compatibility considerations.
The most important mapping, in my view, is the one from the charset to
Unicode/10646. RFC 2978 is actually a little bit inconsistent here, in
that it mentions mappings to 10646 twice, and to/from 10646 only once.
Just look for "10646" and you will see what I mean.
I believe my attempt to assist in the windows-1252 registration update
has revealed a lack of consensus (albeit among a very small number of
participants) regarding the "best fit" mappings. I wonder if we should
even restrict the normative/recommended 10646 mappings to the "to
10646" mappings, making any supplied "from 10646" mappings either
purely informative or maybe even not recommended, since they appear to
be controversial.
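
To illustrate the asymmetry, here is a minimal Python sketch. It assumes Python's built-in "cp1252" codec as a stand-in for the windows-1252 table (it is Python's own table, not the Microsoft best fit file), and the function name encode_with_fallbacks and the handler name "sub1a" are made up for this example. The "to 10646" direction gives a single answer, while the "from 10646" direction depends entirely on which fallback the converter picks.

```python
import codecs


def _substitute_1a(err):
    # Hypothetical fallback: replace each unencodable character with
    # U+001A (SUB), similar in spirit to a 0x1A substitution byte.
    if not isinstance(err, UnicodeEncodeError):
        raise err
    return ("\x1a" * (err.end - err.start), err.end)


# "sub1a" is an illustrative handler name, not a standard one.
codecs.register_error("sub1a", _substitute_1a)


def encode_with_fallbacks(text):
    """Encode text to cp1252 under several converter-chosen fallbacks."""
    handlers = ["replace", "xmlcharrefreplace", "backslashreplace", "sub1a"]
    return {name: text.encode("cp1252", errors=name) for name in handlers}


# To Unicode: bytes 0x80 and 0xB2 map to exactly one code point each.
decoded = b"\x80\xb2".decode("cp1252")  # EURO SIGN, SUPERSCRIPT TWO

# From Unicode: U+2260 (NOT EQUAL TO) has no cp1252 code point, so the
# output depends on the fallback, none of which the charset itself defines.
results = encode_with_fallbacks("x\u00b2 \u2260 \u20ac")
for name, data in results.items():
    print(name, data)
```

Each of the four results is a plausible converter output for the same input, which is why tying any single one of them to the charset label seems problematic.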
Erik
On 10/22/06, Kent Karlsson <kent.karlsson14@comhem.se> wrote:
>
> Frank Ellermann wrote:
> > > ICU may have chosen 0x1A, but that was their own decision. There is
> > > no interoperability problem here
> >
> > A u2w.icu( x ) != u2w.bestfit( x ) effect could be ugly. For some
>
> As I said, the fallbacks do not belong in the registration. It should be
> perfectly ok to use other fallbacks, e.g. generating higher-level markup,
> be it character escapes or more [like <sup>...</sup> for instance, or
> <span class="red">...</span>], or some "this-is-even-better-fit".
>
> The fallbacks ("bestfit") of the "bestfit" file should *NOT* be part of
> the IANA charset registration!
>
> > code pages like <http://purl.net/net/cp/858> ICU tries hard to list
> > an "official" substitution character, in that case 0x7F, not 0x1A.
>
> As I mentioned, the ICU API allows the programmer quite a lot of control
> over how to handle conversion errors. One can set it up to automatically
> generate XML-ish or Java-ish escapes (which I prefer, even if not
> targeting XML or Java), or to use another "error" character (I would
> *never* choose '?' for that). One can set up one's own callback function
> for conversion errors.
>
> > > Should we strip the best fit mappings from the table and post it
> > > somewhere?
>
> There's one already.
>
> > They're fine, but could be improved by adding a hint how they were
> > determined, and who could fix them if needed.
>
> The "bestfit" one should NOT be used for the registration. It could be
> seen as making any "better" converters (e.g. generating XML escapes)
> "non-conforming" (each requiring a different charset registration;
> 'Windows-1252-XMLescapes', 'Windows-1252-XMLescapes-boldnredCSS',
> 'Windows-1252-XMLescapes-boldnredCSS-butSUPforsuperscripts',
> 'Windows-1252-johndoesbetterfit', ...). I hope you don't want that.
>
> /kent k
>
>
- References:
- Re: Best fit
- From: Frank Ellermann <nobody@xyzzy.claranet.de>
- RE: Best fit
- From: Kent Karlsson <kent.karlsson14@comhem.se>