Re: ISO 8859 -8:1999
Precisely. As discussed in TR22, 1.2.1:
There may be cases where a specified character mapping table is not
available. In such cases, a best-fit mapping table can be used. However, this
technique should be used with caution, since otherwise data can be corrupted.
For example, in XML there are different strategies depending on whether the
process is parsing or generating.
Suppose that you have two sets X and SUB_X, where X is a superset of
SUB_X. (That is, every round-trip mapping that is in SUB_X is also in X,
and X may contain additional round-trip mappings.) Then:
- It is ok to parse with X when the file is tagged
as SUB_X. Since X is a superset, all the characters will be read correctly.
Any characters that are not in SUB_X will be encoded as NCRs (e.g.
&#xABCD;), and will work.
- It is ok to generate the file with SUB_X, and
tag the file as X. As long as you convert the characters that are not in
SUB_X into NCRs, everything works.
- What is NOT ok is to parse with SUB_X when the
file is tagged with X — characters will be corrupted.
- What is NOT ok is to generate the file with X,
and tag the file with SUB_X — characters will be corrupted.
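
The four cases above can be tried out with a small sketch. It is only an illustration under some assumptions: "ascii" and "iso-8859-1" stand in for SUB_X and X (they are not the charsets discussed in this thread), the helper names generate, parse and survives are made up here, and html.unescape is used as a shortcut for the NCR expansion a real XML parser would perform.

```python
import html

SUB_X = "ascii"       # stand-in for the smaller set (assumption)
X = "iso-8859-1"      # stand-in for its superset (assumption)


def generate(text: str, charset: str) -> bytes:
    """Encode text; characters outside charset become NCRs like &#1488;."""
    return text.encode(charset, errors="xmlcharrefreplace")


def parse(data: bytes, charset: str) -> str:
    """Decode bytes (unknown bytes become U+FFFD), then expand NCRs."""
    return html.unescape(data.decode(charset, errors="replace"))


def survives(text: str, generated_with: str, parsed_with: str) -> bool:
    """True if text comes back unchanged after generate + parse."""
    return parse(generate(text, generated_with), parsed_with) == text


# U+00E9 is in X but not in SUB_X; U+05D0 is in neither set.
sample = "caf\u00e9 \u05d0"

print(survives(sample, SUB_X, X))  # generate with SUB_X, parse with X
print(survives(sample, X, SUB_X))  # generate with X, parse with SUB_X
```

The first call prints True, because every character outside SUB_X went out as an NCR. The second prints False: the byte for U+00E9 has no meaning to a SUB_X parser, so the character is lost.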
—————
----- Original Message -----
Sent: Tuesday, December 04, 2001 21:27
Subject: RE: ISO 8859 -8:1999
> Hello Francois,
>
> I'm not exactly sure your argument with UTF-8 works.
>
> The problem becomes obvious if we assume that the UCS
> is the reference. In this case, additions to the UCS
> can always be made to work as well as the current
> implementation (because we don't expect conversion
> to legacy encodings to work, except for something
> like NCRs in HTML/XML). However, additions to legacy
> encodings will break interoperability if new data
> is sent to old implementations with the old label,
> in particular if the implementation converts to
> UCS internally (which more and more implementations
> do nowadays).
>
> Regards, Martin.
>
> At 14:15 01/12/02 -0500, Francois Yergeau wrote:
> >Keld Jørn Simonsen wrote:
> > > Jonathan Rosenne wrote:
> > >> Justification: ISO_8859-8:1999 is a superset of ISO_8859-8:1988. Valid
> > >> ISO_8859-8:1988 data will still be valid under ISO_8859-8:1999. The new
> > >> characters were reserved in ISO_8859-8:1988. Registering ISO_8859-8:1999
> > >> as a separate character set would cause too much unnecessary confusion.
> > >
> > >As the two charsets are not exactly equivalent, they should not be
> > >aliases.
> >
> >Not so fast, please. This question was debated at length, on this very
> >list, when RFC 2279 was under scrutiny. The result was that the label
> >"UTF-8" was determined to be appropriate for all versions of Unicode after
> >1.1, provided no incompatible change occurs, which is the same as saying
> >that subsequent versions have a superset relationship to previous ones. The
> >arguments are in the RFC itself, section 5:
> >
> > It is noteworthy that the label "UTF-8" does not contain a version
> > identification, referring generically to ISO/IEC 10646. This is
> > intentional, the rationale being as follows:
> >
> > A MIME charset label is designed to give just the information needed
> > to interpret a sequence of bytes received on the wire into a sequence
> > of characters, nothing more (see RFC 2045, section 2.2, in [MIME]).
> > As long as a character set standard does not change incompatibly,
> > version numbers serve no purpose, because one gains nothing by
> > learning from the tag that newly assigned characters may be received
> > that one doesn't know about. The tag itself doesn't teach anything
> > about the new characters, which are going to be received anyway.
> >
> > Hence, as long as the standards evolve compatibly, the apparent
> > advantage of having labels that identify the versions is only that,
> > apparent. But there is a disadvantage to such version-dependent
> > labels: when an older application receives data accompanied by a
> > newer, unknown label, it may fail to recognize the label and be
> > completely unable to deal with the data, whereas a generic, known
> > label would have triggered mostly correct processing of the data,
> > which may well not contain any new characters.
> >
> > ["Korean mess" paragraph elided]
> >
> > In practice, then, a version-independent label is warranted, provided
> > the label is understood to refer to all versions after Amendment 5,
> > and provided no incompatible change actually occurs. Should
> > incompatible changes occur in a later version of ISO/IEC 10646, the
> > MIME charset label defined here will stay aligned with the previous
> > version until and unless the IETF specifically decides otherwise.
> >
> >I think the same argument can apply to any other charset that evolves
> >compatibly, i.e. later versions are strict supersets of earlier ones and
> >nothing else creates incompatibility. Jonathan tells us that this is the
> >case with ISO_8859-8:1988 and :1999.
> >
> >This does not prevent the IANA registry from containing superset
> >relationship information.
> >
> >Regards,
> >
> >--
> >François Yergeau
>
>