
Re: ISO 8859 -8:1999

To: Martin Duerst <[email protected]>, Mark Davis <[email protected]>,Jonathan Rosenne <[email protected]>
Subject: Re: ISO 8859 -8:1999
From: Mark Davis <[email protected]>
Date: Sat, 01 Dec 2001 05:55:20 -0800
Cc: [email protected]
References: <4.2.0.58.J.20011201151850.0448a870@localhost>

I agree with Martin that knowing supersets (and subsets) would be useful
information. If there were another format for aliases, that said that some
other set is a superset or subset or close to (but not the same as) this
set, that would be quite useful. However, in order to avoid corrupting data,
any such superset/subset relationship must actually be true, and true on
all platforms that implement both charsets.
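
As a rough sketch of what "actually true" would have to mean: if each
charset is modelled as a table from byte value to Unicode code point, then
one set is a superset of another only when every mapping in the smaller
table appears unchanged in the larger one. The table representation, the
function name is_superset and the three tiny tables are assumptions made
for illustration; they are not real charset data or any registry format.

```python
def is_superset(candidate, base):
    """Return True if every byte -> code point mapping in `base`
    is also present, unchanged, in `candidate`."""
    return all(candidate.get(byte) == cp for byte, cp in base.items())


# Tiny hypothetical tables, not real charset data.
base = {0x41: 0x0041, 0x42: 0x0042}
extended = {0x41: 0x0041, 0x42: 0x0042, 0x80: 0x20AC}  # only adds a mapping
changed = {0x41: 0x0041, 0x42: 0x0043}                 # remaps byte 0x42

print(is_superset(extended, base))  # True
print(is_superset(changed, base))   # False
```

A real check would also have to be repeated for each platform's version of
both tables, since the relationship has to hold everywhere.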

That seems obvious, but there are pitfalls -- character sets that may seem
to be supersets may not, in reality, be such. For example, the cp1252 family,
as implemented on Windows, will silently map any "unassigned" character
in 0x80..0x9F into U+0080..U+009F, and back again. When a new character is
added (as the Euro was, at 0x80), the mapping changes. So the new map is *not* a
superset of the old -- one will get distinctly different mappings for the
same Unicode character. So the most that can be said about the relationship
between windows-1252-1998 and windows-1252-2000 is that they are
"close to" one another, *not* that the latter is a superset of the former.
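
To make the Euro case concrete, here is a deliberately simplified model of
the behaviour described above. Only two byte positions are modelled (0x85
for the ellipsis and 0x80 for the Euro); the helper name
with_c1_passthrough and both tables are illustrative assumptions, not the
contents of the actual mapping files.

```python
def with_c1_passthrough(assigned):
    """Map every byte in 0x80..0x9F to the same-numbered C1 code point,
    then overlay the explicitly assigned characters."""
    table = {b: b for b in range(0x80, 0xA0)}
    table.update(assigned)
    return table


common = {0x85: 0x2026}  # HORIZONTAL ELLIPSIS, present in both models

old = with_c1_passthrough(common)                      # 0x80 <-> U+0080
new = with_c1_passthrough({**common, 0x80: 0x20AC})    # 0x80 <-> U+20AC (Euro)

# Bytes whose old mapping does not survive unchanged in the new table.
conflicts = {b for b in old if new.get(b) != old[b]}
print([hex(b) for b in sorted(conflicts)])  # ['0x80']
```

A single conflicting byte is enough to break the superset relationship: data
written under the old table with 0x80 meaning U+0080 reads back as a Euro
under the new one.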

To make it even more confusing, other versions of cp1252 may have
super/subset relations. For example, java-Cp1252-1.1_P.xml is a full subset
of java-Cp1252-1.3_P.xml.

For more examples, see:
http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/xml/windows-1252-2000.xml
(for the windows definition)
http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/xml/java-Cp1252-1.3_P.xml
(for the java definition)
(the older versions are not posted)

For a list of the data we have collected so far on character set mappings,
see http://oss.software.ibm.com/icu/charset/index.html. In particular, there
is a generated analysis of different sets on
http://oss.software.ibm.com/icu/charset/roundtripIndex.html.

Note: while the IANA registry is the best that we have, assuming that an
IANA ID always means the same thing will result in data corruption. Take
1252, for example:

aix-IBM_1252-4.3.6 is identical to windows-1252-2000, but only if fallback
mappings are excluded.
java-Cp1252-1.3_P and glibc-CP1252-2.1.2 are 98.05% the same, not identical.
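
One way to put a number on "how much the same" is sketched below. It is
only an assumption about how such a figure could be computed, not the
method behind the percentage quoted above. The helper names decode_table
and agreement are made up for the example, and Python's built-in cp1252
and latin-1 codecs stand in for two implementations; they are not the
tables named in this message.

```python
def decode_table(codec):
    """Build a byte -> code point table from a Python codec,
    skipping bytes that the codec refuses to decode."""
    table = {}
    for b in range(256):
        try:
            table[b] = ord(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # byte is unassigned in this implementation
    return table


def agreement(a, b):
    """Percentage of bytes, among those defined in either table,
    that both tables map to the same code point."""
    keys = set(a) | set(b)
    same = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return 100.0 * same / len(keys)


cp1252 = decode_table("cp1252")
latin1 = decode_table("latin-1")

print(f"{agreement(cp1252, latin1):.2f}% the same")
print("unassigned in this cp1252:",
      [hex(b) for b in range(256) if b not in cp1252])
```

Note that this particular cp1252 implementation rejects some bytes in
0x80..0x9F outright instead of passing them through to U+0080..U+009F,
which is one more way two tables under the same name can disagree.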

Even with the 8859 series there are differences -- search on
roundtripIndex.html for 8859_7, for example. When we get to East Asian sets,
there are a considerable number of variants.

Mark
—————

Ὀλίγοι ἔμφονες πολλῶν ἀφρόνων φοβερώτεροι — Πλάτωνος
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com
----- Original Message -----
From: "Martin Duerst" <[email protected]>
To: "Mark Davis" <[email protected]>; "Jonathan Rosenne"
<[email protected]>
Cc: <[email protected]>
Sent: Friday, November 30, 2001 22:25
Subject: Re: ISO 8859 -8:1999

> At 18:04 01/11/30 -0800, Mark Davis wrote:
>
> >Supersets will still cause data corruption problems. See
> >http://www.unicode.org/unicode/reports/tr22/, especially Section 1.2.1
> >
> >Mark
>
> There are many cases where knowing a superset relationship would
> help. We might propose to IANA that they maintain this information,
> or we might just ask upcoming registrations to include that in
> their registration information.
>
> Regards,    Martin.
>

