
Re: windows-1252



On 1/14/06, Erik van der Poel <erik@vanderpoel.org> wrote:
> > The latter was apparently supplied by another person claiming
> > to speak for MS.  Either 0x81, 0x8D, 0x8F, 0x90, and 0x9D are
> > mapped one to one, u+0081 etc., or they are not.  As long as
> > that's not absolutely clear and also reflected in the Unicode
> > mappings modifying the IANA registry about 1252 makes no sense.
>
> RFC 2978 does not require a Unicode mapping. It says that there "SHOULD"
> be a 10646 mapping, but it does not use the word "MUST".

Section 1.3 of RFC 2978 does require that "the definition associated
with a charset name must fully specify the mapping to be performed",
even though it does not require a *Unicode* conversion table.

Given that much modern software operates in Unicode internally, it is
desirable to have an authoritative Unicode conversion table,
especially if it could be kept up to date with (hopefully compatible)
changes.

> I agree that it is nice to have the 10646 mapping, but are unassigned
> codepoints not allowed to exist in IANA-registered charsets (other than
> UTF-8 and all the 10646-based charsets)? If so, where in the RFC does it
> say that?

Unassigned code points are of course allowed. Without double-checking
the tables that Mike pointed to in his requests, I think what Frank
alluded to is how Windows treats unassigned codes in its SBCS
(single-byte) charsets: it usually roundtrips an unassigned byte xx
to/from Unicode U+00xx, rather than mapping it to some SUBstitution
character. Some but not all published tables reflect this.
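
A minimal Python sketch of the two treatments may help. It relies on the
fact that Python's built-in "cp1252" codec leaves 0x81, 0x8D, 0x8F, 0x90
and 0x9D undefined. The function names and the fallback logic are
illustrative assumptions that imitate the roundtrip behavior described
above; they are not Windows' actual conversion code.

```python
# The five bytes that Python's cp1252 table leaves unassigned.
UNASSIGNED_1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def decode_strict(data: bytes) -> str:
    """Strict table: unassigned bytes raise UnicodeDecodeError."""
    return data.decode("cp1252")


def decode_roundtrip(data: bytes) -> str:
    """Hypothetical roundtrip decoder: unassigned byte xx becomes U+00xx."""
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(chr(b))  # e.g. 0x81 -> U+0081
    return "".join(out)


def encode_roundtrip(text: str) -> bytes:
    """Inverse of decode_roundtrip: U+00xx goes back to unassigned byte xx."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ord(ch) in UNASSIGNED_1252:
                out.append(ord(ch))
            else:
                raise
    return bytes(out)
```

With these definitions, decode_roundtrip(b"\x81") yields U+0081 and
encode_roundtrip restores the original byte, whereas decode_strict
rejects the same input.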

As a result, when such an undefined but roundtripping code gets a
character assigned to it, then its roundtrip mapping changes to the
new Unicode code point. There is usually no fallback (one-way) mapping
from the old U+00xx to the now-assigned xx.

If you treat the presence of a roundtrip mapping (to/from Unicode or
any other character set) as equivalent to establishing a byte
sequence's character identity, then windows-1252 used to have a C1
control code at 0x80, while now it has a prominent currency symbol
(the euro sign, U+20AC) assigned to that code. That would be an
incompatible change. Of course, the documentation of windows-1252
showed 0x80 as undefined before the change.
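
The effect at 0x80 can be sketched with a toy model in Python. The
make_tables helper and the one-byte tables are hypothetical and cover
only the single code 0x80; they are not the real conversion tables.

```python
def make_tables(assigned, codes=(0x80,)):
    """Build toy decode/encode maps for the given byte codes.

    A byte listed in `assigned` maps to its assigned character; any
    other byte roundtrips to U+00xx (the behavior described above).
    """
    decode = {b: assigned.get(b, chr(b)) for b in codes}
    encode = {ch: b for b, ch in decode.items()}
    return decode, encode


# Before the change: 0x80 is unassigned and roundtrips with U+0080.
old_decode, old_encode = make_tables({})

# After the change: 0x80 is assigned the euro sign, U+20AC.
new_decode, new_encode = make_tables({0x80: "\u20ac"})

# The same byte now decodes to a different code point ...
changed = old_decode[0x80] != new_decode[0x80]

# ... and text converted with the old table no longer converts back,
# because there is no fallback mapping from U+0080 to 0x80.
has_fallback = "\x80" in new_encode
```

Here changed is True and has_fallback is False: data converted to
Unicode under the old table contains U+0080, which the new table can
no longer map back to byte 0x80.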

I am not arguing for or against any registration here, merely
explaining what I have seen in Windows conversion.

Note that a charset is defined as "a method of converting a sequence
of octets into a sequence of characters", so roundtrip mappings don't
formally come into play. There is also no definition of when a
charset is "stable", of whether it may be extended by assigning
characters to formerly unassigned codes (or otherwise), or of whether
two charsets are considered different or compatible.

Final note: windows-874 and windows-1252 and the other charsets that
Mike Ksar requested to register are of course the Windows "ANSI"
system code pages which are very widely used.

markus

--
Opinions expressed here may not reflect my company's positions unless
otherwise noted.