unknown-xyz (was: Volunteer needed to serve as IANA charset reviewer)



Claus Färber wrote:

>>> "UNKNOWN-UTF16"
>> What's the difference from UTF-16 ?

> UTF-16 "SHOULD be interpreted as being big-endian" if there's
> no BOM, RFC 2781, 4.3. UNKNOWN-UTF16 would not have such a
> fall back.

Okay, but with a good excuse violating a SHOULD is possible...
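
To make the difference concrete, here is a rough Python sketch.
It is not taken from RFC 2781 or from any registry entry; the
name decode_utf16 and the fallback_big_endian flag are made up
for illustration.  With the flag set it follows the SHOULD; with
the flag cleared it does what the proposed UNKNOWN-UTF16 would
presumably do, i.e. refuse to guess.

```python
import codecs


def decode_utf16(data: bytes, fallback_big_endian: bool = True) -> str:
    """Decode octets labeled UTF-16, honouring a leading BOM if present."""
    if data.startswith(codecs.BOM_UTF16_BE):      # 0xFE 0xFF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):      # 0xFF 0xFE
        return data[2:].decode("utf-16-le")
    if fallback_big_endian:
        # RFC 2781, 4.3: without a BOM the text SHOULD be read as big-endian.
        return data.decode("utf-16-be")
    # Assumed "UNKNOWN-UTF16" behaviour: no fallback, so the byte order
    # stays unknown and we give up instead of guessing.
    raise ValueError("no BOM and no fallback: byte order unknown")
```

Without a BOM, decode_utf16(b"\x00A") yields "A" under the
fallback, while the same call with fallback_big_endian=False
raises ValueError.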

>>> with alias "UNICODE".
>> Ugh, thanks, but no thanks.

> The idea is to deprecate the label "UNICODE" by tying it to
> an incompletely specified charset.

...sneaky <g>

In reality that boils down to "any even number of octets not
including 0xfeff or 0xfffe", or am I missing something ?  Who
could be interested in that difference from "unknown-8bit" ?

---
>>> "UNKNOWN-ISO-8859" with alias "ANSI".
>>> "UNKNOWN-IBMPC" with alias "OEM".

>> One of those could do, "unknown-ascii-8bit", alias "oem".

> We already have UNKNOWN-8BIT.
> When you convert legacy data, you often DO know that 
> something is in a DOSish (IBMPC-based) or Windowsish
> (ANSI-based) charset. Having charset labels to carry
> this information (instead of the unspecified UNKNOWN-8BIT)
> is a good idea.

Yes, but why the difference, who's supposed to guess what's
what, and who's interested in the dubious outcome of such
guesses ?

If I screw up, what you get is a bogus "Latin-1", and you can
correctly guess that it must be bogus as soon as you find any
C1 octets.  But without human intervention you don't know how
I screwed up: it could be windows-1252, pc-multilingual-850+euro,
or worse (cp437, wild mixtures, who knows).
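
A small Python sketch of that situation, under some assumptions:
the function names are invented here, and I take Python's "cp858"
codec to be the closest match for pc-multilingual-850+euro.  The
C1 check tells you the "Latin-1" label is bogus; the second
function only shows that several decodings remain possible, it
does not pick one.

```python
def has_c1_octets(data: bytes) -> bool:
    """True if any octet falls in the C1 range 0x80..0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)


def candidate_decodings(data: bytes) -> dict:
    """Decode the same octets under a few plausible legacy charsets.

    Codecs that reject the input (cp1252 leaves some C1 positions
    undefined) are simply left out of the result.
    """
    results = {}
    for name in ("cp1252", "cp858", "cp437"):
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass
    return results
```

For the single octet 0x80, candidate_decodings gives a euro sign
under cp1252 but a C-cedilla under cp858 and cp437, so the octets
alone don't say which one was meant.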

An "unknown-ascii-8bit" would mean: neither ISO-8859-x nor
UTF-8, but at least MIME compatible (one hopes).

The W3C validator could make use of that "unknown-ascii-8bit",
one error for that (if it's only a guess), but then continue
to report unrelated interesting errors.

Frank
-- 
Honk for 4234 to STD