unknown-xyz (was: Volunteer needed to serve as IANA charset reviewer)
Claus Färber wrote:
>>> "UNKNOWN-UTF16"
>> What's the difference from UTF-16 ?
> UTF-16 "SHOULD be interpreted as being big-endian" if there's
> no BOM, RFC 2781, 4.3. UNKNOWN-UTF16 would not have such a
> fall back.
Okay, but with a good excuse violating a SHOULD is possible...
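To make the difference concrete, here is a minimal sketch of the byte-order decision, assuming Python. The function name and the `fallback` flag are hypothetical; with `fallback=True` it follows the RFC 2781 default quoted above, and with `fallback=False` it behaves the way the proposed UNKNOWN-UTF16 presumably would, i.e. it gives up when there is no BOM.

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes, fallback: bool = True) -> Optional[str]:
    """Return "be", "le", or None when the byte order cannot be determined."""
    if data.startswith(codecs.BOM_UTF16_BE):  # octets FE FF
        return "be"
    if data.startswith(codecs.BOM_UTF16_LE):  # octets FF FE
        return "le"
    # No BOM: RFC 2781 says the text SHOULD be read as big-endian.
    # Without that fallback (assumed UNKNOWN-UTF16 behaviour) nothing is known.
    return "be" if fallback else None
```

A caller that has a good excuse for violating the SHOULD would pass `fallback=False` and apply its own heuristic when it gets `None`.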
>>> with alias "UNICODE".
>> Ugh, thanks, but no thanks.
> The idea is to deprecate the label "UNICODE" by tying it to
> an incompletely specified charset.
...sneaky <g>
In reality that boils down to "any even number of octets not
including 0xfeff or 0xfffe", or am I missing something ? Who
could be interested in how that differs from "unknown-8bit" ?
---
>>> "UNKNOWN-ISO-8859" with alias "ANSI".
>>> "UNKNOWN-IBMPC" with alias "OEM".
>> One of those could do, "unknown-ascii-8bit", alias "oem".
> We already have UNKNOWN-8BIT.
> When you convert legacy data, you often DO know that
> something is in a DOSish (IBMPC-based) or Windowsish
> (ANSI-based) charset. Having charset labels to carry
> this information (instead of the unspecified UNKNOWN-8BIT)
> is a good idea.
Yes, but why the difference, who's supposed to guess what's
what, and who's interested in the dubious outcome of such
guesses ?
If I screw up, what you get is a bogus "Latin-1", and you can
correctly guess that it must be bogus as soon as you find any
C1 octets. But without human intervention you don't know how
I screwed up: it could be windows-1252, pc-multilingual-850+euro,
or worse (cp437, wild mixtures, who knows).
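A minimal sketch of that situation, assuming Python and its bundled cp1252, cp850 and cp437 codecs. The function name and the sample octets are hypothetical; the point is only that the C1 check is mechanical while the choice between the readings is not.

```python
def has_c1_octets(data: bytes) -> bool:
    """True if any octet falls in the C1 control range 0x80..0x9F."""
    return any(0x80 <= octet <= 0x9F for octet in data)


# Hypothetical data that arrived labelled as ISO-8859-1.
sample = b"F\x84rber"

# The same octets decode cleanly under several legacy charsets,
# so a successful decode says nothing about which guess is right.
candidates = ("cp1252", "cp850", "cp437")
readings = {name: sample.decode(name) for name in candidates}
```

Here `has_c1_octets(sample)` is true, so the "Latin-1" label is suspect, yet every candidate produces some text and only a human can tell which one was meant.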
An "unknown-ascii-8bit" would say: neither ISO-8859-x nor UTF-8,
but at least MIME compatible (one hopes).
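One possible reading of that label, sketched in Python. Everything here is an assumption for illustration: the function name, the order of the checks, and the "iso-8859-x?" placeholder are mine, and "unknown-ascii-8bit" is only the label proposed in this thread, not a registered charset.

```python
def guess_label(data: bytes) -> str:
    """Pick a coarse charset label for ASCII-compatible octets (illustrative)."""
    if all(octet < 0x80 for octet in data):
        return "us-ascii"
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        pass
    if not any(0x80 <= octet <= 0x9F for octet in data):
        # No C1 octets: some ISO-8859 part remains plausible, but which
        # one cannot be told from the octets alone.
        return "iso-8859-x?"
    # 8-bit, not UTF-8, and C1 octets present: the case described above.
    return "unknown-ascii-8bit"
```

For example, `guess_label(b"F\x84rber")` returns "unknown-ascii-8bit", while `guess_label(b"F\xe4rber")` returns "iso-8859-x?".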
The W3C validator could make use of that "unknown-ascii-8bit":
report one error for the charset (if it's only a guess), but then
continue to report the unrelated, interesting errors.
Frank
--
Honk for 4234 to STD