
UTF-8 vs. Unicode (was: Re: Volunteer needed to serve as IANA charset reviewer)



Following up on this topic:

> As for utf-8 vs. Unicode, this is a bit tricky.  I agree that merely
> specifying Unicode isn't sufficient given the potential for
> incompatible CESs.

Framing this as UTF-8 vs. Unicode is an incomplete way of specifying
the distinctions to be made. The real question is which level of
specification is appropriate.

If your concern is specification of the character semantics,
then you designate the Unicode Standard (or the equivalent
ISO/IEC 10646) and a version level to get the exact
repertoire.

If your concern is memory representation or API support, then
you designate one of the three Character Encoding Forms formally
and normatively defined in the Unicode Standard (and equivalently
in ISO/IEC 10646): UTF-8, UTF-16, or UTF-32.
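To make the encoding-form level concrete, here is a minimal sketch in
Python that counts code units per encoding form. The helper name
code_units is my own, for illustration only; the codec names are those
of the Python standard library.

```python
def code_units(text):
    """Return the number of code units the text needs in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-be")) // 4,  # 32-bit code units
    }


# U+0041, U+00E9, U+65E5, and U+1F600 (outside the BMP).
for ch in ["A", "\u00e9", "\u65e5", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

A character outside the Basic Multilingual Plane takes four code units
in UTF-8, two in UTF-16 (a surrogate pair), and one in UTF-32.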

If your concern is serial byte representation in a byte-oriented
protocol or stream, then you designate one of the Character Encoding
Schemes (CES's) formally and normatively defined in the Unicode
Standard: UTF-8, (UTF-16BE, UTF-16LE, UTF-16 with BOM),
(UTF-32BE, UTF-32LE, UTF-32 with BOM).
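As a rough illustration of the byte-serialization level, the following
Python sketch shows the bytes each scheme produces for one character.
The function name serialize is hypothetical. The BOM is written
explicitly (big-endian here, as an arbitrary choice) so the output does
not depend on the byte order of the machine running it.

```python
import codecs


def serialize(text):
    """Map each encoding scheme to the bytes it produces, shown as hex."""
    return {
        "UTF-8": text.encode("utf-8").hex(),
        "UTF-16BE": text.encode("utf-16-be").hex(),
        "UTF-16LE": text.encode("utf-16-le").hex(),
        # BOM followed by big-endian data; a little-endian BOM would be equally valid.
        "UTF-16 with BOM": (codecs.BOM_UTF16_BE + text.encode("utf-16-be")).hex(),
        "UTF-32BE": text.encode("utf-32-be").hex(),
        "UTF-32LE": text.encode("utf-32-le").hex(),
        "UTF-32 with BOM": (codecs.BOM_UTF32_BE + text.encode("utf-32-be")).hex(),
    }


# U+20AC EURO SIGN
for scheme, hexbytes in serialize("\u20ac").items():
    print(f"{scheme:16} {hexbytes}")
```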

All of the CES's are fully interoperable and compatible with
each other: text in any one of them can be converted losslessly
to any other. And only those CES's normatively defined in
the Unicode Standard should be considered CES's of Unicode.
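A minimal sketch of what that interoperability means in practice,
assuming Python's standard codecs as stand-ins for the seven schemes
(the names transcode and round_trips are mine, for illustration):

```python
# Python codec names; "utf-16" and "utf-32" write a BOM on encode and
# consume it on decode.
SCHEMES = ["utf-8", "utf-16-be", "utf-16-le", "utf-16",
           "utf-32-be", "utf-32-le", "utf-32"]


def transcode(data, src, dst):
    """Convert bytes in scheme src to bytes in scheme dst."""
    return data.decode(src).encode(dst)


def round_trips(text):
    """Check that every src -> dst conversion preserves the text exactly."""
    for src in SCHEMES:
        for dst in SCHEMES:
            converted = transcode(text.encode(src), src, dst)
            if converted.decode(dst) != text:
                return False
    return True


print(round_trips("A\u00e9\u65e5\U0001F600"))
```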

>  And yet I'm sympathetic to the notion that UTF-8
> pessimizes storage and transmission of text written in certain
> languages.  IMHO it's unreasonable to exclude the potential for a
> Unicode based CES that has more-or-less equivalent information
> density across a wide variety of languages.

Ah, but that is none other than UTF-16, which is in widespread
use for that reason, among others. But it doesn't make much sense
for the web or for most Internet protocols, because UTF-8 is
already ubiquitous in those contexts.
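For a sense of the density trade-off, here is a small Python sketch
comparing byte counts. The sample strings and the name byte_counts are
my own choices, and three short strings are of course not a
measurement of real-world text.

```python
SAMPLES = {
    "English": "hello",
    "Greek": "\u03b1\u03b2\u03b3",     # alpha, beta, gamma
    "Japanese": "\u65e5\u672c\u8a9e",  # "nihongo"
}


def byte_counts(text):
    """Return (UTF-8 bytes, UTF-16 bytes) for the text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


for language, sample in SAMPLES.items():
    utf8, utf16 = byte_counts(sample)
    print(f"{language:9} UTF-8: {utf8:2}  UTF-16: {utf16:2}")
```

ASCII text is half the size in UTF-8, the Greek sample comes out even,
and the Japanese sample is half again as large in UTF-8 as in UTF-16.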

>  But I do think that use of
> multiple CESs in a new protocol should require substantial
> justification, and that UTF-8 should be presumed to be the CES of
> choice for any new protocol that requires ASCII compatibility for its
> character representation.

I agree completely with that assessment.

--Ken

> 
> Keith
> 
>