
Re: Registration of new charset BOCU-1 refreshed - UTF-8



Martin Duerst wrote:

> At 11:05 02/08/23 -0700, Markus Scherer wrote:
> 
>> Note that, aside from current support and preference, there is no 
>> reason to over-emphasize UTF-8 for the Internet.
>> For as long as a Unicode charset is used, conversion will be fast and 
>> there will be no mapping difference problems.
>> Each Unicode charset has its strengths and weaknesses, and protocols 
>> and software should select them accordingly.
> 
> Protocols have to work together.


I fully agree, and I am very much (and painfully) aware of the issues involved.

> Having a proliferation of Unicode
> encodings is about as problematic as having a proliferation of
> legacy encodings.


Respectfully, I would like to disagree on this point.
The use of non-Unicode charsets opens an entirely different, and much larger, Pandora's box of problems; these are well described in Unicode Technical Report #22 and in the W3C's XML Japanese Profile.

All Unicode charsets can be decoded with relatively small and fast code (even SCSU and BOCU-1), and there is never any confusion about which Unicode code point a given byte sequence maps to.
Mapping tables for non-Unicode charsets, by contrast, can be large - e.g., ICU's standard set uses about 5 MB of table data, while Unicode charsets need none at all.
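
To give an idea of just how small such code can be, here is a rough sketch of a UTF-8 decoder (my own illustration, not ICU code; it omits a few checks for ill-formed overlong/surrogate sequences for brevity):

   #include <stdint.h>

   /* Sketch of a UTF-8 decoder: reads one code point from *s and
      advances *s; returns -1 for an ill-formed sequence. Note that
      no mapping table is needed - the bytes themselves carry the
      Unicode code point. (A few overlong/surrogate checks are
      omitted here for brevity.) */
   static int32_t nextCodePoint(const uint8_t **s, const uint8_t *limit) {
       const uint8_t *p = *s;
       int32_t c;
       int count;

       if (p >= limit) { return -1; }
       c = *p++;
       if (c < 0x80) { count = 0; }                        /* US-ASCII */
       else if (0xc2 <= c && c <= 0xdf) { c &= 0x1f; count = 1; }
       else if (0xe0 <= c && c <= 0xef) { c &= 0x0f; count = 2; }
       else if (0xf0 <= c && c <= 0xf4) { c &= 0x07; count = 3; }
       else { return -1; }                                 /* bad lead byte */
       while (count-- > 0) {
           if (p >= limit || (*p & 0xc0) != 0x80) { return -1; }
           c = (c << 6) | (*p++ & 0x3f);
       }
       *s = p;
       return c;
   }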

I am also not planning to invent any more Unicode charsets for public use. Mark and I created BOCU-1 because of its advantages - in certain environments - over the other Unicode charsets.

As far as I can tell, there are only 6 Unicode charsets that are useful in protocols, each with its own pros and cons:

   UTF-8, UTF-16, UTF-16BE, UTF-16LE, SCSU, BOCU-1

There are only 3 more Unicode charsets for modern use, but their encoded size makes them too expensive to be useful in data exchange:

   UTF-32, UTF-32BE, UTF-32LE

These Unicode charsets are all well-defined, fast, and efficient to implement because they all use the Unicode/UCS coded character set.
This is not much of a "proliferation".
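
For illustration, ICU handles every one of them through the same converter API; the following is a minimal sketch (the ucnv calls are ICU's real API, but the sample text and buffer sizes are my own):

   #include <stdio.h>
   #include "unicode/ucnv.h"

   /* Sketch: decoding to UTF-16 looks identical for every Unicode
      charset - only the name passed to ucnv_open() changes. */
   int main() {
       static const char bytes[] = "Hello, world!";   /* UTF-8 input */
       UChar buffer[100];
       UErrorCode status = U_ZERO_ERROR;
       /* "SCSU", "BOCU-1", "UTF-16BE", ... work with these very
          same calls, given bytes in that charset */
       UConverter *cnv = ucnv_open("UTF-8", &status);
       int32_t length;

       if (U_FAILURE(status)) { return 1; }
       length = ucnv_toUChars(cnv, buffer, 100,
                              bytes, (int32_t)(sizeof(bytes) - 1), &status);
       ucnv_close(cnv);
       if (U_FAILURE(status)) { return 1; }
       printf("decoded %d UTF-16 code units\n", (int)length);
       return 0;
   }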

(There are admittedly a few more Unicode charsets that are obsolete or not designed for public use, see the end of this email.)

By contrast, there are hundreds of non-Unicode charsets and CCSes (coded character sets), or many thousands if you count their implementation variations.

> And the IETF has decided to put more weight
> on UTF-8 than on other encodings (in my view for very good reasons).
> Please see RFC 2277.


I have read it before and have just re-read it. The policy is reasonable and well-intentioned.
However, it does not address issues of performance.
Where text size and/or conversion time matter, UTF-8 may not be the best choice among the Unicode charsets: CJK text, for example, takes 3 bytes per character in UTF-8 but only 2 in UTF-16 or SCSU.
UTF-8 is good and fairly simple, but it is not a "silver bullet".
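
Anyone can check this for their own data; the following sketch uses ICU's ucnv_convert() (a real ICU function; the Cyrillic sample string is only an illustration) to compare encoded sizes:

   #include <stdio.h>
   #include "unicode/ucnv.h"

   /* Sketch: measure how many bytes the same text needs in several
      Unicode charsets. ucnv_convert() pivots through UTF-16
      internally. The sample (Russian, given here as UTF-8 bytes)
      is illustrative only - run this on real data. */
   int main() {
       static const char *const names[] =
           { "UTF-8", "UTF-16BE", "SCSU", "BOCU-1" };
       static const char text[] =
           "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82";
       char out[200];
       int i;

       for (i = 0; i < 4; ++i) {
           UErrorCode status = U_ZERO_ERROR;
           int32_t length = ucnv_convert(names[i], "UTF-8",
                                         out, (int32_t)sizeof(out),
                                         text, (int32_t)(sizeof(text) - 1),
                                         &status);
           if (U_SUCCESS(status)) {
               printf("%-8s %2d bytes\n", names[i], (int)length);
           }
       }
       return 0;
   }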

My personal recommendation to developers is to consider only the above 6 Unicode charsets plus US-ASCII and ISO-8859-1 for all data transfers.

Best regards,

markus

----
Appendix A: Registered Unicode charsets that are obsolete or not for public use:

ISO-10646-UCS-2           Character Encoding Form, not a "charset"
ISO-10646-UCS-4           Character Encoding Form, not a "charset"
UNICODE-1-1               obsolete
UTF-7                     obsolete
CESU-8                    "not intended nor recommended for open interchange"
UNICODE-1-1-UTF-7         obsolete

Appendix B: Charsets whose registrations refer to UCS or Unicode but are in fact not Unicode charsets:

ISO-10646-UCS-Basic       (US-ASCII, or may just be a CCS not a charset)
ISO-10646-Unicode-Latin1  (ISO-8859-1, or may just be a CCS not a charset)
ISO-Unicode-IBM-1261
ISO-Unicode-IBM-1268
ISO-Unicode-IBM-1276
ISO-Unicode-IBM-1264
ISO-Unicode-IBM-1265