Re: RFC 2279 (UTF-8) to Full Standard



Tony Hansen wrote:

> One of the advantages about UTF8 that I've repeatedly heard touted was 
> that it was NOT restricted to 10FFFF, and indeed could handle the entire 
> 32-bit codespace when such codes were eventually allocated. This was 
> often used as an argument against other encodings, such as UTF16, that 
> didn't have the same property.

And as you can see from the passage in 10646 itself that I just quoted,
such argumentation was always a kind of shell game by detractors of UTF-16
and Unicode. The people making such arguments were not plugged in to
the process in ISO and were apparently unaware that WG2 itself was
keenly aware of the interoperability problems and eager to ensure that
all UTFs for 10646 were *equally* applicable to all characters encoded
in the standard.

And the repeated concerns about the "eventual allocation" of characters
in the 32-bit codespace that UTF-16 could not handle have reached
the status of urban legends -- endlessly repeated among those in the
Linux community who use repetition to define accuracy, without bothering
to check with the source. These urban legends are grounded neither
in the standard, nor in fact, nor in need, nor even in the capabilities of
the standards committees. At current rates it will literally take
*centuries* for the character encoding committees to fill up U+0000..U+EFFFD.
Furthermore, *all* known candidates for character encoding, generously
calculated -- and we have been scouring obscure sources now for well
over a decade, including many, many minority and historic scripts I
guarantee you will never have heard of -- will amount to less than
25% of the available codespace.
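
For anyone who wants to check the codespace arithmetic behind this
argument, here is a minimal sketch. The function names are made up for
illustration and are not taken from any standard or library; the only
inputs are the surrogate ranges D800..DBFF and DC00..DFFF and the
six-byte form described in RFC 2279.

```python
# Illustrative helpers only; the names are hypothetical, not from any library.

def utf16_max_code_point() -> int:
    """Highest code point reachable with one UTF-16 surrogate pair."""
    high_payload = 0xDBFF - 0xD800  # 10 bits carried by the high surrogate
    low_payload = 0xDFFF - 0xDC00   # 10 bits carried by the low surrogate
    return 0x10000 + (high_payload << 10) + low_payload


def rfc2279_utf8_max_code_point() -> int:
    """Highest value the six-byte UTF-8 form of RFC 2279 can carry.

    The lead byte 1111110x holds 1 payload bit and each of the five
    continuation bytes holds 6, for 31 bits in total.
    """
    payload_bits = 1 + 5 * 6
    return (1 << payload_bits) - 1


def code_points_in_range(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


if __name__ == "__main__":
    print(hex(utf16_max_code_point()))               # 0x10ffff
    print(hex(rfc2279_utf8_max_code_point()))        # 0x7fffffff
    print(code_points_in_range(0x0000, 0xEFFFD))     # 983038
```

The range U+0000..U+EFFFD therefore spans a little under a million
code points, all of which every UTF, including UTF-16, can represent.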

--Ken

P.S. Please feel free to forward this on to those who have been
repeating the urban legend. ;-)

> 
> 	Tony Hansen
> 	tony@att.com