
UTF or bare encoding



> There is actually a community of objections to UTF-2.  They are based
> on:
> 
> (1) For email purposes, and other situations with 7-bit constraints,
> UTF-2, by using an 8-bit form, requires double encoding.  There are
> direct encodings of 16 or 32 bits to 7 bits that save time and maybe
> space.

There is a well-known piece of marketing hype:

	Today, computers are fast enough

This is simply untrue. Rather,

	Today, CPUs are fast enough, but I/O is slow

Over the past 10 years, CPUs have become 1000 times faster. The other
components of a computer have become, at most, only about 10 times faster.
Moreover, the trend is expected to continue.

Thus the cost of simple processing must now be measured by the amount
of I/O (space), not by the number of CPU cycles (so-called time).
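
The space cost on a 7-bit channel can be estimated by counting octets.
The following is only a sketch under stated assumptions: it treats UTF-2
as producing the same octets as the codec Python calls "utf-8" for these
samples, it uses base64 as a stand-in for the 7-bit transfer encoding
(the thread does not name one), and wire_octets is a hypothetical helper.

```python
import base64

def wire_octets(text):
    """Return the number of 7-bit-safe octets needed to send `text`
    in each form, with base64 as the (assumed) 7-bit transfer encoding."""
    sizes = {}
    for name in ("utf-8", "utf-16-be", "utf-32-be"):
        raw = text.encode(name)                     # 8-, 16- or 32-bit form
        sizes[name] = len(base64.b64encode(raw))    # octets on the wire
    return sizes

if __name__ == "__main__":
    # An ASCII sample and a three-kanji sample (U+65E5 U+672C U+8A9E).
    for sample in ("hello", "\u65e5\u672c\u8a9e"):
        print(repr(sample), wire_octets(sample))
```

With these two samples, the 8-bit form is smallest for the ASCII text
and the 16-bit form is smallest for the kanji text, so the outcome
depends on the text being sent. The 32-bit form is largest in both.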

> (2) The variable-length nature of UTF-2 is optimal for ASCII and code
> points "low" in the 10646 sequence.  It is pretty bad for the "upper
> end" of the BMP (UNICODE, UCS-2), and could get really pathological if
> the "high end" code positions of 10646 were used.  So, to a certain
> extent, choosing it requires assuming that those higher code positions
> will never be used, or that the communities that will use them are never
> going to be important to the Internet.

That's why I improved UTF-2 to create IUTF, which has several thousand
extra code positions in the 2-octet space and many more in the 3-octet space.

> A straight 32-bit coding,
> possibly supplemented by conventional compression, does not have that
> problem.

It is completely inappropriate to assume compression here.

Compression is equally applicable to UTF-2. Proper compression of UTF-2
into 7 bits will also solve issue (1) above.

On the other hand, I don't think we will need UTF representations of
more than 5 octets very often.

Thus straight 32-bit encoding is, in fact, always worse than UTF.
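
The per-character size comparison can be sketched as follows. This
assumes UTF-2 uses the FSS-UTF length classes (1 to 6 octets covering
31 bits), which is not stated in this message; the names utf2_octets
and utf2_is_longer are hypothetical, and IUTF is not modelled because
its layout is not given here.

```python
# Largest code point representable in 1, 2, ... 6 octets, assuming the
# FSS-UTF layout. This table is an assumption, not taken from the message.
_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]

STRAIGHT_32_OCTETS = 4  # a straight 32-bit coding always uses 4 octets

def utf2_octets(code_point):
    """Return how many octets the assumed UTF-2 layout needs."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for octets, limit in enumerate(_LIMITS, start=1):
        if code_point <= limit:
            return octets
    raise ValueError("code point outside the 31-bit range")

def utf2_is_longer(code_point):
    """True when the assumed UTF-2 form exceeds the straight 32-bit form."""
    return utf2_octets(code_point) > STRAIGHT_32_OCTETS
```

Under this assumed layout, only code points above 0x1FFFFF take more
octets than the straight 32-bit form; everything in the BMP takes 3 or
fewer.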

						Masataka Ohta

PS

Of course, the worst problem with straight encoding is that we would need
two text types, which cancels most of the merit of UNIX.