Re: internationalization/ISO10646 question
(1) When UTF-8 leaks out with a BOM, that is the result of buggy software
since the BOM simply isn't needed for UTF-8.
(2) This is not an issue in most Internet protocols since the standards
have required proper charset labelling for many years. Ironically, most of
the countries with widely deployed software that violates the standards by
emitting unlabelled charsets use encodings that are very easy to
distinguish from UTF-8.
(3) The UTF-8 overlong sequence issue is sufficiently well documented that
any security problems in practice are the result of buggy code. It's an
extremely minor security issue now, particularly compared to the lookalike
character problem which impacts all encodings of Unicode and many other
character sets.
(4) If octet count is an issue, use a general-purpose compression layer,
which will vastly exceed any savings possible with encoding tricks.
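To make points (1), (3) and (4) concrete, here is a minimal sketch in
Python 3 using only the standard library. The function names are my own
illustrative choices, not anything from a spec, and the sample text in the
usage note is deliberately repetitive, so it flatters the compressor far
more than real traffic would.

```python
import codecs
import zlib


def decode_utf8_strict(data: bytes) -> str:
    """Decode UTF-8, dropping one leading BOM and rejecting malformed input.

    Python's built-in decoder already refuses overlong (non-shortest-form)
    sequences, so errors="strict" is all that is needed here.
    """
    if data.startswith(codecs.BOM_UTF8):
        # The BOM carries no information in UTF-8; discard it.
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8", errors="strict")


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed, shortest-form UTF-8."""
    try:
        decode_utf8_strict(data)
        return True
    except UnicodeDecodeError:
        return False


def compressed_size(data: bytes) -> int:
    """Octet count after a general-purpose compression layer (zlib)."""
    return len(zlib.compress(data))
```

For instance, is_valid_utf8(b"\xc0\xaf") is False because those two octets
are an overlong encoding of "/", while decode_utf8_strict(b"\xef\xbb\xbfabc")
returns "abc". For a repetitive sample, compressed_size() of the UTF-8 form
comes out well below the raw UTF-16 octet count.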
Is UTF-8 perfect? No. But the benefits greatly outweigh the costs when
compared to any other charset I've seen, and particularly when compared to
UTF-16.
- Chris
begin quotation by MURATA Makoto on 2002/12/25 11:51 +0900:
> On Fri, 06 Dec 2002 13:13:41 -0800
> Chris Newman <Chris.Newman@sun.com> wrote:
>
>>
>> UTF-16 is a terrible encoding for interoperability. There are 3
>> published non-interoperable variants of UTF-16 (big-endian,
>> little-endian, BOM/switch-endian) and only one of the variants can be
>> auto-detected with any chance of success (and none of them can be
>> auto-detected as well as UTF-8).
>
> Unfortunately, as far as I know, UTF-8 is not free of such problems.
> (1) With or without the Unicode signature, (2) possible confusion with
> other ASCII-compatible encodings (especially when a program has a few
> non-ASCII characters), (3) vulnerability caused by redundant octet
> sequences, and (4) use of 4 or 6 octets for non-BMP characters (e.g.,
> writeUTF and readUTF of java.io.DataOutput). I know that Corrigendum
> #1: UTF-8 Shortest Form addresses (3), but I am not sure if
> implementations are free of this vulnerability.
>
> I would be very happy if some encoding of Unicode becomes free of
> interoperability or security problems. But I am not happy yet.
>
> --
> MURATA Makoto <murata@hokkaido.email.ne.jp>
>