Re: Encoding Standard (mostly complete)
On Thu, 19 Apr 2012 02:33:35 +0200, Doug Ewell <doug@ewellic.org> wrote:
> Indeed, Anne's definition already diverges from the one supplied by the
> Unicode Standard.
>
> Given the sequence F8 80 80 80 80, the Unicode Standard specifies that a
> decoder should recognize F8 as an invalid UTF-8 code unit, do whatever
> it does on an error condition, and then continue with the next byte.
> This will generate 5 error conditions if handling of errors includes
> trying to continue.
>
> Anne's decoder (section 7.1) will accept the entire sequence, convert it
> to the value 0x200000, and then emit a single decoder error, generating
> only one error condition. The algorithm described on the
> infrastructure.html page, while worded differently, does the same.
>
> Considering that Anne said the existing UTF-8 definition was "minus
> [i.e. lacking] some error details," discrepancies like this seem
> especially egregious.
My apologies, I just went with HTML on this, but it seems Internet
Explorer / Safari / Chrome handle this as you say, so we should just
remove the handling of those byte sequences in this manner and make sure
Opera and Gecko are fixed.
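
For illustration only, here is a rough Python sketch of the two behaviours
discussed above. It is not the specification's algorithm: both function
names are hypothetical, the first simply leans on Python's built-in UTF-8
decoder, and the second is a deliberately simplified stand-in for the old
"swallow the whole 5-byte form" handling.

```python
def count_errors_strict(data: bytes) -> int:
    """Count errors when every invalid byte is rejected on its own."""
    # Python's built-in decoder substitutes U+FFFD for each rejected byte
    # in this input. Assumption: the input contains no literal U+FFFD,
    # otherwise this count would be inflated.
    return data.decode("utf-8", errors="replace").count("\ufffd")


def count_errors_legacy(data: bytes) -> int:
    """Hypothetical sketch: consume a 5-byte form whole, report one error."""
    errors = 0
    i = 0
    while i < len(data):
        lead = data[i]
        tail = data[i + 1:i + 5]
        is_five_byte_form = (
            0xF8 <= lead <= 0xFB
            and len(tail) == 4
            and all(0x80 <= b <= 0xBF for b in tail)
        )
        if is_five_byte_form:
            # The entire sequence is accepted, then flagged once.
            errors += 1
            i += 5
        elif lead < 0x80:
            # Plain ASCII byte, no error.
            i += 1
        else:
            # Simplification: this sketch does not decode valid 2-4 byte
            # sequences; any other non-ASCII byte counts as one error.
            errors += 1
            i += 1
    return errors


sample = bytes([0xF8, 0x80, 0x80, 0x80, 0x80])
print(count_errors_strict(sample), count_errors_legacy(sample))
```

On the sample sequence the first function reports five errors (one per
byte) and the second reports one, which is the discrepancy Doug describes.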
The bug I filed on Gecko can be found here:
https://bugzilla.mozilla.org/show_bug.cgi?id=746900
The bug I filed on Opera can be found at CORE-45840 if you have access to
our system.
The specification is fixed: http://dvcs.w3.org/hg/encoding/rev/f2f234e98474
Thank you!
--
Anne van Kesteren
http://annevankesteren.nl/