
Re: Encoding Standard (mostly complete)



On Thu, 19 Apr 2012 02:33:35 +0200, Doug Ewell <doug@ewellic.org> wrote:
> Indeed, Anne's definition already diverges from the one supplied by the  
> Unicode Standard.
>
> Given the sequence F8 80 80 80 80, the Unicode Standard specifies that a  
> decoder should recognize F8 as an invalid UTF-8 code unit, do whatever  
> it does on an error condition, and then continue with the next byte.  
> This will generate 5 error conditions if handling of errors includes  
> trying to continue.
>
> Anne's decoder (section 7.1) will accept the entire sequence, convert it  
> to the value 0x200000, and then emit a single decoder error, generating  
> only one error condition. The algorithm described on the  
> infrastructure.html page, while worded differently, does the same.
>
> Considering that Anne said the existing UTF-8 definition was "minus  
> [i.e. lacking] some error details," discrepancies like this seem  
> especially egregious.

My apologies, I just went with HTML on this, but it seems Internet  
Explorer / Safari / Chrome handle this as you say, so we should just  
remove the handling of those byte sequences in this manner and make sure  
Opera and Gecko are fixed.
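
To make the difference concrete, here is a minimal Python sketch. It is not
taken from the specification or from any browser. count_errors_per_byte relies
on the built-in UTF-8 codec of a current Python 3, which as far as I know
reports each invalid byte separately. count_errors_legacy is a hypothetical
helper written only for this illustration: it treats F8..FB as the lead byte
of an old-style 5-byte sequence, swallows up to four continuation bytes, and
reports a single error. Both function names are assumptions of this sketch.

```python
SEQUENCE = bytes([0xF8, 0x80, 0x80, 0x80, 0x80])


def count_errors_per_byte(data: bytes) -> int:
    """Count the U+FFFD characters produced by Python's own UTF-8 decoder."""
    return data.decode("utf-8", errors="replace").count("\ufffd")


def count_errors_legacy(data: bytes) -> int:
    """Hypothetical legacy behaviour: one error per old-style 5-byte sequence.

    Only ASCII bytes and F8..FB lead bytes are handled; anything else is
    outside the scope of this sketch and raises ValueError.
    """
    errors = 0
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # Plain ASCII: nothing to report.
            i += 1
        elif 0xF8 <= byte <= 0xFB:
            # Consume the lead byte plus up to four continuation bytes,
            # then report the whole run as a single error.
            i += 1
            taken = 0
            while taken < 4 and i < len(data) and 0x80 <= data[i] <= 0xBF:
                i += 1
                taken += 1
            errors += 1
        else:
            raise ValueError("byte 0x%02X is not covered by this sketch" % byte)
    return errors


if __name__ == "__main__":
    print("per-byte errors:", count_errors_per_byte(SEQUENCE))
    print("legacy errors:  ", count_errors_legacy(SEQUENCE))
```

The first function corresponds to the behaviour Doug describes for the Unicode
Standard, the second to what the draft algorithm did before the fix.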

The bug I filed on Gecko can be found here:  
https://bugzilla.mozilla.org/show_bug.cgi?id=746900

The bug I filed on Opera can be found at CORE-45840 if you have access to  
our system.

The specification is fixed: http://dvcs.w3.org/hg/encoding/rev/f2f234e98474

Thank you!


-- 
Anne van Kesteren
http://annevankesteren.nl/