
Re: Encoding Standard (mostly complete)

To: Bjoern Hoehrmann <derhoermi@gmx.net>, Doug Ewell <doug@ewellic.org>
Subject: Re: Encoding Standard (mostly complete)
From: Anne van Kesteren <annevk@opera.com>
Date: Thu, 19 Apr 2012 09:09:56 +0200
Cc: ietf-charsets <ietf-charsets@iana.org>
In-reply-to: <C4F120555D3C44CA90D9E59E5D7F1CF6@DougEwell>
Organization: Opera Software
References: <op.wcwk2xxc64w2qv@annevk-macbookpro.local><c8iso7p2be97oohvd9v8kcim9njo1ds86n@hive.bjoern.hoehrmann.de><C4F120555D3C44CA90D9E59E5D7F1CF6@DougEwell>
User-Agent: Opera Mail/11.62 (MacIntel)

On Thu, 19 Apr 2012 02:33:35 +0200, Doug Ewell <doug@ewellic.org> wrote:
> Indeed, Anne's definition already diverges from the one supplied by the  
> Unicode Standard.
>
> Given the sequence F8 80 80 80 80, the Unicode Standard specifies that a  
> decoder should recognize F8 as an invalid UTF-8 code unit, do whatever  
> it does on an error condition, and then continue with the next byte.  
> This will generate 5 error conditions if handling of errors includes  
> trying to continue.
>
> Anne's decoder (section 7.1) will accept the entire sequence, convert it  
> to the value 0x200000, and then emit a single decoder error, generating  
> only one error condition. The algorithm described on the  
> infrastructure.html page, while worded differently, does the same.
>
> Considering that Anne said the existing UTF-8 definition was "minus  
> [i.e. lacking] some error details," discrepancies like this seem  
> especially egregious.

My apologies, I just followed what HTML specifies here, but it seems
Internet Explorer / Safari / Chrome handle this as you say, so we should
stop treating those byte sequences as a single unit and make sure Opera
and Gecko are fixed.
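
The difference in error counts described above can be sketched roughly as
follows. This is only an illustration, not the text of either
specification: the function names and the regular expression are made up
for this example, the "legacy" path models only the five-byte form with
lead bytes F8 to FB, and Python's built-in decoder stands in for a decoder
that reports an error per invalid byte.

```python
import re

# Hypothetical model of the pre-fix behaviour: a lead byte F8..FB followed
# by up to four continuation bytes is swallowed as one unit.
LEGACY_FIVE_BYTE_FORM = re.compile(rb"[\xf8-\xfb][\x80-\xbf]{0,4}")


def strict_error_count(data: bytes) -> int:
    """Count errors when every invalid byte is reported on its own.

    Uses Python's built-in UTF-8 decoder and counts U+FFFD replacements.
    Assumes the input does not itself contain an encoded U+FFFD.
    """
    return data.decode("utf-8", errors="replace").count("\ufffd")


def legacy_error_count(data: bytes) -> int:
    """Count errors when a five-byte form yields a single decoder error."""
    # Each swallowed five-byte form counts as exactly one error.
    long_forms = len(LEGACY_FIVE_BYTE_FORM.findall(data))
    # Everything between those forms is decoded separately and strictly.
    pieces = LEGACY_FIVE_BYTE_FORM.split(data)
    return long_forms + sum(strict_error_count(p) for p in pieces)


sequence = bytes([0xF8, 0x80, 0x80, 0x80, 0x80])
print(strict_error_count(sequence))  # 5
print(legacy_error_count(sequence))  # 1
```

With the sequence from Doug's message, the per-byte approach reports five
errors and the swallowing approach reports one, which is the discrepancy
the specification change is meant to remove.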

The bug I filed on Gecko can be found here:  
https://bugzilla.mozilla.org/show_bug.cgi?id=746900

The bug I filed on Opera can be found at CORE-45840 if you have access to  
our system.

The specification is fixed: http://dvcs.w3.org/hg/encoding/rev/f2f234e98474

Thank you!


-- 
Anne van Kesteren
http://annevankesteren.nl/
