Re: Encoding Standard (mostly complete)



Bjoern Hoehrmann wrote:

> What is your reasoning behind "defining" how to decode UTF-8? It seems
> to me this is well understood and does not require yet another speci-
> fication. Anyone wanting to implement a UTF-8 decoder would have to
> compare your proposal to the other specifications to see if there are
> any differences, and if there are any differences, find out or decide
> if that's due to errors in your specification, and whether they want
> to adopt your specification rather than any of the others. That's not
> a good use of anyone's resources.

Indeed, Anne's definition already diverges from the one supplied by the 
Unicode Standard.

Given the sequence F8 80 80 80 80, the Unicode Standard specifies that a 
decoder should recognize F8 as an invalid UTF-8 code unit, do whatever 
it does on an error condition, and then continue with the next byte. 
Each of the four trailing 80 bytes is likewise invalid on its own, so 
this will generate 5 error conditions if handling of errors includes 
trying to continue.
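
To make the count concrete, here is a minimal Python sketch of a 
decoder loop that follows the well-formed byte ranges of the Unicode 
Standard and resumes at the next unconsumed byte after each error. The 
function name count_utf8_errors is hypothetical, and the sketch only 
counts error conditions; it is an illustration, not a reference 
implementation.

```python
def count_utf8_errors(data: bytes) -> int:
    """Count error conditions, resuming at the next unconsumed byte."""
    errors = 0
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:
            i += 1
            continue
        # (trailing bytes needed, allowed range for the first trailing byte)
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # excludes overlong forms
        elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # excludes surrogates
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # excludes overlong forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # excludes values above U+10FFFF
        else:
            # 80..BF, C0, C1 and F5..FF can never start a well-formed sequence
            errors += 1
            i += 1
            continue
        i += 1
        ok = True
        for _ in range(need):
            if i >= n or not (lo <= data[i] <= hi):
                ok = False
                break
            i += 1
            lo, hi = 0x80, 0xBF  # later trailing bytes use the full range
        if not ok:
            errors += 1  # the offending byte is not consumed; it is retried
    return errors


print(count_utf8_errors(bytes.fromhex("F880808080")))
```

For F8 80 80 80 80, every byte falls into the final "can never start a 
sequence" branch, so each one is reported separately.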

Anne's decoder (section 7.1) will accept the entire sequence, convert it 
to the value 0x200000, and then emit a single decoder error, generating 
only one error condition. The algorithm described on the 
infrastructure.html page, while worded differently, does the same.

Considering that Anne said the existing UTF-8 definition was "minus 
[i.e. lacking] some error details," discrepancies like this seem 
especially egregious.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell