
Re: Encoding Standard (mostly complete)



(2012/04/19 9:33), Doug Ewell wrote:
> Given the sequence F8 80 80 80 80, the Unicode Standard specifies that a
> decoder should recognize F5 as an invalid UTF-8 code unit, do whatever
> it does on an error condition, and then continue with the next byte.
> This will generate 5 error conditions if handling of errors includes
> trying to continue.
Where does TUS define this? It seems to contradict TUS 6.1.0 p.96:
http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf#page=42
|Although a UTF-8 conversion process is required to never consume
|well-formed subsequences as part of its error handling for ill-formed
|subsequences, such a process is not otherwise constrained in how it
|deals with any ill-formed subsequence itself. An ill-formed subsequence
|consisting of more than one code unit could be treated as a single
|error or as multiple errors. For example, in processing the UTF-8 code
|unit sequence <F0 80 80 41>, the only formal requirement mandated by
|Unicode conformance for a converter is that the <41> be processed and
|correctly interpreted as <U+0041>. The converter could return
|<U+FFFD, U+0041>, handling <F0 80 80> as a single error, or
|<U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a
|separate error, or could take other approaches to signalling <F0 80 80>
|as an ill-formed code unit subsequence.
Avoiding this kind of vagueness is exactly the purpose of the Encoding
Standard.
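
To make the difference concrete, here is a rough sketch of the two
behaviours the quoted passage permits. The function name decode_utf8 and
its per_byte flag are made up for illustration; this is not the algorithm
of TUS or of the Encoding Standard, only a toy decoder that leans on
Python's strict UTF-8 codec to recognise well-formed sequences.

```python
REPLACEMENT = "\ufffd"

def decode_utf8(data: bytes, per_byte: bool) -> str:
    """Toy decoder (hypothetical helper, for illustration only).

    per_byte=True  : every ill-formed byte yields its own U+FFFD.
    per_byte=False : a run of consecutive ill-formed bytes yields one U+FFFD.
    Well-formed sequences are never swallowed by either policy, because a
    well-formed sequence is looked for at every position.
    """
    out = []
    i = 0
    in_error_run = False
    while i < len(data):
        char = None
        used = 0
        # Try the shortest well-formed sequence starting at position i.
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                char = chunk.decode("utf-8")  # strict: raises on ill-formed input
                used = len(chunk)
                break
            except UnicodeDecodeError:
                continue
        if char is not None:
            out.append(char)
            i += used
            in_error_run = False
        else:
            # Ill-formed byte: emit U+FFFD per byte, or once per run.
            if per_byte or not in_error_run:
                out.append(REPLACEMENT)
            in_error_run = True
            i += 1
    return "".join(out)

# The example from TUS: <F0 80 80 41>
print(ascii(decode_utf8(b"\xf0\x80\x80\x41", per_byte=False)))
print(ascii(decode_utf8(b"\xf0\x80\x80\x41", per_byte=True)))
```

With per_byte=False the input <F0 80 80 41> comes out as <U+FFFD, U+0041>,
and with per_byte=True as <U+FFFD, U+FFFD, U+FFFD, U+0041>. Both are
conformant according to the text quoted above, which is the point: two
conformant decoders can disagree on the output.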
-- 
VYV03354@nifty.ne.jp