Re: Proposed changes to UTF-8 draft

To: Francois Yergeau <FYergeau@alis.com>
Subject: Re: Proposed changes to UTF-8 draft
From: Keld Jørn Simonsen <keld@dkuug.dk>
Date: Fri, 10 Jan 2003 21:05:56 +0100
Cc: ietf-charsets@iana.org
In-reply-to: <F7D4BDA0E5A1D14B99D32C022AEB7366A507D2@alis-2k.alis.domain>

On Fri, Jan 10, 2003 at 02:47:53PM -0500, Francois Yergeau wrote:
> Keld Jørn Simonsen wrote:
> > I think we should keep ourselves to open standards whenever possible,
> > and avoid industry standards like Unicode if we can.
> 
> I dispute the characterization of ISO standards as open.  The
> standardization process is totally closed (only National Bodies can play)
> and the standards themselves, with few exceptions not including 10646, are
> available only for money.

I would characterize that as FUD. You can join the national bodies,
and that is feasible even for one-person firms. That they are national
means that you can influence the specifications without having big
travel expenses. Yes, most ISO standards cost money, but if you want the
information, the standards are available to you in most countries via
the public library systems, for free.

> > 10646 is pretty explicit about not using surrogates in UTF-8,
> > as far as I know. Always was.
> 
> Please re-read Annex D.  The only mention is this Note:
> 
>   NOTE 1 - Values of x in the range 0000 D800 .. 0000 DFFF
>   are reserved for the UTF-16 form and do not occur in UCS-4.
>   The values 0000 FFFE and 0000 FFFF also do not occur
>   (see clause 8). The mappings of these code positions in
>   UTF-8 are undefined.
> 
> There's a later section D.7 "Incorrect sequences of octets: Interpretation
> by receiving devices" which is totally silent on decoding surrogates and
> overlong sequences.

That is because UTF-8, as defined in ISO 10646, only encodes characters
defined in 10646. "Surrogates" are not characters, so they "do not
occur" in UTF-8.
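
To make the point concrete, here is a minimal sketch of an encoder that
applies that reading: it refuses the code positions the Annex D note
excludes (D800..DFFF, FFFE and FFFF) instead of emitting octets for them.
The function name encode_utf8 is hypothetical, and capping the range at
10FFFF (four octets) is an assumption made for brevity, not something
taken from the Annex D text.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code position as UTF-8 (illustrative sketch only)."""
    # Surrogate values are reserved for UTF-16 and are not characters.
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate value {cp:04X} has no UTF-8 mapping")
    # FFFE and FFFF "also do not occur" according to the quoted note.
    if cp in (0xFFFE, 0xFFFF):
        raise ValueError(f"code position {cp:04X} has no UTF-8 mapping")
    if cp < 0:
        raise ValueError("negative code position")
    if cp < 0x80:
        # One octet: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # Two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    if cp < 0x110000:
        # Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Assumption: this sketch stops at 10FFFF.
    raise ValueError(f"code position {cp:X} is outside the range of this sketch")
```

With this, encode_utf8(0x20AC) gives the three octets E2 82 AC, while
encode_utf8(0xD800) raises ValueError rather than producing ED A0 80.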

Kind regards
keld

Follow-Ups:
- Re: Proposed changes to UTF-8 draft
  - From: Roozbeh Pournader <roozbeh@sharif.edu>

References:
- RE: Proposed changes to UTF-8 draft
  - From: Francois Yergeau <FYergeau@alis.com>
