RE: Revised proposal for UTF-16

To: Larry Masinter <masinter@parc.xerox.com>
Subject: RE: Revised proposal for UTF-16
From: Chris Newman <Chris.Newman@INNOSOFT.COM>
Date: Mon, 27 Jul 1998 11:52:38 -0700 (PDT)
Cc: ietf-charsets@ISI.EDU
In-reply-to: <001e01bdb707$866d3540$15d0000d@copper-208.parc.xerox.com>

On Fri, 24 Jul 1998, Larry Masinter wrote:
> I think we're getting into trouble in this case because we're trying
> to examine all of the possible senders and receivers of UTF-16 and then
> defining when they should or shouldn't include a BOM. However, if you
> had a registered charset, call it "marked-utf-16" with definition:
> 
>      Either big-endian UTF-16
>      or a single BOM followed by little-endian UTF-16
> 
> then it would seem to be clear what a sender should send and what
> a receiver should receive, without all of this complex case analysis.

I find both this solution and the two charsets solution to be acceptable.

Here's the ABNF for UTF-16, where the BOM is optional for
network/big-endian byte order and mandatory for little-endian byte order.
This is unambiguously parsable with one octet of lookahead.  If the
little-endian variation is eliminated, then it's unambiguously parsable
without lookahead.

UTF-16         = UTF-16BE-STR / UTF-16LE-STR

UTF-16BE-STR   = *UTF-16BE-CHAR
UTF-16BE-CHAR  = UTF-16BE-LO / UTF-16BE-HI / UTF-16BE-SUR
UTF-16BE-LO    = (%x00-d7 / %xe0-fe) %x00-ff
UTF-16BE-HI    = %xff %x00-fd
UTF-16BE-SUR   = %xd8-db %x00-ff %xdc-df %x00-ff

UTF-16LE-STR   = %xff %xfe *UTF-16LE-CHAR
UTF-16LE-CHAR  = UTF-16LE-LO / UTF-16LE-HI / UTF-16LE-SUR
UTF-16LE-LO    = %x00-ff (%x00-d7 / %xe0-fe)
UTF-16LE-HI    = %x00-fd %xff
UTF-16LE-SUR   = %x00-ff %xd8-db %x00-ff %xdc-df

Note that this permits the BOM to be part of the data, so the XML spec
would be compliant with this.
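
To make the lookahead argument concrete, here is a rough Python sketch of a
reader for the grammar above.  It is only an illustration under my reading
of the ABNF: the function name split_utf16 and its return shape are made up
for this example and are not part of any proposal.  The one octet of
lookahead is needed because a big-endian string may begin with %xff, but
then the next octet must be %x00-fd, so %xff %xfe can only be the
little-endian marker.

```python
def split_utf16(data: bytes):
    """Return (byte_order, code_units) for octets matching the ABNF above.

    Hypothetical helper for illustration. Raises ValueError when the octets
    match neither UTF-16BE-STR nor UTF-16LE-STR.
    """
    # FF FE at the start can only be the mandatory little-endian BOM,
    # because UTF-16BE-HI forbids FE after a leading FF.
    if data[:2] == b"\xff\xfe":
        order, body = "little", data[2:]
    else:
        order, body = "big", data
    if len(body) % 2:
        raise ValueError("odd number of octets")

    # Assemble 16-bit code units in the detected byte order.
    units = [int.from_bytes(body[i:i + 2], order)
             for i in range(0, len(body), 2)]

    i = 0
    while i < len(units):
        u = units[i]
        if u >= 0xFFFE:
            # The -HI rules stop at FFFD, so FFFE and FFFF never appear.
            raise ValueError("FFFE/FFFF not allowed as data")
        if 0xDC00 <= u <= 0xDFFF:
            raise ValueError("low surrogate without a high surrogate")
        if 0xD800 <= u <= 0xDBFF:
            # The -SUR rules require a high surrogate followed by a low one.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("high surrogate without a low surrogate")
            i += 1
        i += 1
    return order, units
```

With this sketch, split_utf16(b"\x00A") and split_utf16(b"\xff\xfeA\x00")
both yield the single code unit 0x41, as "big" and "little" respectively.
A big-endian BOM is not stripped: split_utf16(b"\xfe\xff\x00A") returns
0xFEFF as the first code unit, which is the "BOM as part of the data" case.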

I sure wish that Unicode/ISO-10646 had specified that network byte order
is required for use in files and on networks; then we might not have had
this problem.  This is just repeating the TIFF magic number mistake on a
grander scale.

		- Chris
