
RE: Revised proposal for UTF-16



On Fri, 24 Jul 1998, Larry Masinter wrote:
> I think we're getting into trouble in this case because we're trying
> to examine all of the possible senders and receivers of UTF-16 and then
> defining when they should or shouldn't include a BOM. However, if you
> had a registered charset, call it "marked-utf-16" with definition:
> 
>      Either big-endian UTF-16
>      or a single BOM followed by little-endian UTF-16
> 
> then it would seem to be clear what a sender should send and what
> a receiver should receive, without all of this complex case analysis.

I find both this solution and the two charsets solution to be acceptable.

Here's the ABNF for UTF-16, where the BOM is optional for
network/big-endian byte-order and mandatory for little-endian byte-order. 
This is unambiguously parsable with one octet of lookahead.  If the
little-endian variation is eliminated, then it's unambiguously parsable
without lookahead. 

UTF-16         = UTF-16BE-STR / UTF-16LE-STR

UTF-16BE-STR   = *UTF-16BE-CHAR
UTF-16BE-CHAR  = UTF-16BE-LO / UTF-16BE-HI / UTF-16BE-SUR
UTF-16BE-LO    = (%x00-d7 / %xe0-fe) %x00-ff
UTF-16BE-HI    = %xff %x00-fd
UTF-16BE-SUR   = %xd8-db %x00-ff %xdc-df %x00-ff

UTF-16LE-STR   = %xff %xfe *UTF-16LE-CHAR
UTF-16LE-CHAR  = UTF-16LE-LO / UTF-16LE-HI / UTF-16LE-SUR
UTF-16LE-LO    = %x00-ff (%x00-d7 / %xe0-fe)
UTF-16LE-HI    = %x00-fd %xff
UTF-16LE-SUR   = %x00-ff %xd8-db %x00-ff %xdc-df

Note that this permits the BOM to be part of the data, so the XML spec
would be compliant with this.
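
As a rough illustration of how a receiver might apply this grammar, here is
a minimal Python sketch. The function names (split_marked_utf16, code_units)
are hypothetical and not part of any proposal; the sketch only assumes the
rules above: a leading FF FE selects little-endian and is dropped, anything
else is read as big-endian, and a leading FE FF stays in the data as U+FEFF.

```python
def split_marked_utf16(data: bytes):
    """Return (byte_order, payload) for the grammar above (hypothetical helper)."""
    # UTF-16LE-STR: the mandatory FF FE prefix is a marker, not payload.
    # After reading FF, the next octet (FE or not) decides the byte order.
    if data[:2] == b"\xff\xfe":
        return "little", data[2:]
    # UTF-16BE-STR: nothing is consumed, so a leading FE FF remains data.
    return "big", data


def code_units(data: bytes):
    """Validate against the grammar and return (byte_order, 16-bit code units)."""
    order, payload = split_marked_utf16(data)
    if len(payload) % 2:
        raise ValueError("odd number of octets")
    units = [int.from_bytes(payload[i:i + 2], order)
             for i in range(0, len(payload), 2)]
    i = 0
    while i < len(units):
        u = units[i]
        if u >= 0xFFFE:
            # The -HI rules stop at FFFD, so FFFE and FFFF are not allowed.
            raise ValueError("0xFFFE/0xFFFF not permitted")
        if 0xD800 <= u <= 0xDBFF:
            # The -SUR rules require a low surrogate right after a high one.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            i += 2
            continue
        if 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        i += 1
    return order, units
```

For example, code_units(b"\xff\xfeA\x00") yields little-endian with the single
unit 0x41, while code_units(b"\xfe\xff\x00A") yields big-endian with 0xFEFF
kept as the first unit.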

I sure wish that Unicode/ISO-10646 had specified that network byte order
is required for use in files and on networks; then we might not have had
this problem.  This is just repeating the TIFF magic number mistake on a
grander scale. 

		- Chris