RE: Revised proposal for UTF-16
On Fri, 24 Jul 1998, Larry Masinter wrote:
> I think we're getting into trouble in this case because we're trying
> to examine all of the possible senders and receivers of UTF-16 and then
> defining when they should or shouldn't include a BOM. However, if you
> had a registered charset, call it "marked-utf-16" with definition:
>
> Either big-endian UTF-16
> or a single BOM followed by little-endian UTF-16
>
> then it would seem to be clear what a sender should send and what
> a receiver should receive, without all of this complex case analysis.
I find both this solution and the two charsets solution to be acceptable.
Here's the ABNF for UTF-16, where the BOM is optional for
network/big-endian byte-order and mandatory for little-endian byte-order.
This is unambiguously parsable with one octet of lookahead. If the
little-endian variation is eliminated, then it's unambiguously parsable
without lookahead.
UTF-16 = UTF-16BE-STR / UTF-16LE-STR
UTF-16BE-STR = *UTF-16BE-CHAR
UTF-16BE-CHAR = UTF-16BE-LO / UTF-16BE-HI / UTF-16BE-SUR
UTF-16BE-LO = (%x00-d7 / %xe0-fe) %x00-ff
UTF-16BE-HI = %xff %x00-fd
UTF-16BE-SUR = %xd8-db %x00-ff %xdc-df %x00-ff
UTF-16LE-STR = %xff %xfe *UTF-16LE-CHAR
UTF-16LE-CHAR = UTF-16LE-LO / UTF-16LE-HI / UTF-16LE-SUR
UTF-16LE-LO = %x00-ff (%x00-d7 / %xe0-fe)
UTF-16LE-HI = %x00-fd %xff
UTF-16LE-SUR = %x00-ff %xd8-db %x00-ff %xdc-df
Note that this permits the BOM to be part of the data, so the XML spec
would be compliant with this.
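For anyone who wants to try the grammar out, here is a minimal sketch in Python of a receiver that follows it. The function name parse_marked_utf16 and its return shape are my own invention for illustration and are not part of any proposal. It assumes the input is the complete octet string, and it only checks the grammar; it does not decode surrogate pairs into characters.

```python
def parse_marked_utf16(data):
    """Check octets against the grammar; return (byte_order, code_units).

    Hypothetical helper, not part of any registered charset definition.
    Raises ValueError if the octets do not match the grammar.
    """
    if len(data) % 2:
        raise ValueError("odd number of octets")

    # One octet of lookahead: 0xFF 0xFE can only be the little-endian BOM,
    # because UTF-16BE-HI allows 0xFF to be followed only by 0x00-0xFD.
    if data[:2] == b"\xff\xfe":
        order, body = "little", data[2:]
    else:
        order, body = "big", data

    units = [int.from_bytes(body[i:i + 2], order)
             for i in range(0, len(body), 2)]

    i = 0
    while i < len(units):
        u = units[i]
        if u >= 0xFFFE:
            # The -HI rules stop at 0xFFFD, so 0xFFFE and 0xFFFF never match.
            raise ValueError("code unit %#06x not allowed" % u)
        if 0xD800 <= u <= 0xDBFF:
            # The -SUR rules: a high surrogate must be followed by a low one.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            i += 1
    return order, units
```

Because a leading FE FF matches UTF-16BE-LO, the sketch hands a big-endian BOM back to the caller as the code unit 0xFEFF rather than stripping it, which is the "BOM as part of the data" case mentioned above.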
I sure wish that Unicode/ISO-10646 had specified that network byte order
is required for use in files and on networks; then we might not have had
this problem. This is just repeating the TIFF magic number mistake on a
grander scale.
- Chris