Re: Revised proposal for UTF-16



I think we are converging, but minor differences remain.  Is little-endian
a SHOULD NOT or a MUST NOT?  Is the BOM mandatory or recommended?

1. Harald Alvestrand 

 UTF-16 generators MUST send in big-endian byte order.

 NOTE: Some implementations that do not conform to this specification
 have occasionally sent data in little-endian byte order. When they do
 this, they commonly precede the data with a zero width non breaking
 space (also called Byte Order Mark or BOM) (0xFEFF).
 Thus, an UTF-16 parser encountering the code 0xFFFE as the first
 character of a purported UTF-16 stream may safely assume that he
 has encountered a nonconformant data source.  There is no way to 100%
 reliably detect little-endian data that does not use the BOM.

2. Dan Kegel (in my interpretation)

   UTF-16 generators must begin with the BOM.  They SHOULD [MUST?] NOT send in 
   little-endian byte order, but if they do, they MUST prefix the stream 
   with a little-endian BOM.  UTF-16 consumers MUST assume the default 
   byte-order is big-endian, but MUST also accept little-endian if prefixed 
   with a little-endian BOM.
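
As an illustration only (it is not part of anyone's proposal), a consumer
following the rules in 2 might be sketched in Python as below.  The function
name decode_utf16 is hypothetical, and I assume the whole stream is available
as a byte string.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode a UTF-16 byte stream under the rules sketched in proposal 2.

    Big-endian is the default; little-endian is accepted only when the
    stream starts with a little-endian BOM (bytes FF FE).
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    # No BOM at all: fall back to the default byte order, big-endian.
    return data.decode("utf-16-be")
```

Note that the BOM is stripped from the result here; whether a consumer
should keep or drop it is a separate question that this sketch does not
try to settle.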

3. My proposal

I would like to reduce useless options.  Little endian is fine, but it 
should be used only in local environments.  UTF-16 without the BOM is fine, 
but it too should be used only in local environments.

Here is my proposal.

 UTF-16 generators MUST send in big-endian byte order and MUST begin with the 
 zero width non breaking space (also called Byte Order Mark or BOM) (0xFEFF).

 NOTE: Some implementations that do not conform to this specification
 have occasionally sent data in little-endian byte order. When they do
 this, they commonly precede the data with the BOM.
 Thus, a UTF-16 parser encountering the code 0xFFFE as the first
 character of a purported UTF-16 stream may safely assume that it
 has encountered a nonconformant data source.  If the BOM is absent, 
 there is no way to detect little-endian data with 100% reliability.
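
To make the intent concrete, here is a minimal Python sketch of a generator
and of the check described in the NOTE.  The names encode_utf16 and classify
and the returned strings are hypothetical; they are only for illustration.

```python
import codecs


def encode_utf16(text: str) -> bytes:
    """Generator side: BOM (0xFEFF) first, then big-endian code units."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def classify(data: bytes) -> str:
    """Parser side: read the first 16-bit code as big-endian and inspect it."""
    if len(data) < 2:
        return "no BOM"
    first = int.from_bytes(data[:2], "big")
    if first == 0xFEFF:
        return "conformant"
    if first == 0xFFFE:
        # A byte-swapped BOM: the sender used little-endian byte order.
        return "nonconformant (little-endian)"
    # Without a BOM the byte order cannot be determined reliably.
    return "no BOM"
```

For example, classify(encode_utf16("A")) gives "conformant", while a stream
that starts with the bytes FF FE is reported as nonconformant.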

Makoto
 
Fuji Xerox Information Systems
 
Tel: +81-44-812-7230   Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp