
Registration of new charset 'unicode'



Charset name:
     'unicode'

Charset aliases:
      The 'unicode' spec defines 'utf-16' as its alias, but this of
      course conflicts with 'utf-16' as defined in the IANA registry.

Suitability for use in MIME text:
      The 'unicode' charset has the same MIME text media issues as
      utf-16. [1]
  [1] http://tools.ietf.org/rfc/rfc2781.txt

Published specification(s):
      Microsoft's 'Character Set Recognition' document, [2]
      together with the 'Code Page Identifiers' document. [3]
  [2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
  [3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx

ISO 10646 equivalency table:
      The 'unicode' charset represents codepage 1200, whose definition
      is: [3] 'Unicode UTF-16, little endian byte order (BMP of ISO 
      10646);'
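
      To illustrate the byte order named in that definition, here is a
      minimal Python sketch. It assumes that Python's 'utf-16-le' codec
      is an acceptable stand-in for codepage 1200 (Python has no codec
      named 'unicode'); the sample string is arbitrary.

```python
# Sketch only: 'utf-16-le' is used here as a stand-in for codepage 1200.
sample = "A\u00e5"  # two BMP characters: U+0041 and U+00E5

little_endian = sample.encode("utf-16-le")  # low byte of each code unit first
big_endian = sample.encode("utf-16-be")     # high byte of each code unit first

print(little_endian)  # b'A\x00\xe5\x00'
print(big_endian)     # b'\x00A\x00\xe5'
```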

Additional information:
    * Byte order mark (BOM): The 'unicode' charset specifications do 
not say whether the BOM is required or recommended. However, without 
it, products may fail to determine the encoding. In addition, the BOM 
allows products that do not support 'unicode' to treat the encoding as 
'utf-16'. It is therefore not surprising that products that label 
documents as 'unicode' tend to include the BOM. The BOM should thus be 
seen as strongly recommended in 'unicode'-encoded documents.
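
      The role of the BOM described above can be sketched in Python as 
follows. The function name sniff_utf16_bom is hypothetical and is not 
taken from any of the cited specifications; the sketch only shows how 
the first two bytes reveal the byte order regardless of the label.

```python
import codecs

def sniff_utf16_bom(data: bytes):
    """Return the byte order signalled by a UTF-16 BOM, or None if absent.

    Hypothetical helper for illustration. It does not try to tell a
    UTF-16 LE BOM apart from a UTF-32 LE BOM (FF FE 00 00).
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None  # no BOM: the byte order must come from elsewhere
```

      Without a BOM the function returns None, which is the case where 
a product has to fall back on a label or on a default.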
      ISSUES to consider before adding support for the 'unicode' 
charset in a product:
  (1) Document-internal charset declarations: If the label says 
'unicode' but the resource (including the BOM) is big-endian, products 
(including those that support 'unicode') tend to rely on the BOM and 
ignore the charset label. Note, however, that error handling (e.g. 
mislabelling, including unknown labels, is a fatal error per XML 1.0 
5th edition) and encoding detection (e.g. as described in Appendix F of 
XML 1.0 5th edition) can also make the charset label technically 
irrelevant. Finally, internal declarations of 16-bit encodings tend to 
have no effect on encoding determination in 16-bit encoded documents. 
In fact, the various utf-16 labels tend to have more effect inside 
8-bit encoded documents, where they tend to be treated like UTF-8 
declarations.
  (2) Document-external encoding declarations:
      (a) Products that implement Microsoft's 'unicode' specifications 
(in particular the Web browsers Internet Explorer and WebKit) also tend 
to ignore charset info from HTTP for documents that include the BOM. 
By contrast, current web standards (HTTP, XML 1.0, HTML 4 and HTML 5) 
tend to treat the charset set by the higher-level protocol as 
authoritative. Thus, adding support for 'unicode' will not bring 
products closer together in those cases where the problem is 
disagreement about the relative priority of BOM and HTTP.
      (b) Little-endian default: If the BOM is missing while the 
'unicode' label is present (and detected), then products that support 
the 'unicode' charset will default to little-endian. (RFC2781, by 
contrast, says that they should default to big-endian.)
      (c) The 'utf-16' label: Because the 'unicode' specs define 
'utf-16' as an alias for 'unicode', a missing BOM when the 'utf-16' 
label is present (and detected) will cause a little-endian default as 
well. This goes against the utf-16 specification [RFC2781], which for 
'utf-16' asks for a default to big-endian.
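
      The behaviours in (a) to (c) can be sketched in Python as below. 
The function decode_16bit and its unicode_semantics flag are 
hypothetical names introduced for illustration only; the sketch models 
the tendencies described above and is not any product's actual 
algorithm.

```python
import codecs

def decode_16bit(data: bytes, label: str, unicode_semantics: bool) -> str:
    """Decode 16-bit text from a charset label (e.g. taken from HTTP).

    unicode_semantics=True models a product that follows the 'unicode'
    specs; False models the RFC 2781 default for 'utf-16'.
    """
    # (a) A BOM, when present, decides the byte order and the label is ignored.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")

    name = label.strip().lower()
    # (b) 'unicode' without a BOM defaults to little-endian.
    # (c) 'utf-16' treated as an alias of 'unicode' does the same.
    if name == "unicode" or (name == "utf-16" and unicode_semantics):
        return data.decode("utf-16-le")
    # RFC 2781 default for unmarked 'utf-16': big-endian.
    if name == "utf-16":
        return data.decode("utf-16-be")
    raise LookupError(f"label not handled in this sketch: {label!r}")

unmarked = b"A\x00"  # little-endian "A" without a BOM
print(decode_16bit(unmarked, "utf-16", unicode_semantics=True))          # A
print(decode_16bit(unmarked, "utf-16", unicode_semantics=False) == "A")  # False
```

      The same unmarked bytes under the same 'utf-16' label thus yield 
different text depending on which default the product applies.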
 
Intended usage:
      LIMITED USE. 'unicode' is used by several Microsoft products 
(.NET, Internet Explorer and more) and by products that want to be 
compatible with them, such as WebKit.

Person & email address to contact for further information:
      Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no