Registration of new charset 'unicode'
Charset name:
'unicode'
Charset aliases:
The 'unicode' specification defines 'utf-16' as its alias, but this of
course conflicts with 'utf-16' as defined in the IANA registry.
Suitability for use in MIME text:
The 'unicode' charset has the same MIME text media issue as utf-16. [1]
[1] http://tools.ietf.org/rfc/rfc2781.txt
Published specification(s):
Microsoft's 'Character Set Recognition' document, [2]
together with the 'Code Page Identifiers' document. [3]
[2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
[3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
ISO 10646 equivalency table:
The 'unicode' charset represents codepage 1200, whose definition
is: [3] 'Unicode UTF-16, little endian byte order (BMP of ISO
10646);'
Additional information:
* Byte order mark (BOM): The 'unicode' charset specifications do
not explain whether the BOM is required or recommended. However,
without it, products may fail to determine the encoding. The BOM also
allows products that do not support 'unicode' to perceive the
encoding as 'utf-16'. It is therefore not surprising that products
that label documents with the 'unicode' label tend to include the BOM.
The BOM should thus be seen as strongly recommended in
'unicode'-encoded documents.
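As an illustration of why the BOM matters, here is a minimal sketch of
BOM sniffing for 16-bit encodings. The function name sniff_utf16_bom is
hypothetical and not taken from any of the cited specifications; the
sketch only shows that the first two bytes are enough to settle the
byte order, and that nothing can be concluded when they are missing.

```python
import codecs

def sniff_utf16_bom(data: bytes):
    """Return 'le', 'be', or None depending on a leading UTF-16 BOM.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "be"
    # No BOM: the byte order must come from a label or a default.
    return None
```

A product that does not know the 'unicode' label can still use such a
check to treat the document as 'utf-16'.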
ISSUES to consider before adding support for the 'unicode'
charset in a product:
(1) Document-internal charset declarations: If the label says
'unicode' but the resource (including the BOM) is big-endian, products
(including those that support 'unicode') tend to rely on the BOM and
ignore the charset label. But note that error handling (e.g. in XML 1.0
5th edition, mislabelling, including the use of unknown labels, is a
fatal error) and encoding detection (e.g. as described in Appendix F of
XML 1.0 5th edition) could also make the charset label technically
irrelevant. Finally, internal declarations of 16-bit encodings tend to
have no effect on encoding determination in 16-bit encoded documents.
In fact, the different utf-16 labels tend to be more
effective inside 8-bit encoded documents, where they tend to be treated
like UTF-8 declarations.
(2) Document-external encoding declarations:
(a) Products that implement Microsoft's 'unicode' specifications
(in particular the Web browsers Internet Explorer and WebKit) also
tend to ignore charset information from HTTP for documents that include
the BOM, whereas current web standards (HTTP, XML 1.0, HTML 4 and HTML 5)
tend to see the charset set by the higher-level protocol as authoritative.
Thus, adding support for 'unicode' will not increase convergence
between products in those cases where the problem is disagreement about
the order of priority between the BOM and HTTP.
(b) Little-endian default: If the BOM is lacking while the
'unicode' label is present (and detected), then products that support
the 'unicode' charset will default to little-endian. (RFC 2781, by
contrast, says that 'utf-16' text without a BOM should be interpreted
as big-endian.)
(c) The 'utf-16' label: Because the 'unicode' specs define
'utf-16' as an alias for 'unicode', a lacking BOM when the 'utf-16'
label is present (and detected) will cause a little-endian default as
well. This goes against the utf-16 specification [RFC2781], which for
'utf-16' asks for a default to big-endian.
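The difference described in (b) and (c) can be sketched as follows. This
is only an illustration under stated assumptions: the policy names
"unicode-ms" and "rfc2781", the table BOMLESS_DEFAULT and the function
decode_16bit are all hypothetical, and no actual product is claimed to
be implemented this way. The sketch lets the BOM win when present and
otherwise falls back to the default byte order of the chosen policy.

```python
import codecs

# Hypothetical policy table: byte order assumed when no BOM is present.
BOMLESS_DEFAULT = {
    "unicode-ms": "utf-16-le",  # 'unicode' (codepage 1200) behaviour
    "rfc2781": "utf-16-be",     # 'utf-16' as described in RFC 2781
}

def decode_16bit(data: bytes, policy: str) -> str:
    """Decode 16-bit text, letting a BOM override the policy default."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to the policy's default byte order.
    return data.decode(BOMLESS_DEFAULT[policy])

# The same BOM-less, little-endian bytes under both policies.
raw = "hi".encode("utf-16-le")
as_ms = decode_16bit(raw, "unicode-ms")   # decoded as intended
as_rfc = decode_16bit(raw, "rfc2781")     # byte-swapped code units
```

With a BOM in place, both policies give the same result, which is one
more reason to treat the BOM as strongly recommended.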
Intended usage:
LIMITED USE. The 'unicode' label is used by several Microsoft products
(.NET, Internet Explorer and more) and by products that want to be
compatible, such as WebKit.
Person & email address to contact for further information:
Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no