Registration of new charset 'unicodeFFFE'
Charset name:
unicodeFFFE
Charset aliases:
No aliases.
Suitability for use in MIME text:
The 'unicodeFFFE' charset labels the big-endian 'subset' of
'UTF-16' and thus shares the same issue: It does 'not encode
line endings in the way required for MIME "text" media'.
[1] http://tools.ietf.org/rfc/rfc2781.txt
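To illustrate the line-ending issue: MIME "text" media expect a line
break to appear as the byte pair 0D 0A. The following is a minimal
sketch, not part of the registration, showing how the same CRLF is
serialized in big-endian UTF-16. The variable names are illustrative.

```python
# How CRLF is serialized in big-endian UTF-16 versus an
# ASCII-compatible encoding such as UTF-8.
crlf_utf16be = "\r\n".encode("utf-16-be")
crlf_utf8 = "\r\n".encode("utf-8")

# Each character occupies two bytes in UTF-16, so the byte pair
# 0D 0A never appears contiguously.
print(crlf_utf16be.hex(" "))  # 00 0d 00 0a
print(crlf_utf8.hex(" "))     # 0d 0a
```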
Published specification(s):
The 'unicodeFFFE' charset label covers 'codepage 1201':
[2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
Codepage 1201 covers a big-endian representation of
'UTF-16', including the BOM: 'Unicode UTF-16, big endian byte
order; available only to managed applications'.
[3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
The reference to 'Unicode UTF-16' is taken to mean that the
BOM MUST be present.
ISO 10646 equivalency table:
The 'unicodeFFFE' charset (codepage 1201) is the big-endian
equivalent to 'unicode' (codepage 1200), which in turn represents
'BMP of ISO 10646'.[2] Thus 'unicodeFFFE' is equivalent to the BMP.
Additional information:
The 'unicodeFFFE' charset can be understood as the big-endian
'subset' of 'UTF-16'. Thus, like 'UTF-16'-encoded resources,
'unicodeFFFE'-encoded resources include the BOM: If the resource
doesn't contain a BOM, then it isn't 'unicodeFFFE'-encoded.
Applications that generate resources carrying the 'unicodeFFFE' label
(example: <META content="text/html; charset=unicodeFFFE"
http-equiv=Content-Type>) are known to insert the BOM. When parsing,
for example, media of MIME type 'text/html', Internet Explorer is known
NOT to pick 'unicodeFFFE' (or any other of the 16-bit UTF variants)
as the encoding unless there is a BOM. (Minor exception for
'text/html': If the HTTP Content-Type: header contains 'unicodeFFFE'
in the charset parameter, then IE renders the 'text/html' resource
fine even without a BOM - but only as long as the resource isn't
loaded from cache.)
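The rule "no BOM, not 'unicodeFFFE'" can be sketched as follows. This
is a minimal illustration under the reading given above, not a
description of how any particular product behaves. The function name
decode_unicodefffe is hypothetical.

```python
import codecs


def decode_unicodefffe(data: bytes) -> str:
    """Decode bytes labelled 'unicodeFFFE' (hypothetical helper).

    Assumption: the big-endian BOM (FE FF) is mandatory, so input
    without it is rejected instead of being guessed at.
    """
    bom = codecs.BOM_UTF16_BE  # b"\xfe\xff"
    if not data.startswith(bom):
        raise ValueError("no big-endian BOM: not 'unicodeFFFE'-encoded")
    # Strip the BOM, then decode the rest as big-endian UTF-16.
    return data[len(bom):].decode("utf-16-be")
```

For example, decode_unicodefffe(b"\xfe\xff\x00A") returns "A", while
the same call on b"\x00A" raises ValueError.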
NB! Alias: At the time of this registration, the spec upon which
the registrations of the 'unicodeFFFE' and 'unicode' charsets are
based defines 'utf-16' (lowercase) as an alias for 'unicode'.[2]
This is incompatible with the registered semantics of (uppercase)
'UTF-16' (RFC2781) as it causes implementations - such as Internet
Explorer (IE) - to interpret 'utf-16' (irrespective of case) to mean
'little-endian'. Usually the BOM solves the problem, because a BOM
takes precedence (the BOM is a MUST for all of 'unicode',
'unicodeFFFE' and 'UTF-16'). But otherwise, unless implementations
adhere to the 'unicode' registration and thus reject 'utf-16' as an
alias for 'unicode', big-endian MIME text resources that are labelled
as 'UTF-16' risk being mis-rendered (causing 'mojibake').
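The mis-rendering risk can be reproduced with a short sketch. It
assumes a BOM-less big-endian payload and a consumer that treats the
label as little-endian. The sample text and variable names are
illustrative only.

```python
import codecs

text = "Hi"
payload = text.encode("utf-16-be")  # b"\x00H\x00i", no BOM

# A consumer that takes 'utf-16' to mean little-endian swaps the
# bytes of every code unit, which yields unrelated characters.
misread = payload.decode("utf-16-le")

# With a BOM in front, a BOM-aware decoder picks the right byte order.
with_bom = codecs.BOM_UTF16_BE + payload
correct = with_bom.decode("utf-16")

print(repr(misread))  # '䠀椀' (U+4800, U+6900)
print(repr(correct))  # 'Hi'
```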
Intended usage:
LIMITED USE. It is used by a large community of Microsoft product
users, but is also supported, across different platforms, by products
that want to be compatible. 'Compatible' here refers, for example, to
tools such as editors that need to determine the encoding or to advise
on the best charset label. In that regard: Any resource that can be
validly labelled as 'unicodeFFFE' could also validly (and probably
ought to) be labelled as 'UTF-16'. Another example is the encoding
sniffing algorithm of HTML5, which in certain circumstances requires a
charset label whose value is 'a UTF-16 encoding' (such as
'unicodeFFFE') to be interpreted as if its value were 'UTF-8'.
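The HTML5 behaviour mentioned above can be sketched roughly as below.
This is a simplified illustration, not the sniffing algorithm itself:
the function name effective_meta_charset and the label set UTF16_LABELS
are assumptions made for the example, and the "certain circumstances"
under which the rule applies are not modelled.

```python
# Assumed, illustrative set of labels that denote 'a UTF-16 encoding'.
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be", "unicode", "unicodefffe"}


def effective_meta_charset(label: str) -> str:
    """Return the charset to use for a declared label (hypothetical).

    A label naming a UTF-16 encoding is treated as if it were 'utf-8';
    any other label is returned in normalized (lowercase) form.
    """
    normalized = label.strip().lower()
    if normalized in UTF16_LABELS:
        return "utf-8"
    return normalized
```

For example, effective_meta_charset("unicodeFFFE") returns "utf-8".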
Person & email address to contact for further information:
Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no