Re: Registration of new charset 'unicode'
Martin, thanks for the comments. Here is a new try. Leif H. S.
Charset name:
'unicode'
Hereafter referred to as MS 'unicode'.
Charset aliases:
none
NOTE: The published specification mentions 'utf-16' as an alias,
but this contradicts the registration of 'utf-16'. However,
this does allow us to recommend sending as 'utf-16' rather
than sending as the LIMITED USE MS 'unicode' charset.
See Additional Notes.
Suitability for use in MIME text:
Not suitable.
Reason: The MS 'unicode' charset has the same MIME text media issue
as 'utf-16'; see http://tools.ietf.org/rfc/rfc2781.txt
Published specification(s):
The document 'Character Set Recognition'
http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
together with the document 'Code Page Identifiers'.
http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
ISO 10646 equivalency table:
See the published specifications.
Additional information:
* The label:
The MS 'unicode' label can be seen as representing the little-endian
'half' - or variant - of 'utf-16'. Unlike 'utf-16', which is a
suitable label for both the little-endian and the big-endian variant,
MS 'unicode' can no longer be used if the encoding is switched from
little-endian to big-endian. This makes MS 'unicode' more complicated
to use than 'utf-16'.
* When to send the MS 'unicode' label:
1. First determine whether it is advisable to send any charset
label at all. For example, as of HTML5, conforming utf-16 encoded
HTML documents are not allowed to contain a document-internal
encoding declaration. Also, when the BOM is included, the encoding is
self-describing to receivers, and XML then does not require any charset
label for the particular encoding that MS 'unicode' represents. Note as
well that even if there is a BOM, XML parsers are nevertheless
obligated to emit a fatal error if the charset label is unknown to the
parser. They must also emit a fatal error if the charset is known but
differs from the actual encoding. (Thus, a fatal error would be due if
MS 'unicode' labelled a big-endian file.) In other applications, an
unknown charset might trigger charset defaulting or encoding guessing,
which might make no difference from not sending any label at all.
2. When a label is wanted, the general rule should be to not send
the MS 'unicode' label, but to instead include the BOM and send the
resource as 'utf-16' - see note under Charset aliases above. By doing
this one complies with both the MS 'unicode' published specification
and the 'utf-16' standard.
3. If one sends the MS 'unicode' label anyway, then one should be
sure to include the BOM, as this increases the chance that consumers
might handle it even if the label is unknown.
4. If one sends the label without the accompanying BOM, and if
the document is little-endian, as it should be according to the
published specifications of MS 'unicode', then note that per the
'utf-16' registration, products are required to default to big-endian.
I.e. this is not advisable.
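The sending advice in points 2 and 4 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names encode_for_sending and decode_as_utf16_receiver are hypothetical, and the receiver models only the BOM check and the big-endian default described for 'utf-16', not any particular product.

```python
import codecs

def encode_for_sending(text):
    """Point 2: include the BOM, encode little-endian, label as 'utf-16'."""
    payload = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return "utf-16", payload

def decode_as_utf16_receiver(payload):
    """Hypothetical receiver for a 'utf-16' label: honour a BOM if
    present, otherwise fall back to big-endian (assumed default)."""
    if payload.startswith(codecs.BOM_UTF16_LE):
        return payload[2:].decode("utf-16-le")
    if payload.startswith(codecs.BOM_UTF16_BE):
        return payload[2:].decode("utf-16-be")
    return payload.decode("utf-16-be")

label, payload = encode_for_sending("Hi")
with_bom = decode_as_utf16_receiver(payload)

# Point 4: little-endian bytes without a BOM get read as big-endian,
# so every code unit comes out byte-swapped.
without_bom = decode_as_utf16_receiver("Hi".encode("utf-16-le"))
```

Here with_bom round-trips to the original text, while without_bom does not, which is why sending little-endian data without the BOM is not advisable.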
* Receiving the MS 'unicode' label:
(a) The MS 'unicode' label should be treated like 'utf-16', which
means that receivers should expect to see and interpret the BOM. In XML
applications, parsers must check that they know the MS 'unicode'
charset label, and if they don't, they must emit a fatal error. Also, if
they know it but the actual encoding does not comply with the label
(e.g. because the encoding is big-endian), then they must emit a fatal
error as well.
(b) If the receiver knows the MS 'unicode' label and the label is
seen in a resource that lacks the BOM, then the receiver should treat
the label as equivalent to 'utf-16le'.
(c) If the MS 'unicode' label is seen in an 8-bit encoded
resource, then products should treat it as they would have treated a
'utf-16' label in the same context. E.g. the HTML5 parser in that
context requires the 'utf-16' label to be replaced with the 'utf-8'
label.
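Rules (a) to (c) can be sketched as a small Python decision function. This is a rough sketch, not a reference implementation: the names resolve_ms_unicode and decode_ms_unicode and the flags in_8bit_resource and xml_strict are hypothetical, and the caller is assumed to already know whether the label was found inside an 8-bit encoded resource.

```python
import codecs

def resolve_ms_unicode(data, in_8bit_resource=False, xml_strict=False):
    """Return (codec name, BOM length) for data labelled MS 'unicode'."""
    if in_8bit_resource:
        # (c) Treat like 'utf-16' in the same context; HTML5-style
        # handling replaces the label with 'utf-8'.
        return "utf-8", 0
    if data.startswith(codecs.BOM_UTF16_LE):
        # (a) The BOM is present and agrees with the label.
        return "utf-16-le", 2
    if data.startswith(codecs.BOM_UTF16_BE):
        # (a) The BOM contradicts the little-endian label. An XML-style
        # parser must fail; other receivers may simply follow the BOM.
        if xml_strict:
            raise ValueError("MS 'unicode' label on big-endian data")
        return "utf-16-be", 2
    # (b) No BOM: treat the label as equivalent to 'utf-16le'.
    return "utf-16-le", 0

def decode_ms_unicode(data, **flags):
    codec, skip = resolve_ms_unicode(data, **flags)
    return data[skip:].decode(codec)
```

For example, decode_ms_unicode(b"H\x00i\x00") falls under rule (b) and is decoded as little-endian, while the same call on big-endian data with a BOM and xml_strict=True raises an error, mirroring the fatal error required of XML parsers.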
* Utility of MS 'unicode' in face of receivers that disagree about
priority and other encoding interpretation details:
Products that support MS 'unicode' do, at this time, tend to
disagree with competing products about the order of priority between
the BOM and HTTP. Simply adding MS 'unicode' support without aligning
the priorities would in these cases not increase the convergence with
the competing products. A similar example: Some products might prefer
sniffing the encoding rather than reading labels.
Intended usage:
LIMITED USE.
The MS 'unicode' label is added - and interpreted - by several
Microsoft products (.NET, Internet Explorer, Microsoft Office and more)
as well as by competitors, such as WebKit.
Person & email address to contact for further information:
Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no