
Re: Registration of new charset 'unicode'



Martin, thanks for the comments. Here is a new try. Leif H. S.

Charset name:
     'unicode' 

      Hereafter referred to as MS 'unicode'.

Charset aliases:
      none

      NOTE: The published specification mentions 'utf-16' as an alias,
            but this contradicts the registration of 'utf-16'. However,
            it does allow us to recommend sending as 'utf-16' rather
            than as the LIMITED USE MS 'unicode' charset.
            See Additional information.

Suitability for use in MIME text:
      Not suitable.

      Reason: The MS 'unicode' charset has the same MIME text media
      issue as 'utf-16'; see http://tools.ietf.org/rfc/rfc2781.txt

Published specification(s):
      The document 'Character Set Recognition'
      http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
      together with the document 'Code Page Identifiers'.
      http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx

ISO 10646 equivalency table:
      See the published specifications.

Additional information:

    * The label:
      The MS 'unicode' label can be seen as representing the
little-endian 'half' - or variant - of 'utf-16'. Unlike 'utf-16', which
is a suitable label for both the little-endian and the big-endian
variant, MS 'unicode' can no longer be used if the encoding is switched
from little-endian to big-endian. This makes MS 'unicode' more
complicated to use than 'utf-16'.

    * When to send the MS 'unicode' label: 
      1. First determine whether it is advisable to send any charset
label at all. For example, starting with HTML5, conforming utf-16
encoded HTML documents are not allowed to contain a document-internal
encoding declaration. Also, when the BOM is included, the charset is
self-describing to receivers, and XML then does not require any charset
label for the particular encoding that MS 'unicode' represents. Note as
well that even if there is a BOM, XML parsers are nevertheless
obligated to emit a fatal error if the charset label is unknown to the
parser. They must also emit a fatal error if the charset is known but
differs from the actual encoding. (Thus, a fatal error would be due if
MS 'unicode' labelled a big-endian file.) In other applications, an
unknown charset might trigger charset defaulting or encoding guessing,
which might make no difference from not sending any label at all.
      2. When a label is wanted, the general rule should be to not send
the MS 'unicode' label, but to instead include the BOM and send it as
'utf-16' - see the note under Charset aliases above. By doing this one
complies with both the MS 'unicode' published specification and the
'utf-16' standard.
      3. If one sends the MS 'unicode' label anyway, then one should be 
sure to include the BOM, as this increases the chance that consumers 
might handle it even if the label is unknown.
      4. If one sends the label without the accompanying BOM, and if
the document is little-endian, as it should be according to the
published specifications of MS 'unicode', then note that per the
'utf-16' registration, products are required to default to big-endian.
Sending the label without the BOM is therefore not advisable.
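
      As an illustration only, the recommendation in points 2 and 3
could be sketched in Python as below. The function name
encode_for_sending is hypothetical and not taken from any of the
published specifications; the sketch simply writes little-endian data
with an explicit BOM and labels it 'utf-16'.

```python
import codecs
from typing import Tuple


def encode_for_sending(text: str) -> Tuple[str, bytes]:
    """Return (charset label, payload) for a little-endian sender.

    Hypothetical helper: it prefers the registered 'utf-16' label over
    MS 'unicode' and always prepends the BOM.
    """
    # Little-endian UTF-16 with an explicit BOM. This is the byte layout
    # that MS 'unicode' describes, and it is also valid 'utf-16'.
    payload = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return "utf-16", payload


if __name__ == "__main__":
    label, payload = encode_for_sending("hi")
    print(label, payload)
```

      Because the BOM is present, a receiver that only knows 'utf-16'
can still determine the byte order from the data itself.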

    * Receiving the MS 'unicode' label:
      (a) The MS 'unicode' label should be treated like 'utf-16', which
means that receivers should expect to see and interpret the BOM. In XML
applications, parsers must check that they know the MS 'unicode'
charset label, and if they don't, they must emit a fatal error. Also,
if they know it but the actual encoding does not comply with the label
- e.g. because the encoding is big-endian - then they must emit a fatal
error as well.
      (b) If the receiver knows the MS 'unicode' label and the label is
seen in a resource that lacks the BOM, then receivers should treat the
label as equivalent to 'utf-16le'.
      (c) If the MS 'unicode' label is seen in an 8-bit encoded
resource, then products should treat it as they would have treated a
'utf-16' label in the same context. E.g. the HTML5 parser in that
context requires the 'utf-16' label to be replaced with the 'utf-8'
label.
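
      A rough Python sketch of rules (a) and (b), under stated
assumptions: the names decode_ms_unicode, EncodingMismatch and the
xml_mode flag are hypothetical, and rule (c) is left out because
detecting an 8-bit encoded resource depends on the application.

```python
import codecs


class EncodingMismatch(Exception):
    """Hypothetical error: the label contradicts the actual byte order."""


def decode_ms_unicode(data: bytes, xml_mode: bool = False) -> str:
    """Decode data that was labelled with MS 'unicode'."""
    if data.startswith(codecs.BOM_UTF16_LE):
        # Rule (a): interpret the BOM; little-endian matches the label.
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        if xml_mode:
            # Rule (a), XML case: label and encoding disagree.
            raise EncodingMismatch("MS 'unicode' label on big-endian data")
        # Outside XML, treat the label like 'utf-16' and follow the BOM.
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Rule (b): no BOM, so treat the label as equivalent to 'utf-16le'.
    return data.decode("utf-16-le")


if __name__ == "__main__":
    print(decode_ms_unicode(b"\xff\xfeh\x00i\x00"))
    print(decode_ms_unicode(b"h\x00i\x00"))
```

      The xml_mode flag only models the stricter XML requirement; a
real XML parser would report a fatal error through its own mechanism.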

    * Utility of MS 'unicode' in face of receivers that disagree about 
priority and other encoding interpretation details:
      Products that support MS 'unicode' do, at this time, tend to
disagree with competing products about the order of priority with
regard to BOM and HTTP. Simply adding MS 'unicode' support without
aligning the priorities would in these cases not increase the
convergence with the competing products. A similar example: Some
products might prefer sniffing the encoding rather than reading labels.
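
      To make the priority problem concrete, here is a small Python
sketch. It models no particular product: the names sniff_bom and
pick_encoding and the bom_wins flag are hypothetical. It only shows
that two receivers which both know the label can still disagree when
the HTTP label and the BOM point in different directions.

```python
import codecs
from typing import Optional

# BOMs this sketch recognises, mapped to Python codec names.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, if any."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def pick_encoding(http_label: Optional[str], data: bytes,
                  bom_wins: bool) -> Optional[str]:
    """Choose between the HTTP label and the BOM by a given priority."""
    from_bom = sniff_bom(data)
    if bom_wins:
        return from_bom or http_label
    return http_label or from_bom


if __name__ == "__main__":
    # A UTF-8 resource with a BOM, mislabelled as MS 'unicode' over HTTP.
    data = codecs.BOM_UTF8 + "hi".encode("utf-8")
    print(pick_encoding("unicode", data, bom_wins=True))
    print(pick_encoding("unicode", data, bom_wins=False))
```

      With bom_wins=True the receiver settles on utf-8, with
bom_wins=False it keeps the 'unicode' label, so the same resource is
handled differently even though both receivers support the label.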

Intended usage:
      LIMITED USE.

      MS 'unicode' is added - and interpreted - by several Microsoft
products (.NET, Internet Explorer, Microsoft Office and more) as well
as by competitors, such as WebKit.

Person & email address to contact for further information:
      Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no