
How to register 'unicode'/'unicodeFFFE' ?



Hi! I am ready to submit - and have prepared - two registrations, for 
the 'unicode' and the 'unicodeFFFE' charsets. The two charsets are 
variants of 'UTF-16', and they differ from each other only with regard 
to their endianness. Each charset includes the BOM. The registrations 
are based on Microsoft's specifications:

http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx

The purpose of the registrations would be to document 'existing 
practice in a large community', and they should thus "be explicitly marked as 
being of limited or specialized use and should only be used in Internet 
messages with prior bilateral agreement".

http://tools.ietf.org/html/rfc2978#section-2.5

In an ideal world, 'unicode'/'unicodeFFFE' would not be necessary to 
register: We have 'UTF-16', for which the endianness can be signalled 
via the BOM. Thus one can switch the endianness freely, without having 
to relabel. For the 'unicode' and 'unicodeFFFE' charsets, by contrast, 
if one changes the endianness, then one must also switch to the other 
(or: another) charset label. This is a major reason not to use these 
charsets.
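
As a small illustration of the BOM signalling described above, here is 
a sketch in Python. Python's 'utf-16' codec is used only as a stand-in 
for a BOM-aware 'UTF-16' decoder; the sample text and variable names are 
made up for the example.

```python
import codecs

text = "Hi"

# The same text in both byte orders, each prefixed with its BOM.
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# One label, 'utf-16', covers both: the decoder reads the BOM,
# picks the byte order from it, and strips it from the result.
decoded_le = little_endian.decode("utf-16")
decoded_be = big_endian.decode("utf-16")

print(decoded_le == decoded_be == text)
```

With 'unicode'/'unicodeFFFE', the label itself would have to change 
whenever the byte order does.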

So far so good: Because both charsets include the BOM, the BOM takes 
precedence - in particular if the name of the label is not supported by 
the implementation. Opera and Firefox are in that category, for 
example. And even Microsoft seems to be in that league, as e.g. IE has 
no problems handling a little-endian file which includes the 
'unicodeFFFE' label, as long as the document *also* contains the BOM.

However, Microsoft's spec includes one additional detail which is not 
only impractical but also dangerous: 'utf-16' (formally in lowercase) 
is seen by the Microsoft spec as an alias for 'unicode' - the 
little-endian charset variant. This is of course incompatible with the 
'UTF-16' charset, and so the registrations I have prepared reject this 
detail. However, for applications that implement the current Microsoft 
specification (such as IE), this nevertheless means that if your 
'text/html' document is big-endian, but without the BOM, and if you 
then send 'UTF-16' via the HTTP Content-Type: charset parameter, then 
you can be certain that IE treats the document as little-endian, with 
'mojibake' as the result.
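
To make the failure concrete, here is a minimal sketch in Python of 
what happens when BOM-less big-endian bytes are read as little-endian. 
It only models the byte-order mix-up itself, not IE's actual code path; 
the sample text is made up.

```python
text = "Hi"

# Big-endian UTF-16 without a BOM: bytes 00 48 00 69.
raw_big_endian = text.encode("utf-16-be")

# A decoder that assumes little-endian pairs the bytes the other
# way round: 0x4800 and 0x6900 instead of 0x0048 and 0x0069.
misread = raw_big_endian.decode("utf-16-le")

print([hex(ord(ch)) for ch in misread])  # ['0x4800', '0x6900']
```

Both code points are valid characters (CJK ideographs), so no decoding 
error is raised; the reader simply sees the wrong text.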

  (You probably would not like to send 'UTF-16' via HTTP Content-Type, 
though - except as a 'back-up' solution in addition to the BOM, because 
IE does not seem to cache encoding information sent this way. And so 
your document would be misinterpreted if you used the back button. For 
XML, by contrast, the situation seems better than for 'text/html' 
- perhaps because XML defaults to either UTF-8 or UTF-16.)

As I pondered over this, I first considered that 'unicode' and 
'unicodeFFFE' had to be registered as aliases for 'UTF-16'. However, 
the fact that each charset supports only one 'subset' of UTF-16 (which 
is a single charset/encoding), meant that it had to be two charsets, if 
the Microsoft reality is taken as basis.

That said: We should support reality, and not Microsoft reality. And in 
that regard: Because both charsets include the BOM, it is simple to 
treat them as aliases for 'UTF-16' - it is only when you create an 
invalid UTF-16 encoding (that is: you omit the BOM) that legacy IE 
risks acting up. (IE always considers the BOM before anything else - even 
before HTTP Content-Type, it seems.)
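
The 'BOM first, label second' behaviour can be sketched as follows. 
This is a hypothetical decoder, not any browser's implementation: the 
function name and the fallback table are invented for the example, and 
the table's label-to-byte-order mapping simply follows the Microsoft 
definitions as described in this message ('unicode' little-endian, 
'unicodeFFFE' big-endian).

```python
import codecs

# Assumed mapping, used only when the data carries no BOM.
LABEL_FALLBACK = {
    "unicode": "utf-16-le",
    "unicodefffe": "utf-16-be",
}

def decode_utf16(data: bytes, label: str) -> str:
    """Decode UTF-16 data, letting the BOM override the label."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: the label is the only remaining hint.
    return data.decode(LABEL_FALLBACK[label.lower()])

# A little-endian document with a BOM but the "wrong" label.
mislabeled = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
with_bom = decode_utf16(mislabeled, "unicodeFFFE")

# Without a BOM, the label decides the byte order.
without_bom = decode_utf16("Hi".encode("utf-16-be"), "unicodeFFFE")

print(with_bom, without_bom)
```

As long as the BOM is present, a decoder of this shape never consults 
the label, which is why treating both labels as aliases is harmless for 
valid input.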

So there are actually two possibilities here: EITHER to update the 
'UTF-16' registration to also cover 'unicode' and 'unicodeFFFE' - then 
we would also pretty much automatically discourage their use as there 
would be a clear recommendation in place to use the preferred name -  
'UTF-16' - instead. OR, the other option: To register them as two 
separate charsets.

Making the two labels into aliases of 'UTF-16' would - formally -  give 
them a more prominent status than registering them independently for 
'limited use'. To register them as aliases, would be to *not* base them 
on 'Microsoft reality'. Such a thing could perhaps make Microsoft align 
itself more with 'UTF-16' as it is registered? Another problem with 
registering them as independent charsets would be that it would be 
less clear how non-Microsoft products should handle them. Does anyone 
know if IE10 is behaving any differently w.r.t. UTF-16? Is there a 
direction towards the standard?

To update the UTF-16 registration seems simple - only a matter of 
adding the aliases: 
<http://www.iana.org/assignments/charset-reg/UTF-16>. So I have started 
to, again, consider that the best option.

I had planned to send the registrations now, but I would like to gather 
some responses first. However, if the expert reviewers would like, I 
could post the registrations that I have prepared ASAP - often it is 
better to have something concrete to look at. (That being said, I have 
covered very many of the issues in this message ...)

With regards,
Leif H Silli