How to register 'unicode'/'unicodeFFFE' ?
Hi! I am ready to submit - and have prepared - two registrations for
the 'unicode' and the 'unicodeFFFE' charset. The two charsets are
variants of 'UTF-16', and they differ from each other only with regard
to their endianness. Each charset includes the BOM. The registrations
are based on Microsoft's specifications:
http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
The purpose of the registrations would be to document 'existing
practice in a large community', and they should thus "be explicitly marked as
being of limited or specialized use and should only be used in Internet
messages with prior bilateral agreement".
http://tools.ietf.org/html/rfc2978#section-2.5
In an ideal world, 'unicode'/'unicodeFFFE' would not be necessary to
register: We have 'UTF-16', for which the endianness can be signalled
via the BOM. Thus one can switch the endianness freely, without having
to relabel. For the 'unicode' and 'unicodeFFFE' charsets, by contrast,
if one changes the endianness, then one must also switch to the other
(or: another) charset label. This is a major reason not to use these
charsets.
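To illustrate how the BOM lets 'UTF-16' switch endianness without
relabelling, here is a minimal Python sketch. The helper name
sniff_utf16_bom and the sample text are my own, for illustration only;
the BOM constants come from Python's standard codecs module.

```python
import codecs

def sniff_utf16_bom(data: bytes):
    """Return the byte order signalled by a leading UTF-16 BOM, or None."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    return None                                # no BOM present

# The same text, serialised in both byte orders, each with its BOM.
big = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

# One label ('utf-16') covers both: the decoder reads the BOM and strips it.
assert big.decode("utf-16") == "Hi"
assert little.decode("utf-16") == "Hi"
```

With 'unicode'/'unicodeFFFE', by contrast, the label itself would have
to change whenever the byte order does.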
So far so good: Because both charsets include the BOM, the BOM takes
precedence - in particular if the name of the label is not supported by
the implementation. Opera and Firefox are in that category, for
example. And even Microsoft seems to be in that league, as e.g. IE has
no problems handling a little-endian file which includes the
'unicodeFFFE' label, as long as the document *also* contains the BOM.
However, Microsoft's spec includes one additional detail which is not
only impractical but also dangerous: 'utf-16' (formally in lowercase)
is seen by the Microsoft spec as an alias for 'unicode' - the
little-endian charset variant. This is of course incompatible with the
'UTF-16' charset, and so the registrations I have prepared reject this
detail. However, for applications that implement the current Microsoft
specification (such as IE), this nevertheless means that if your
'text/html' document is big-endian, but without the BOM, and if you
then send 'UTF-16' via the HTTP Content-Type: charset parameter, then
you can be certain that IE treats the document as little-endian, with
'mojibake' as result.
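The failure mode can be reproduced outside a browser. This is only a
sketch of the byte-order mix-up, not of IE's actual code path, and the
sample string is an arbitrary choice of mine:

```python
# Big-endian UTF-16 without a BOM, as a server might send it.
payload = "Hi".encode("utf-16-be")          # bytes 00 48 00 69

# A consumer that assumes little-endian pairs the bytes the wrong way round.
misread = payload.decode("utf-16-le")       # reads 0x4800 and 0x6900

# The intended reading, for comparison.
intended = payload.decode("utf-16-be")

assert intended == "Hi"
assert misread == "\u4800\u6900"            # two CJK ideographs: mojibake
```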
(You probably would not like to send 'UTF-16' via HTTP Content-Type,
though - except as a 'back-up' solution in addition to the BOM, because
IE does not seem to cache encoding information sent this way. And so
your document would be misinterpreted if you used the back button. For
XML, by contrast, the situation seems better than for 'text/html'
- perhaps because XML defaults to either UTF-8 or UTF-16.)
As I pondered over this, I first considered that 'unicode' and
'unicodeFFFE' had to be registered as aliases for 'UTF-16'. However,
the fact that each charset supports only one 'subset' of UTF-16 (which
is a single charset/encoding), meant that it had to be two charsets, if
the Microsoft reality is taken as basis.
That said: We should support reality, and not Microsoft reality. And in
that regard: Because both charsets include the BOM, it is simple to
treat them as aliases for 'UTF-16' - it is only when you create an
invalid UTF-16 encoding (that is: you omit the BOM) that legacy IE
risks acting up. (IE always considers the BOM before anything else - even
before HTTP Content-Type, it seems.)
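The "BOM first, label second" behaviour could be sketched roughly as
follows. This is an assumption-laden illustration, not any browser's
real algorithm: the function names, the label table and the
bomless_utf16 parameter are all hypothetical.

```python
import codecs

# Hypothetical label table: the two Microsoft labels as described above.
LABEL_TO_CODEC = {
    "unicode": "utf-16-le",      # little-endian variant
    "unicodefffe": "utf-16-be",  # big-endian variant
}

def pick_codec(data: bytes, label: str, bomless_utf16: str = "utf-16-be"):
    """Return (codec, bom_length); a BOM overrides whatever the label says."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", 2
    key = label.lower()
    if key == "utf-16":
        # No BOM: the byte order is a policy choice. RFC 2781 says such
        # text SHOULD be read as big-endian; the Microsoft spec discussed
        # above would correspond to passing "utf-16-le" here.
        return bomless_utf16, 0
    return LABEL_TO_CODEC[key], 0

def decode(data: bytes, label: str, bomless_utf16: str = "utf-16-be") -> str:
    codec, skip = pick_codec(data, label, bomless_utf16)
    return data[skip:].decode(codec)

# A little-endian file with BOM but the "wrong" label still decodes fine.
little = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
assert decode(little, "unicodeFFFE") == "Hi"
```

Only the BOM-less case depends on the label, which is where the two
readings of 'utf-16' diverge.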
So there are actually two possibilities here: EITHER to update the
'UTF-16' registration to also cover 'unicode' and 'unicodeFFFE' - then
we would also pretty much automatically discourage their use as there
would be a clear recommendation in place to use the preferred name -
'UTF-16' - instead. OR, the other option: To register them as two
separate charsets.
Making the two labels into aliases of 'UTF-16' would - formally - give
them a more prominent status than registering them independently for
'limited use'. To register them as aliases, would be to *not* base them
on 'Microsoft reality'. Such a thing could perhaps make Microsoft align
itself more with 'UTF-16' as it is registered? Another problem with
registering them as independent charsets would be that it would be
less clear how non-Microsoft products should handle them. Does anyone
know if IE10 is behaving any differently w.r.t. UTF-16? Is there a
direction towards the standard?
To update the UTF-16 registration seems simple - only a matter of
adding the aliases:
<http://www.iana.org/assignments/charset-reg/UTF-16>. So I have started
to, again, consider that the best option.
I had planned to send the registrations now, but I would like to gather
some responses first. However, if the expert reviewers would like, I
could post the registrations that I have prepared ASAP - often it is
better to have something concrete to look at. (That being said, I have
covered very many of the issues in this message ...)
With regards,
Leif H Silli