
Re: How to register 'unicode'/'unicodeFFFE' ?



Hello Leif,

I haven't had enough time to look at the stuff below in detail. It looks 
like a big can of worms :-(. But we have done other registrations in the 
"HTML legacy murkiness" space recently, so let's give this a try and see 
how far we get.

While it is good to have somebody like you championing the registration 
effort, in this case it would be very helpful to get some input from 
Microsoft (I have cc'ed Shawn) and the Unicode Consortium (because of 
the name 'unicode', I cc'ed Patrick who is the official IETF liaison to 
the Unicode Consortium). I can help with that. Input from Paul and 
François (also cc'ed), authors of http://tools.ietf.org/html/rfc2781, 
may also be valuable.

And yes, if you have some registration templates, please don't hesitate 
to send them, it always helps to have something concrete to look at. But 
personally, I'm rather skeptical about adding aliases with murky 
variations to a registration that went through quite a few drafts and 
then became an RFC.

Regards,    Martin.

On 2011/12/15 15:53, Leif Halvard Silli wrote:
> Hi! I am ready to submit - and have prepared - two registrations for
> the 'unicode' and the 'unicodeFFFE' charsets. The two charsets are
> variants of 'UTF-16', and they differ from each other only with regard
> to their endianness. Each charset includes the BOM. The registrations
> are based on Microsoft's specifications:
>
> http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
> http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
>
> The purpose of the registrations would be to document "existing
> practice in a large community", and they should thus "be explicitly
> marked as being of limited or specialized use and should only be used
> in Internet messages with prior bilateral agreement".
>
> http://tools.ietf.org/html/rfc2978#section-2.5
>
> In an ideal world, 'unicode'/'unicodeFFFE' would not need to be
> registered: We have 'UTF-16', for which the endianness can be signalled
> via the BOM. Thus one can switch the endianness freely, without having
> to relabel. For the 'unicode' and 'unicodeFFFE' charsets, by contrast,
> if one changes the endianness, then one must also switch to the other
> (or: another) charset label. This is a major reason not to use these
> charsets.
>
> So far so good: Because both charsets include the BOM, the BOM takes
> precedence - in particular if the name of the label is not supported by
> the implementation. Opera and Firefox are in that category, for
> example. And even Microsoft seems to be in that league, as e.g. IE has
> no problems handling a little-endian file which includes the
> 'unicodeFFFE' label, as long as the document *also* contains the BOM.
>
> However, Microsoft's spec includes one additional detail which is not
> only impractical but also dangerous: 'utf-16' (formally in lowercase)
> is seen by the Microsoft spec as an alias for 'unicode' - the
> little-endian charset variant. This is of course incompatible with the
> 'UTF-16' charset, and so the registrations I have prepared reject this
> detail. However, for applications that implement the current Microsoft
> specification (such as IE), this nevertheless means that if your
> 'text/html' document is big-endian, but without the BOM, and if you
> then send 'UTF-16' via the HTTP Content-Type: charset parameter, then
> you can be certain that IE treats the document as little-endian, with
> 'mojibake' as the result.
>
>    (You probably would not like to send 'UTF-16' via HTTP Content-Type,
> though - except as a 'back-up' solution in addition to the BOM, because
> IE does not seem to cache encoding information sent this way. And so
> your document would be misinterpreted if you used the back button. For
> XML, by contrast, the situation seems better than for 'text/html'
> - perhaps because XML defaults to either UTF-8 or UTF-16.)
>
> As I pondered over this, I first considered that 'unicode' and
> 'unicodeFFFE' had to be registered as aliases for 'UTF-16'. However,
> the fact that each charset supports only one 'subset' of UTF-16 (which
> is a single charset/encoding), meant that it had to be two charsets, if
> the Microsoft reality is taken as basis.
>
> That said: We should support reality, and not Microsoft reality. And in
> that regard: Because both charsets include the BOM, it is simple to
> treat them as aliases for 'UTF-16' - it is only when you create an
> invalid UTF-16 encoding (that is: you omit the BOM) that legacy IE
> risks acting up. (IE always considers the BOM before anything else -
> even before HTTP Content-Type, it seems.)
>
> So there are actually two possibilities here: EITHER to update the
> 'UTF-16' registration to also cover 'unicode' and 'unicodeFFFE' - then
> we would also pretty much automatically discourage their use as there
> would be a clear recommendation in place to use the preferred name -
> 'UTF-16' - instead. OR, the other option: To register them as two
> separate charsets.
>
> Making the two labels into aliases of 'UTF-16' would - formally - give
> them a more prominent status than registering them independently for
> 'limited use'. To register them as aliases would be to *not* base them
> on 'Microsoft reality'. Such a thing could perhaps make Microsoft align
> itself more with 'UTF-16' as it is registered? Another problem with
> registering them as independent charsets would be that it would be
> less clear how non-Microsoft products should handle them. Does anyone
> know if IE10 is behaving any differently w.r.t. UTF-16? Is there a
> direction towards the standard?
>
> To update the UTF-16 registration seems simple - only a matter of
> adding the aliases:
> <http://www.iana.org/assignments/charset-reg/UTF-16>. So I have started
> to, again, consider that the best option.
>
> I had planned to send the registrations now, but I would like to gather
> some responses first. However, if the expert reviewers would like, I
> could post the registrations that I have prepared ASAP - often it is
> better to have something concrete to look at. (That being said, I have
> covered very many of the issues in this message ...)
>
> With regards,
> Leif H Silli
>
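
For illustration, the "BOM first, label second" handling described in
the quoted message could be sketched in Python roughly as follows. This
is only a sketch under stated assumptions: the function name
decode_utf16 and the LABEL_FALLBACK table are hypothetical, the table
follows the "aliases of UTF-16" option, and the big-endian fallback for
unmarked 'UTF-16' is the RFC 2781 rule, not what IE is reported to do.

```python
import codecs

# Hypothetical label table, for illustration only. The byte order listed
# here is used only when the data carries no BOM.
LABEL_FALLBACK = {
    "utf-16": "utf-16-be",       # RFC 2781: assume big-endian without a BOM
    "unicode": "utf-16-le",      # little-endian, per the thread
    "unicodefffe": "utf-16-be",  # big-endian, per the thread
}


def decode_utf16(data: bytes, label: str) -> str:
    """Decode UTF-16 bytes, letting a BOM override the declared label."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to the byte order implied by the label.
    # An unknown label raises KeyError in this sketch.
    return data.decode(LABEL_FALLBACK[label.lower()])


# A little-endian document with a BOM decodes correctly even when it is
# mislabelled as 'unicodeFFFE', because the BOM takes precedence.
le_with_bom = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
print(decode_utf16(le_with_bom, "unicodeFFFE"))

# A big-endian document without a BOM, read as little-endian (the
# behaviour the thread attributes to IE for the 'utf-16' label),
# comes out as unrelated characters.
be_no_bom = "Hi".encode("utf-16-be")
print(ascii(be_no_bom.decode("utf-16-le")))
```

With a table like this, the label only matters for data that lacks a
BOM, which is exactly the case where the Microsoft interpretation of
'utf-16' and the registered 'UTF-16' disagree.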