Re: How to register 'unicode'/'unicodeFFFE' ?
Hi Martin,
Yes, I agree that it would be good to hear from those you contacted,
especially Microsoft.
I sent the two registrations that I had prepared - see separate letters.
It sounded as if you would like to see new charset registrations rather
than new aliases for 'UTF-16'. I have gone back and forth on this, but
I am now, as I have been for a while, in favour of separate
registrations. A good reason to have separate registrations is to
emphasize and clarify how Microsoft and/or Internet Explorer have
implemented the 16-bit UTF encodings, which does not seem to be well
understood. Two new, independent registrations could clarify this,
whereas an alias solution would require the legacy information to be
stuffed somewhere else, too.
Btw, w.r.t. Anne's table,
<http://wiki.whatwg.org/wiki/Web_Encodings#Encodings_3>, my conclusion
is different from his: I may misunderstand the table, but 'unicodeFFFE'
is not the same as UTF-16BE, and 'unicode' is not the same as 'UTF-16'.
The fact is that Internet Explorer, AFAICT, does not support 'UTF-16'
(the 'bi-endian' encoding), because IE sees 'utf-16' as an alias for
'unicode' (a little-endian-only encoding with the BOM). Had it not been
for that, a unification in the form of making 'unicode' and
'unicodeFFFE' into aliases of 'UTF-16' would have been much more
straightforward.
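
To make the difference concrete, here is a rough Python sketch of the
behaviour as I understand it. It is only an illustration, not a
description of IE's actual code: the label table and the function name
decode_utf16 are hypothetical, and the table simply models the claim
that 'utf-16' is treated like 'unicode' (little-endian) when there is
no BOM, while a BOM, when present, wins over the label.

```python
import codecs

# Hypothetical label table modelling the behaviour described in this
# thread: 'utf-16' is handled like 'unicode', i.e. as little-endian.
IE_LIKE_LABELS = {
    "unicode": "utf-16-le",
    "utf-16": "utf-16-le",
    "unicodefffe": "utf-16-be",
}


def decode_utf16(data: bytes, label: str, labels=IE_LIKE_LABELS) -> str:
    """Decode 16-bit Unicode data, letting a BOM override the label."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to whatever the label table says.
    return data.decode(labels[label.lower()])


# A big-endian document, with and without the BOM, both sent as 'UTF-16'.
big_endian_with_bom = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")
big_endian_no_bom = "Hi".encode("utf-16-be")

with_bom = decode_utf16(big_endian_with_bom, "UTF-16")
without_bom = decode_utf16(big_endian_no_bom, "UTF-16")

# A little-endian document with a BOM but the 'wrong' label.
mislabelled = decode_utf16(
    codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le"), "unicodeFFFE"
)
```

With the BOM, the label does not matter and the text comes out intact.
Without it, the big-endian bytes are read as little-endian and the
result is two unrelated CJK characters instead of "Hi".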
regards,
Leif H Silli
"Martin J. Dürst", Thu, 15 Dec 2011 18:06:54 +0900:
> Hello Leif,
>
> I haven't had enough time to look at the stuff below in detail. It
> looks like a big can of worms :-(. But we have done other
> registrations in the "HTML legacy murkiness" space recently, so let's
> give this a try and see how far we get.
>
> While it is good to have somebody like you championing the
> registration effort, in this case it would be very helpful to get
> some input from Microsoft (I have cc'ed Shawn) and the Unicode
> Consortium (because of the name 'unicode', I cc'ed Patrick who is the
> official IETF liaison to the Unicode Consortium). I can help with
> that. Input from Paul and François (also cc'ed), authors of
> http://tools.ietf.org/html/rfc2781, may also be valuable.
>
> And yes, if you have some registration templates, please don't
> hesitate to send them, it always helps to have something concrete to
> look at. But personally, I'm rather skeptical about adding aliases
> with murky variations to a registration that went through quite a few
> drafts and then became an RFC.
>
> Regards, Martin.
>
> On 2011/12/15 15:53, Leif Halvard Silli wrote:
>> Hi! I am ready to submit - and have prepared - two registrations, for
>> the 'unicode' and the 'unicodeFFFE' charsets. The two charsets are
>> variants of 'UTF-16', and they differ from each other only with regard
>> to their endianness. Each charset includes the BOM. The registrations
>> are based on Microsoft's specifications:
>>
>> http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
>> http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
>>
>> The purpose of the registrations would be to document "existing
>> practice in a large community", and they should thus "be explicitly
>> marked as being of limited or specialized use and should only be used
>> in Internet messages with prior bilateral agreement".
>>
>> http://tools.ietf.org/html/rfc2978#section-2.5
>>
>> In an ideal world, 'unicode'/'unicodeFFFE' would not need to be
>> registered: We have 'UTF-16', for which the endianness can be signalled
>> via the BOM. Thus one can switch the endianness freely, without having
>> to relabel. For the 'unicode' and 'unicodeFFFE' charsets, by contrast,
>> if one changes the endianness, then one must also switch to the other
>> (or: another) charset label. This is a major reason not to use these
>> charsets.
>>
>> So far so good: Because both charsets include the BOM, the BOM takes
>> precedence - in particular if the name of the label is not supported by
>> the implementation. Opera and Firefox are in that category, for
>> example. And even Microsoft seems to be in that league, as e.g. IE has
>> no problems handling a little-endian file which includes the
>> 'unicodeFFFE' label, as long as the document *also* contains the BOM.
>>
>> However, Microsoft's spec includes one additional detail which is not
>> only impractical but also dangerous: 'utf-16' (formally in lowercase)
>> is seen by the Microsoft spec as an alias for 'unicode' - the
>> little-endian charset variant. This is of course incompatible with the
>> 'UTF-16' charset, and so the registrations I have prepared reject this
>> detail. However, for applications that implement the current Microsoft
>> specification (such as IE), this nevertheless means that if your
>> 'text/html' document is big-endian but without the BOM, and if you
>> then send 'UTF-16' via the charset parameter of the HTTP Content-Type
>> header, then you can be certain that IE treats the document as
>> little-endian, with 'mojibake' as the result.
>>
>> (You probably would not want to send 'UTF-16' via HTTP Content-Type,
>> though - except as a 'back-up' solution in addition to the BOM, because
>> IE does not seem to cache encoding information sent this way. And so
>> your document would be misinterpreted if you used the back button. For
>> XML, by contrast, the situation seems better than for 'text/html'
>> - perhaps because XML defaults to either UTF-8 or UTF-16.)
>>
>> As I pondered over this, I first considered that 'unicode' and
>> 'unicodeFFFE' had to be registered as aliases for 'UTF-16'. However,
>> the fact that each charset supports only one 'subset' of UTF-16 (which
>> is a single charset/encoding), meant that it had to be two charsets, if
>> the Microsoft reality is taken as basis.
>>
>> That said: We should support reality, and not Microsoft reality. And in
>> that regard: Because both charsets include the BOM, it is simple to
>> treat them as aliases for 'UTF-16' - it is only when you create an
>> invalid UTF-16 encoding (that is: you omit the BOM) that legacy IE
>> risks acting up. (IE always considers the BOM before anything else -
>> even before HTTP Content-Type, it seems.)
>>
>> So there are actually two possibilities here: EITHER to update the
>> 'UTF-16' registration to also cover 'unicode' and 'unicodeFFFE' - then
>> we would also pretty much automatically discourage their use as there
>> would be a clear recommendation in place to use the preferred name -
>> 'UTF-16' - instead. OR, the other option: To register them as two
>> separate charsets.
>>
>> Making the two labels into aliases of 'UTF-16' would - formally - give
>> them a more prominent status than registering them independently for
>> 'limited use'. To register them as aliases would be to *not* base them
>> on 'Microsoft reality'. Such a thing could perhaps make Microsoft align
>> itself more with 'UTF-16' as it is registered? Another problem with
>> registering them as independent charsets would be that it would be
>> less clear how non-Microsoft products should handle them. Does anyone
>> know if IE10 is behaving any differently w.r.t. UTF-16? Is there a
>> direction towards the standard?
>>
>> To update the UTF-16 registration seems simple - only a matter of
>> adding the aliases:
>> <http://www.iana.org/assignments/charset-reg/UTF-16>. So I have started
>> to, again, consider that the best option.
>>
>> I had planned to send the registrations now, but I would like to gather
>> some responses first. However, if the expert reviewers would like, I
>> could post the registrations that I have prepared ASAP - often it is
>> better to have something concrete to look at. (That being said, I have
>> covered very many of the issues in this message ...)
>>
>> With regards,
>> Leif H Silli
>>