
Re: How to register 'unicode'/'unicodeFFFE' ?



Anne van Kesteren, Thu, 15 Dec 2011 12:45:06 +0100:
> On Thu, 15 Dec 2011 07:53:11 +0100, Leif Halvard Silli wrote:
>> Hi! I am ready to submit - and have prepared - two registrations for
>> the 'unicode' and the 'unicodeFFFE' charsets. The two charsets are
>> variants of 'UTF-16', and they differ from each other only with regard
>> to their endianness. Each charset includes the BOM. The registrations
>> are based on Microsoft's specifications:
>> 
>> http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
>> http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
> 
> Are there any other browsers that implement these?

Yes and no: only IE implements them 100 percent - AFAICT - as specified 
in the above documents.

Chrome and Safari do support 'unicode' and 'unicodeFFFE', but they 
don't treat 'utf-16' as a synonym for 'unicode'. Otherwise they treat 
the names as 16-bit encoding references: for instance, if you have an 
8-bit encoded page which says <meta charset='UNICODE' > or <meta 
charset='UNICODEfffe' >, then Safari/Webkit will default to UTF-8 
instead, just like HTML5 says one should. Firefox and Opera, by 
contrast, do not know about those charset names, do not treat them as 
16-bit encoding references, and hence default to the locale encoding. 
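
To make the difference concrete, here is a minimal Python sketch of the 
two behaviours described above. It is not any browser's actual code: 
the function name, the label sets and the 'windows-1252' locale default 
are assumptions chosen for illustration.

```python
# Hypothetical label sets, limited to the names discussed in this thread.
MS_LABELS = {"unicode", "unicodefffe"}
STANDARD_UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}


def meta_charset_result(label, knows_ms_labels, locale_default="windows-1252"):
    """Guess the encoding picked for a <meta charset> label in an 8-bit page.

    knows_ms_labels=True models the Webkit-like behaviour, False models the
    Firefox/Opera-like behaviour. locale_default is an assumed placeholder.
    """
    name = label.strip().lower()
    # The <meta> could be read as 8-bit text, so the page cannot really be
    # UTF-16: a recognised 16-bit label is replaced by UTF-8.
    if name in STANDARD_UTF16_LABELS or (knows_ms_labels and name in MS_LABELS):
        return "utf-8"
    # The label is not recognised at all: fall back to the locale encoding.
    if name in MS_LABELS:
        return locale_default
    # Any other label is passed through unchanged in this sketch.
    return name
```

With this sketch, meta_charset_result("UNICODE", True) gives "utf-8", 
while the same label with knows_ms_labels=False gives the assumed 
locale default.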

On Mac OS X, Microsoft Office for Mac uses Webkit/Safari for its 
browser parsing needs. I suspect that this could be one reason why 
Webkit supports 'unicode' and 'unicodeFFFE', which get sprinkled into 
your MS Word-generated HTML file if you select 'Unicode' or 'Unicode 
(Big endian)' when saving as HTML.

Another interesting detail is that Webkit *fails* to handle big-endian 
UTF-16 - a bit like IE - if the file does not contain a BOM. (But 
Webkit makes it a bit too simple for itself: IE *does* render such 
pages, as long as it gets told about the encoding via HTTP 
Content-Type. However, it has some cache problems. Opera and Firefox 
don't have the problems of Webkit and IE.)
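
The BOM-versus-Content-Type distinction can be sketched in Python as 
follows. This is only an illustration of the order of checks, not a 
description of how any of the browsers is implemented; sniff_utf16_bom, 
decode_page and the http_charset parameter are hypothetical names.

```python
import codecs


def sniff_utf16_bom(data):
    """Return the UTF-16 byte order announced by a leading BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


def decode_page(data, http_charset=None):
    """Decode a page: the BOM wins, then the charset from HTTP Content-Type."""
    bom_encoding = sniff_utf16_bom(data)
    if bom_encoding is not None:
        # Skip the two BOM bytes and decode in the byte order it announced.
        return data[2:].decode(bom_encoding)
    if http_charset is not None:
        # No BOM: rely on what the server declared (the case IE can handle).
        return data.decode(http_charset)
    raise ValueError("no BOM and no declared charset")
```

A big-endian file without a BOM only decodes here when http_charset is 
supplied, e.g. decode_page(data, "utf-16-be").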

> Are they in use on any documents?

Yes. Word is used to generate a lot of documents. As it turns out, 
Google Search (and even Google Translate) has problems handling 
UTF-16, at least certain flavors of it. As a result, for many 
UTF-16-variant encoded pages, Google Search will return, as the search 
result, the source code of the document rather than its humanly 
consumable content. So this search URL should give you lots of such 
documents:

<http://www.google.no/search?q=%22charset%3Dunicode%22+%22urn%22+&btnG=S%C3%B8k&hl=en&source=hp&gbv=2>

And according to the Opera MAMA project's results, 'unicode' was 
the 29th most frequently found charset value:

http://devfiles.myopera.com/articles/575/metacenc-url.htm

What I initially was most interested in discovering was UTF-8 encoded 
pages which were erroneously labelled as charset=unicode or 
charset=unicodeFFFE. I did find 1 or 2 of those; however, such pages are 
not that easy to discover via Google - one needs a better research tool.
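
As a rough idea of what such a tool could check per page, here is a 
small Python sketch. It is an assumption about one possible heuristic, 
not something I have run against real data; looks_mislabelled and 
MS_LABELS are hypothetical names.

```python
# Hypothetical set of the two Microsoft labels discussed in this thread.
MS_LABELS = {"unicode", "unicodefffe"}


def looks_mislabelled(data, declared_label):
    """True if a page declared as 'unicode'/'unicodeFFFE' looks like UTF-8."""
    if declared_label.strip().lower() not in MS_LABELS:
        return False
    # A UTF-16 BOM means the label is probably honest.
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return False
    # BOM-less UTF-16 markup is full of NUL bytes, which would also pass
    # as valid UTF-8, so treat any NUL byte as a sign of real UTF-16.
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The heuristic is deliberately conservative: a pure-ASCII page with such 
a label is reported as mislabelled too, since it is valid UTF-8.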

regards,
Leif Halvard Silli