
Re: Registration of new charset 'unicode'



Hello Leif,

Below are some comments. Please also incorporate the proposals given in
the other mails; they all look good to me.

On 2011/12/20 8:53, Leif Halvard Silli wrote:
> Charset name:
>       'unicode'
>
> Charset aliases:
>        The 'unicode' spec defines 'utf-16' as its alias, but this of
>        course contradicts with 'utf-16' as defined in the IANA registry.

Please reword along the following lines:

Charset aliases:
          None.

          Note: Documents by Microsoft mention 'utf-16' as an alias,
          but this contradicts the registration of 'utf-16'.

(Main points: Put the actual information first, the explanations later, 
and separate them clearly. Don't use the word 'spec' for the Microsoft 
side. Don't use "of course" and similar argumentative wording.)


> Suitability for use in MIME text:
>        The 'unicode' charset has same MIME text media issue as utf-16.
>    [1] http://tools.ietf.org/rfc/rfc2781.txt

Again, put the actual information first. The reader should not have to 
check another document just to get this information.


> Published specification(s):
>        Microsoft's 'Character Set Recognition' document, [1]
>        together with the 'Code Page Identifiers' document.[2]
>    [2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
>    [3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx

There seems to be some mix-up with the reference numbers in []. Just
don't use them; give the URIs in the text itself.


> ISO 10646 equivalency table:
>        The 'unicode' charset represents codepage 1200, whose definition
>        is: [3] 'Unicode UTF-16, little endian byte order (BMP of ISO
>        10646);'

I'd just say "see published specification".

> Additional information:
>      * Byte order mark (BOM): The 'unicode' charset specifications do
> not explain whether the BOM is required or recommended. However,
> without it, products may fail to determine the encoding. And also: The
> BOM allows products that do not support 'unicode' to perceive the
> encoding as 'utf-16'. Hence it is not very surprising that products
> that label documents with the 'unicode' label tend to include the BOM.
> Hence, BOM in 'unicode'-encoded documents should be seen as strongly
> recommended.

Please remove argumentation such as "it is not very surprising" or "and 
also".

Also, I'd separate advice for generation and for reception. E.g. say:
- When sending content, instead of using 'unicode', use 'utf-16' and 
make sure that you have a BOM.

and so on. This also applies to the text below.
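
To illustrate the sending-side advice, here is a minimal Python sketch.
The helper name encode_for_sending is made up for this example, and the
choice of little-endian is only an assumption; the point is that the BOM
is written explicitly and the content is then labelled 'utf-16' rather
than 'unicode'.

```python
import codecs


def encode_for_sending(text: str) -> bytes:
    """Encode text as UTF-16 with an explicit BOM (hypothetical helper).

    Little-endian is an arbitrary choice for this sketch; the BOM tells
    the receiver which byte order was actually used.
    """
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


payload = encode_for_sending("hello")
# The payload would be sent with charset=utf-16, not charset=unicode.
```

Writing the BOM by hand avoids depending on the platform's native byte
order, which is what Python's plain "utf-16" codec would use.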


>        ISSUES to consider before adding support for the 'unicode'
> charset in a product:
>    (1) Document-internal charset declarations: If the label says
> 'unicode' but the resource (including the BOM) is big-endian, products
> (including those that support 'unicode') tend to rely on the BOM and
> ignore the charset label. But note that error handling (e.g.
> mislabelling, including unknown labels, is a fatal error per XML 1.0
> 5th edition) and encoding detection (e.g. as described in Appendix F of
> XML 1.0 5th edition), could also make the charset label technically
> irrelevant. Finally, internal declarations of 16-bit encodings tend to
> be without encoding determinative effect in 16-bit encoded documents -
> in fact, as labels, the different utf-16 labels tend to be more
> effective inside 8-bit encoded documents, where they tend to be treated
> like UTF-8 declarations
>    (2) Document-external encoding declarations:
>        (a) Products that implement Microsoft's 'unicode' specifications
> (in particular Web browsers Internet Explorer and Webkit) in addition
> tend to ignore charset info from HTTP for documents that include the
> BOM. Whereas current web standards (HTTP, XML 1.0, HTML 4 and HTML 5)
> tend to see the charset set by the higher protocol as authoritative.
> Thus, adding support for 'unicode' will not increase the product
> convergence for those cases when the problem is disagreement about the
> order of priority with regard to BOM and HTTP.
>        (b) Little-endian default: If the BOM is lacking while the
> 'unicode' label is present (and detected), then products that support
> the 'unicode' charset  will default to little-endian. (Whereas RFC2781
> requires them to default to big-endian.)
>        (c) The 'utf-16' label: Because the 'unicode' specs define
> 'utf-16' as an alias for 'unicode', a lacking BOM when the 'utf-16'
> label is present (and detected), will cause a little-endian default as
> well. This goes against the utf-16 specification [RFC2781], which for
> 'utf-16' asks for a default to big-endian. (Whereas RFC2781 requires
> them to default to big-endian.)
>
> Intended usage:
>        LIMITED USE. 'Unicode' is used by several Microsoft products
> (.NET, Internet Explorer and more) and products that want to be
> compatible, such as Webkit.

Separate "LIMITED USE" and the remaining text by a line.
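
For the receiving side, the byte-order issues (2)(b) and (2)(c) quoted
above can be sketched as follows. This is only an illustration under
stated assumptions: the function name decode_utf16 and the table
DEFAULT_CODEC are hypothetical, the little-endian default for 'unicode'
is taken from the behaviour described in the draft, and the big-endian
default for 'utf-16' follows RFC 2781 as cited there. It deliberately
does not treat 'utf-16' as an alias of 'unicode'.

```python
import codecs

# Byte order assumed when no BOM is present, keyed by charset label.
# 'unicode' -> little-endian: behaviour described in the draft (assumption).
# 'utf-16'  -> big-endian: the default given in RFC 2781.
DEFAULT_CODEC = {"unicode": "utf-16-le", "utf-16": "utf-16-be"}


def decode_utf16(data: bytes, label: str) -> str:
    """Decode 16-bit content; a BOM, if present, overrides the label."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to the default byte order for the given label.
    return data.decode(DEFAULT_CODEC[label.lower()])
```

With this arrangement a big-endian resource labelled 'unicode' is still
decoded correctly as long as it carries a BOM, which matches the draft's
observation that products tend to rely on the BOM over the label.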

Regards,   Martin.


> Person&  email address to contact for further information:
>        Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no
>