Re: Registration of new charset 'unicode'
Hello Leif,
Below are some comments. Please also incorporate the proposals given in
the other mails; they all look good to me.
On 2011/12/20 8:53, Leif Halvard Silli wrote:
> Charset name:
> 'unicode'
>
> Charset aliases:
> The 'unicode' spec defines 'utf-16' as its alias, but this of
> course contradicts with 'utf-16' as defined in the IANA registry.
Please reword along the following lines:
Charset aliases:
None.
Note: Documents by Microsoft mention 'utf-16' as an alias,
but this contradicts the registration of 'utf-16'.
(Main points: Put the actual information first, the explanations later,
and separate them clearly. Don't use the word 'spec' for the Microsoft
side. Don't use "of course" and similar argumentative wording.)
> Suitability for use in MIME text:
> The 'unicode' charset has same MIME text media issue as utf-16.
> [1] http://tools.ietf.org/rfc/rfc2781.txt
Again, put the actual information first. The reader should not have to
check another document just to get this information.
> Published specification(s):
> Microsoft's 'Character Set Recognition' document, [1]
> together with the 'Code Page Identifiers' document.[2]
> [2] http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
> [3] http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
There seems to be some mixup with the numbers in []. Just don't use
them; give the URIs in the text itself.
> ISO 10646 equivalency table:
> The 'unicode' charset represents codepage 1200, whose definition
> is: [3] 'Unicode UTF-16, little endian byte order (BMP of ISO
> 10646);'
I'd just say "see published specification".
> Additional information:
> * Byte order mark (BOM): The 'unicode' charset specifications do
> not explain whether the BOM is required or recommended. However,
> without it, products may fail to determine the encoding. And also: The
> BOM allows products that do not support 'unicode' to perceive the
> encoding as 'utf-16'. Hence it is not very surprising that products
> that label documents with the 'unicode' label tend to include the BOM.
> Hence, BOM in 'unicode'-encoded documents should be seen as strongly
> recommended.
Please remove argumentation such as "it is not very surprising" or "and
also".
Also, I'd separate advice for generation and for reception. E.g. say:
- When sending content, instead of using 'unicode', use 'utf-16' and
make sure that you have a BOM.
and so on. This also applies to the text below.
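As an illustration of the sending advice above, here is a minimal sketch in Python. The function name encode_for_sending is hypothetical, and the choice of little-endian byte order is an assumption (it matches what the draft says 'unicode' producers emit); a big-endian payload with the matching BOM would serve equally well.

```python
import codecs
from typing import Tuple


def encode_for_sending(text: str) -> Tuple[str, bytes]:
    """Return a (charset label, payload) pair for outgoing content.

    Hypothetical helper: labels the content 'utf-16' instead of
    'unicode' and always starts the payload with a BOM.
    """
    # Assumption: little-endian byte order; the BOM makes it explicit.
    payload = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return "utf-16", payload
```

A receiver that understands 'utf-16' can then determine the byte order from the BOM alone, without knowing the 'unicode' label.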
> ISSUES to consider before adding support for the 'unicode'
> charset in a product:
> (1) Document-internal charset declarations: If the label says
> 'unicode' but the resource (including the BOM) is big-endian, products
> (including those that support 'unicode') tend to rely on the BOM and
> ignore the charset label. But note that error handling (e.g.
> mislabelling, including unknown labels, is a fatal error per XML 1.0
> 5th edition) and encoding detection (e.g. as described in Appendix F of
> XML 1.0 5th edition), could also make the charset label technically
> irrelevant. Finally, internal declarations of 16-bit encodings tend to
> be without encoding determinative effect in 16-bit encoded documents -
> in fact, as labels, the different utf-16 labels tend to be more
> effective inside 8-bit encoded documents, where they tend to be treated
> like UTF-8 declarations
> (2) Document-external encoding declarations:
> (a) Products that implement Microsoft's 'unicode' specifications
> (in particular Web browsers Internet Explorer and Webkit) in addition
> tend to ignore charset info from HTTP for documents that include the
> BOM. Whereas current web standards (HTTP, XML 1.0, HTML 4 and HTML 5)
> tend to see the charset set by the higher protocol as authoritative.
> Thus, adding support for 'unicode' will not increase the product
> convergence for those cases when the problem is disagreement about the
> order of priority with regard to BOM and HTTP.
> (b) Little-endian default: If the BOM is lacking while the
> 'unicode' label is present (and detected), then products that support
> the 'unicode' charset will default to little-endian. (Whereas RFC2781
> requires them to default to big-endian.)
> (c) The 'utf-16' label: Because the 'unicode' specs define
> 'utf-16' as an alias for 'unicode', a lacking BOM when the 'utf-16'
> label is present (and detected), will cause a little-endian default as
> well. This goes against the utf-16 specification [RFC2781], which for
> 'utf-16' asks for a default to big-endian. (Whereas RFC2781 requires
> them to default to big-endian.)
>
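For the reception side, the behaviour described in (1), (2)(b) and (2)(c) could be sketched roughly as follows. This is only an illustration of the draft's description, not a reference implementation: the function name decode_16bit and the flag utf16_is_unicode_alias are hypothetical, and the big-endian default for 'utf-16' follows RFC 2781 as the draft characterises it.

```python
import codecs


def decode_16bit(label: str, data: bytes,
                 utf16_is_unicode_alias: bool = False) -> str:
    """Decode 16-bit content, letting the BOM override the label.

    utf16_is_unicode_alias (hypothetical flag) models products that
    treat 'utf-16' as an alias of 'unicode', as in item (2)(c).
    """
    # A BOM, when present, decides the byte order regardless of the label.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")

    label = label.strip().lower()
    # No BOM: 'unicode' (and 'utf-16' when aliased) defaults to little-endian.
    if label == "unicode" or (label == "utf-16" and utf16_is_unicode_alias):
        return data.decode("utf-16-le")
    # No BOM: plain 'utf-16' defaults to big-endian, following RFC 2781.
    if label == "utf-16":
        return data.decode("utf-16-be")
    raise ValueError("unsupported charset label: %r" % label)
```

The two defaults disagree only for BOM-less input, which is why the advice to always include a BOM removes the ambiguity.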
> Intended usage:
> LIMITED USE. 'Unicode' is used by several Microsoft products
> (.NET, Internet Explorer and more) and products that want to be
> compatible, such as Webkit.
Separate "LIMITED USE" and the remaining text by a line.
Regards, Martin.
> Person & email address to contact for further information:
> Leif Halvard Silli, xn--mlform-iua&xn--mlform-iua.no
>