
RE: Registration of new charset CP50220



I really don't see how this makes sense for HTML 5.  HTML 5 apps should really be UTF-8.  If this is for some completeness of code pages in an HTML5 world, people should really look at how practical those code pages are.  Sure, there's lots of non-Unicode stuff out there, but presumably HTML 5 is new stuff, or at least with the opportunity to be converted at the authoring side, which would reduce the chance of cross-platform decoding error greatly.

IMO: this registry is interesting for handling existing content, not for streamlining new content.  It's unclear to me how adding this to the registry adds much value for end users, but if others find it useful, then I don't mind its inclusion.  This isn't going to magically make some sort of perfect error-free decoding.  "My" code (.Net & Windows) isn't even necessarily consistent throughout the years, and the deviations only get worse when you consider other platforms.  People end up depending on bugs, and then get broken when the "bug" is fixed.

I don't know what the intent of this registration is, and I agree that the encoding / decoding difference might not be interesting here, I just thought it was worth mentioning the behavior :)

-Shawn

 
http://blogs.msdn.com/shawnste


________________________________________
From: "Martin J. Dürst" [duerst@it.aoyama.ac.jp]
Sent: Monday, August 30, 2010 7:53 PM
To: Shawn Steele
Cc: Masatoshi Kimura; NARUSE, Yui; ietf-charsets
Subject: Re: Registration of new charset CP50220

Two comments:

- If what we need (for HTML5, as far as I understand) isn't exactly
   what Windows software is doing, then we should not use the name
   CP50220 for the registration, but should come up with some other
   name. But the origin of strange provisions in HTML5 such as "treat
   content labeled as iso-8859-1 as if it were windows-1252" is
   "because IE did so". So the browsers might as well follow IE exactly,
   not just almost, in which case we could use the name CP50220.

- The charset registry currently has no way to express "On creation
   (encoding), limited to 'foo', but on interpretation (decoding), also
   take into account 'bar'.". RFC 2978 defines a 'charset' as "a method
   of converting a sequence of octets into a sequence of characters".
   We may be able to deal with this by adding comments, but maybe in the
   long term, this could be a change needed in an update to the RFC.
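
To make the encode/decode asymmetry concrete, here is a minimal Python
sketch. It is only an illustration under stated assumptions, not the
actual Windows or Mozilla code: the function name
encode_folding_halfwidth is hypothetical, NFKC normalization stands in
for whatever halfwidth-to-fullwidth mapping CP50220 really uses, and
Python's iso2022_jp codec stands in for the registered charset.

```python
import re
import unicodedata

# Halfwidth CJK punctuation and halfwidth katakana (U+FF61..U+FF9F).
_HALFWIDTH_RUN = re.compile("[\uFF61-\uFF9F]+")


def encode_folding_halfwidth(text):
    """Encode to 7-bit ISO-2022-JP, folding halfwidth katakana first.

    Assumption: NFKC is used as the folding rule. It maps each
    halfwidth katakana to its fullwidth form and composes a following
    halfwidth voiced sound mark where a precomposed character exists.
    The real CP50220 mapping may differ in detail.
    """
    folded = _HALFWIDTH_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )
    return folded.encode("iso2022_jp")


if __name__ == "__main__":
    original = "\uFF76\uFF9E"            # halfwidth KA + voiced sound mark
    octets = encode_folding_halfwidth(original)
    decoded = octets.decode("iso2022_jp")
    # The round trip is lossy by design: what comes back is fullwidth.
    print(decoded == original)           # False
    print(decoded == "\u30AC")           # True (fullwidth GA)
```

The point is that the encoder emits only a subset of what a decoder may
be asked to accept, so decode(encode(x)) need not equal x.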

Regards,    Martin.

On 2010/08/31 8:20, Shawn Steele wrote:
> Windows, .Net & MLang aren't going to change the behavior of these code pages; it would break people.  Instead we'd encourage customers to use UTF-8, particularly if they're having problems.
>
> I was sort of assuming that since you're using the Windows nomenclature, you're attempting to pin down the behavior for some sort of interoperability when you see the Windows names.  It is, perhaps, odd for the "7 bit" form to do something when it sees 8 bit data, but I was just letting you know that's what it does :)  I'm sure there are also other subtle discrepancies between the 5022x behavior and the official standards, but we're pretty much stuck with the existing behavior.
>
> If Mozilla were to target the Windows CP50220 behavior specifically (as opposed to the more general iso-2022-jp), then how exactly they wanted to follow that behavior would be up to them.  If they thought that just mapping it to iso-2022-jp was acceptable and more convenient, then that would be their choice, the same way we map iso-2022-jp to 50220 even though it isn't a perfect match.
>
> -Shawn
>
> -----Original Message-----
> From: Masatoshi Kimura [mailto:VYV03354@nifty.ne.jp]
> Sent: Monday, August 30, 2010 4:07 PM
> To: Shawn Steele
> Cc: NARUSE, Yui; ietf-charsets
> Subject: Re: Registration of new charset CP50220
>
> The purpose of this registration is to "standardize" how to handle errors when Web browsers encounter illegal ISO-2022-JP sequences.
> The Mozilla encoder has changed its halfwidth katakana handling to match the behavior.
> https://bugzilla.mozilla.org/show_bug.cgi?id=563283
>
> > Decoding is identical (which might be most interesting for users
> > of tagged content).
> The first version of the registration included all decoding methods that are supported by CP50220 (i.e. ESC ( J, SI, and 8-bit). However, the latter two methods were removed from the registration for two reasons.
>
> 1. Some implementations (e.g. Mozilla's) don't support them.
> Should the Mozilla decoder be changed to match the behavior?
>
> 2. The charset is supposed to be 7-bit. It's strange to include 8-bit character handling.
> Changing the registration to 8-bit is not a solution because it would require the Content-Transfer-Encoding MIME header field. It is not compatible with ISO-2022-JP. Old Microsoft Internet Mail/News had the bug.
>
>

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp