
Re: Registration of new charset [ISO-2022-JP-2004]

> Hello,

> I have a few questions about this registration:

>  > At 00:28 06/09/28, Koichi Yasuoka wrote:
>  > >=?ISO-2022-JP-2004?Q?=1B=24B0B2=2C9=270l=1B=28B?=

> I believe that, in general, many of us recommend being conservative in
> what you send out, liberal in what you accept. Therefore, the
> recommendation is to use the charset label that matches the smallest
> subset of characters actually used in the text, as well as using the
> oldest and/or most commonly accepted name.

I agree with the sentiment, but I'm a bit concerned about the choice of metric.
Taken to an extreme, picking the charset that most closely aligns with the
repertoire used can lead to the use of some very obscure charsets that don't
enjoy wide support. I think a more reasonable approach is to choose a
charset based both on its alignment with what you're doing and on how well
supported it is. Hopefully, as time goes on and UTF-8 support becomes truly
ubiquitous, it will become the charset of choice, never mind the fact that most
uses of it won't involve more than a small fraction of its repertoire.

> In this case, you are
> clearly using the ESC $ B (1B 24 42) that is part of iso-2022-jp (rfc
> 1468). Therefore, the more conservative option is to use the name
> iso-2022-jp when sending this particular piece of text.

Yes, in this specific case it would be better to use iso-2022-jp.
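
One rough way to apply that rule mechanically is to look at which escape
sequences the outgoing text actually uses. The Python sketch below is only an
illustration. The function name conservative_label and the fallback label
argument are made up for the example, the set of escape sequences is the one
defined by RFC 1468, and text with no escape sequences at all is assumed to be
plain ASCII. It looks only at escape sequences, not at the characters between
them.

```python
import re

# Escape sequences defined by RFC 1468 (iso-2022-jp):
# ESC ( B, ESC ( J, ESC $ @, ESC $ B
RFC1468_ESCAPES = {b"\x1b(B", b"\x1b(J", b"\x1b$@", b"\x1b$B"}

# An ISO 2022 designation: ESC, one or more intermediate bytes (0x20-0x2F),
# then a single final byte (0x30-0x7E).
ESCAPE = re.compile(rb"\x1b[\x20-\x2f]+[\x30-\x7e]")

def conservative_label(raw, fallback="iso-2022-jp-2004"):
    """Pick the oldest label whose escape sequences cover the text.

    Hypothetical helper: `raw` is the already-encoded byte string and
    `fallback` is the label used when something outside RFC 1468 appears.
    """
    used = set(ESCAPE.findall(raw))
    if not used:
        return "us-ascii"          # assumption: no escapes means plain ASCII
    if used <= RFC1468_ESCAPES:
        return "iso-2022-jp"
    return fallback

# The payload of the encoded-word quoted above uses only ESC $ B and ESC ( B.
print(conservative_label(b"\x1b$B0B2,9'0l\x1b(B"))
```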

> I have noticed over the years that if you don't spell out the
> recommendations, implementors will do the wrong thing. In this case,
> would it be a good idea to add such recommendations to the
> registration itself? Or should a new RFC be written, in order to
> provide the recommendations in more detail?

I certainly would have no problem with such a recommendation being part of the
registration, although I'm not quite sure how I'd embody such a recommendation
in an actual implementation.

>  > >I understand that ISO-2022-JP texts with "ESC $ B" and
>  > >"ESC ( B" can be accepted by ISO-2022-JP-2004 decoder.
>  > >It is problematic when "ESC $ @" or "ESC ( J" is used but
>  > >they are very rare now.

> Which escape sequences are permitted in iso-2022-jp-2004? There are 3
> problems with the link you sent earlier*: The first page is in
> Japanese, and when you search for X0213, the results are in Japanese
> too. Then X0213 is split into many PDFs, and it is not clear which one
> to download in order to see the escape sequences, nor am I inclined to
> download all of the pieces. Finally, that site was down yesterday and
> up today. How often does it go down?

> * http://www.jisc.go.jp/app/JPS/JPSO0020.html

>  > >Now I know "ISO-2022-JP-2004 vs Unicode
>  > >mapping table" at http://x0213.org/codetable/iso-2022-jp-2004-std.txt

> I wonder whether either or both of these links would be good to have
> in the registration:

> http://www.itscj.ipsj.or.jp/ISO-IR/233.pdf
> http://www.itscj.ipsj.or.jp/ISO-IR/

> Erik van der Poel
> Editor and co-author of RFC 1468 (iso-2022-jp)

Well, since you kinda sorta brought it up, I have an issue to raise with RFC
1468. In the formal syntax, it says:

    single-byte-seq     = ESC "(" ( "B" / "J" )
    single-byte-segment = single-byte-seq 1*single-byte-char

The "1*" in this turns out to be fairly problematic in how it interacts with
MIME encoded words. To use the earlier encoded-word as the basis for an
example, suppose you have a header field containing:

  =?ISO-2022-JP?Q?=1B=24B0B2=2C9=270l=1B=28B?= =?ISO-2022-JP?Q?=1B=24B0B2=2C9=270l=1B=28B?=

According to encoded-word rules, the space between adjacent encoded words is
supposed to be discarded when decoding. So this decodes to a sequence that has
this in it:

  ESC ( B ESC $ B

And this is illegal according to the formal grammar, which basically says that
ESC ( B either has to appear at the end of a segment or else has to be followed
by some amount of ASCII text. And unfortunately there are implementations out
there that refuse to decode this. (IMO such implementations are in violation of
the robustness principle, but they are technically within their rights
according to the standards.)
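
The decoding step described above can be reproduced with a few lines of
Python. This is a minimal sketch, not a full RFC 2047 decoder: q_decode and
decode_field are names invented for the example, only "Q"-encoded words are
handled, and any unencoded text in the field is ignored.

```python
import re

# Matches a Q-encoded RFC 2047 encoded-word: =?charset?Q?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?[Qq]\?([^?]*)\?=")

def q_decode(payload):
    """Decode the "Q" encoding: "=XX" is a hex octet, "_" is a space."""
    out = bytearray()
    i = 0
    while i < len(payload):
        ch = payload[i]
        if ch == "=":
            out.append(int(payload[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(0x20 if ch == "_" else ord(ch))
            i += 1
    return bytes(out)

def decode_field(field):
    """Concatenate the payloads of all encoded-words in the field.

    Whitespace between adjacent encoded-words is discarded, so the
    decoded payloads end up directly next to each other.
    """
    return b"".join(q_decode(m.group(2)) for m in ENCODED_WORD.finditer(field))

word = "=?ISO-2022-JP?Q?=1B=24B0B2=2C9=270l=1B=28B?="
raw = decode_field(word + " " + word)

# The end of the first word (ESC ( B) now touches the start of the
# second (ESC $ B), giving an ASCII segment with no characters in it.
print(b"\x1b(B\x1b$B" in raw)   # True
```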

Addressing this means that the encoded word decoder (which often operates at a
completely different level than charset handling) has to be made charset-aware
enough to know to remove the ESC ( B ESC $ B sequence in its entirety, which is
more than a little ugly.
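
A rough sketch of that kind of charset-aware cleanup follows. The function
name join_payloads is hypothetical, and the input is assumed to be the
already Q- or B-decoded payloads of adjacent iso-2022-jp encoded-words. Note
that this variation drops only the ESC ( B that is immediately followed by
another escape sequence, rather than the whole ESC ( B ESC $ B pair. The
decoded text is the same, and it avoids having to know which character set
was designated before the ESC ( B.

```python
import re

# ESC ( B immediately followed by another ESC, i.e. an ASCII segment
# that contains no characters at all.
EMPTY_ASCII_SEGMENT = re.compile(rb"\x1b\(B(?=\x1b)")

def join_payloads(payloads):
    """Join decoded encoded-word payloads and drop empty ASCII segments.

    Hypothetical helper. ESC (0x1B) never appears as character data in
    iso-2022-jp, so every match really is an escape sequence.
    """
    joined = b"".join(payloads)
    return EMPTY_ASCII_SEGMENT.sub(b"", joined)

payload = b"\x1b$B0B2,9'0l\x1b(B"      # ESC $ B ... ESC ( B
cleaned = join_payloads([payload, payload])
print(cleaned)
```

The ugliness is visible even in something this small: the code that joins
encoded-words has to know what an iso-2022-jp escape sequence looks like.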

The obvious remedy is to change the rule to be:

    single-byte-segment = single-byte-seq 0*single-byte-char

But this then brings up the concern that it will make some currently compliant
implementations noncompliant.
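
To make the difference between the two rules concrete, here is a small
checker that can be built with either "1*" or "0*". It is a simplified model,
not the full RFC 1468 grammar. The two single-byte rules are the ones quoted
above. The double-byte rule, the character classes, and the overall shape
(leading ASCII, any number of segments, an optional trailing single-byte-seq)
are assumptions made for the example, and segment_checker is an invented name.

```python
import re

SB_SEQ = rb"\x1b\([BJ]"                       # ESC "(" ( "B" / "J" )
DB_SEQ = rb"\x1b\$[@B]"                       # assumed: ESC "$" ( "@" / "B" )
SB_CHAR = rb"[\x00-\x0d\x10-\x1a\x1c-\x7f]"   # assumed: 7-bit, no ESC/SO/SI
DB_CHAR = rb"[\x21-\x7e]{2}"                  # assumed: two bytes per char

def segment_checker(quantifier):
    """Build a checker; pass b"+" for the 1* rule or b"*" for the 0* rule."""
    sb_segment = SB_SEQ + SB_CHAR + quantifier
    db_segment = DB_SEQ + rb"(?:" + DB_CHAR + rb")+"
    segment = rb"(?:" + sb_segment + rb"|" + db_segment + rb")"
    # Leading ASCII, then segments, then an optional bare single-byte-seq
    # so that ESC ( B may still appear at the very end of the text.
    return re.compile(SB_CHAR + rb"*" + segment + rb"*(?:" + SB_SEQ + rb")?")

strict = segment_checker(b"+")    # single-byte-seq 1*single-byte-char
relaxed = segment_checker(b"*")   # single-byte-seq 0*single-byte-char

one = b"\x1b$B0B2,9'0l\x1b(B"     # a single decoded encoded-word
two = one + one                    # two adjacent encoded-words, joined

print(strict.fullmatch(one) is not None)
print(strict.fullmatch(two) is not None)
print(relaxed.fullmatch(two) is not None)
```

Under this model the single word passes either way, while the joined pair
passes only with the relaxed rule.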

I frankly don't see a good way to fix this. Suggestions?

				Ned