
Re: Registration of new charset [ISO-2022-JP-2004]



> >  > >=?ISO-2022-JP-2004?Q?=1B=24B0B2=2C9=270l=1B=28B?=
>
> > I believe that, in general, many of us recommend being conservative in
> > what you send out, liberal in what you accept. Therefore, the
> > recommendation is to use the charset label that matches the smallest
> > subset of characters actually used in the text, as well as using the
> > oldest and/or most commonly accepted name.
>
> I agree with the sentiment but I'm a bit concerned as to the choice of metric.
> Taken to an extreme, picking a charset that most closely aligns with the
> repertoire used can lead to the use of some very obscure charsets that don't
> enjoy wide support.

True. :-) I should have been more careful with my wording. Maybe one
of the RFCs already says something about choosing the smallest or most
commonly used charset.

> Well, since you kinda sorta brought it up, I have an issue to raise with RFC
> 1468. In the formal syntax, it says:
>
>     single-byte-seq     = ESC "(" ( "B" / "J" )
>     single-byte-segment = single-byte-seq 1*single-byte-char
>
> The "1*" in this turns out to be fairly problematic in how it interacts with
> MIME encoded words. To use the earlier encoded-word as the basis for an
> example, suppose you have a header field containing:
>
>   =?ISO-2022-JP?Q?=1B=24B0B2=2C9=270l=1B=28B?= =?ISO-2022-JP?Q?=1B=24B0B2=2C9=270l=1B=28B?=
>
> According to encoded-word rules, the space between adjacent encoded words is
> supposed to be discarded when decoding. So this decodes to a sequence that has
> this in it:
>
>   ESC ( B ESC $ B
>
> And this is illegal according to the formal grammar, which basically says that
> ESC ( B either has to appear at the end of a segment or else has to be followed
> by some amount of ASCII text. And unfortunately there are implementations out
> there that refuse to decode this. (IMO such implementations are in violation of
> the robustness principle, but they are technically within their rights
> according to the standards.)
>
> Addressing this means that the encoded word decoder (which often operates at a
> completely different level than charset handling) has to be made charset-aware
> enough to know to remove the ESC ( B ESC $ B sequence in its entirety, which is
> more than a little ugly.
>
> The obvious remedy is to change the rule to be:
>
>     single-byte-segment = single-byte-seq 0*single-byte-char
>
> But this then brings up the concern that it will make some currently compliant
> implementations noncompliant.
>
> I frankly don't see a good way to fix this. Suggestions?

I don't know whether people would like this solution, but one way
would be to have one spec for decoding and another for encoding. When
encoding, the rule would stay 1*; when decoding, it could be * (the
same as 0*). But perhaps this would interact badly with other parts
of the spec?
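
To make the decoding side concrete, here is a rough Python sketch of
what a lenient decoder could do with the example above. It is only an
illustration under my own assumptions: the function names are made up,
and instead of removing the whole ESC ( B ESC $ B run it drops just the
empty ESC ( B (or ESC ( J), which leaves two adjacent double-byte
segments for the charset decoder.

```python
import quopri
import re

# An empty single-byte segment: ESC ( B or ESC ( J followed immediately
# by another escape sequence, i.e. zero single-byte characters.
# Double-byte characters only use bytes 0x21-0x7E, so ESC always starts
# an escape sequence and this pattern cannot match inside a character.
EMPTY_SINGLE_BYTE_SEGMENT = re.compile(rb"\x1b\([BJ](?=\x1b)")


def q_decode(payload: bytes) -> bytes:
    """Decode the payload of a Q-encoded word (the part after ?Q?)."""
    return quopri.decodestring(payload, header=True)


def join_encoded_words(payloads) -> bytes:
    """Concatenate decoded payloads, as happens once the whitespace
    between adjacent encoded-words has been discarded."""
    return b"".join(q_decode(p) for p in payloads)


def violates_one_or_more_rule(data: bytes) -> bool:
    """True if the data contains a single-byte-seq with no characters
    after it, which the 1* rule forbids mid-text."""
    return EMPTY_SINGLE_BYTE_SEGMENT.search(data) is not None


def drop_empty_single_byte_segments(data: bytes) -> bytes:
    """Lenient (0*) handling: remove the empty segments before decoding."""
    return EMPTY_SINGLE_BYTE_SEGMENT.sub(b"", data)


# Payload taken from the encoded-word quoted above.
payload = b"=1B=24B0B2=2C9=270l=1B=28B"
joined = join_encoded_words([payload, payload])
cleaned = drop_empty_single_byte_segments(joined)
```

Here joined contains ESC ( B ESC $ B in the middle, and cleaned does
not, while still ending in ESC ( B.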

When encoding encoded-words, each piece of iso-2022-jp text is to be
encoded separately, so there would have to be an ESC ( B or ESC ( J at
the end of each piece, if not already in that state.
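
For the encoding side, a similarly rough sketch, again with made-up
helper names. It assumes Python's built-in iso2022_jp codec, which
returns to ASCII at the end of each encode call, and it uses a
deliberately crude Q encoder that escapes everything except ASCII
letters and digits. It does not attempt the encoded-word length limit.

```python
import string

# Bytes left unescaped by this simplified Q encoder (an assumption made
# for brevity; real encoders leave a few more characters unescaped).
SAFE = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def q_encode(data: bytes) -> str:
    """Q-encode raw bytes, escaping everything outside SAFE as =XX."""
    return "".join(chr(b) if b in SAFE else f"={b:02X}" for b in data)


def encode_pieces(pieces, charset="ISO-2022-JP"):
    """Encode each piece separately, so that every encoded-word starts
    and ends in the ASCII state on its own."""
    raw_pieces = [piece.encode("iso2022_jp") for piece in pieces]
    words = [f"=?{charset}?Q?{q_encode(raw)}?=" for raw in raw_pieces]
    return raw_pieces, " ".join(words)


# Three kanji written as escapes to keep this message ASCII-only.
text = "\u65e5\u672c\u8a9e"
raw_pieces, header = encode_pieces([text, text])
```

Each element of raw_pieces ends in ESC ( B, so each encoded-word in
header can be decoded without knowing anything about its neighbours.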

Erik