Re: Suggested character set policy for the IETF
> > > The first part of your definition, "mapping from octets to characters",
> > > is very widely known and used. The second part of the definition, "related
> > > presentation information", is new to me. Is this your own definition,
> > > or where did you find it? What exactly does the term "presentation
> > > information" mean for you? How do you assure that it means the same
> > > thing for others?
> > The "related presentation information" is a missing portion of the
> > definition. There are things like CRLF, character directionality, Unicode
> > joiner/no-joiners, etc. which affect presentation but are not "characters"
> > in the traditional sense.
> I see. The cases you mention are of course perfectly reasonable and
> necessary. They are also subsumed under the term character in the
> sense it is used in standards, which distinguishes (or should I say
> distinguished?) between control characters and graphic characters.
I respectfully beg to differ. The definition given for "character" in RFC2130
Appendix C is:
Character - A single graphic symbol represented by sequence of one or
more bytes.
I don't know of an earlier definition of "character" in an RFC. (Nathaniel and
I deliberately avoided having one in MIME.) There was a terminology document
floating around some time ago that defined all this stuff but I don't think it
ever became an RFC. And I believe it defined "character" the same way that
RFC2130 does in any case.
Now, there may be some standards group out there that uses the term "character"
consistently to mean "graphic or control character", but if so I don't know
what that group is. (It certainly isn't the ISO, as ISO terminology for this
stuff has flitted all over the place over time.)
Both because of this definition and because of other interoperability issues,
the definition of a character set in MIME pretty much has to change.
For one thing, registering UTF-8 as a charset is technically illegal right now.
And I happen to despise standards that are worded to allow this sort of clearly
bogus reading, as in general they tend to weaken the standards process.
> > Suggestions for making it more precise would be helpful. It'd be nice to
> > get this right in the next revision of the MIME specification.
> Well, in my opinion, including something like "presentation" is
> very dangerous. Soon you have people claiming that font information,
> or whatever, has to be part of a "charset". Making the definition
> more precise would be nice, but would probably take too many lines.
> Just leaving it at "characters", and maybe referring to some of the
> ISO work in that area for somebody who really wants to check, should
> be okay.
I'm sorry, but it is not OK, unless you think that not being able to register
UTF-8 under the new rules and not being able to advance MIME to full standard
is OK.
As far as your opinion of the term "presentation" goes, my position is that the
term we use is largely irrelevant, and if it makes you happier I'll use "control
information" instead. What matters is that the definition allow this sort of
information as an output of the charset to character conversion process.
We could of course do this by amending the definition of a character in RFC2130
to mean "graphic or control character". But then we're left with the task of
defining a "control character". Because of this I actually prefer language that
equates "character" with "graphic symbol" and talks about the conversion
process also producing control information as an output. I think we can get
away with not defining "control information" specifically; I don't think the
same is true for "control character".
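
As a rough illustration of that wording, here is a small Python sketch of a
conversion that yields graphic symbols plus control information as a separate
output. None of this comes from MIME or RFC2130: the function name convert and
the choice of Unicode general categories Cc and Cf as "control information"
are assumptions made purely for the example.

```python
import unicodedata

# Assumption for this sketch: general categories Cc (control) and Cf (format)
# count as "control information"; everything else is a graphic symbol.
CONTROL_CATEGORIES = {"Cc", "Cf"}

def convert(octets: bytes, charset: str = "utf-8"):
    """Decode octets, returning (graphic symbols, control information)."""
    graphic = []
    control = []  # (index among decoded code points, code point label)
    for index, ch in enumerate(octets.decode(charset)):
        if unicodedata.category(ch) in CONTROL_CATEGORIES:
            control.append((index, f"U+{ord(ch):04X}"))
        else:
            graphic.append(ch)
    return "".join(graphic), control

# CRLF and a zero-width joiner end up in the control output, not the text.
text, info = convert("ab\r\nc\u200dd".encode("utf-8"))
print(text)  # abcd
print(info)  # [(2, 'U+000D'), (3, 'U+000A'), (5, 'U+200D')]
```

Note that the sketch never has to say what a "control character" is in
general; it only has to say that the conversion has a second output.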
One final note about all this. You and others are constantly raising the
spectre of there being a "slippery slope" here that we have to avoid: Once we
allow XXX (presentation information, language tags, take your pick) the doors
will open and all of HTML will end up as a charset, and there's the seventh
seal blown open right there. (I'm exaggerating here, of course, although your
tone sometimes makes me wonder.)
I must say that I for one have no difficulty believing that this is a real
issue for, say, the UTC and the ISO. I'm sure the UTC has seen all sorts of
proposals that attempt to turn Unicode into HTML. Or maybe even PostScript! For
this reason I have no difficulty believing that the UTC has to fight this sort
of stuff off constantly or there will be real trouble for them.
However, that doesn't mean it is a valid issue for the IETF. For one thing,
history says otherwise. The IETF has had a largely uncontrolled charset
registration process in place for well over 5 years now. And a bunch of stuff
has been registered which at a minimum should be marked as "unsuitable for use
in MIME text/plain". Yet in spite of this chaotic history I am unaware of anyone
registering a charset that includes, say, general font-switching machinery.
(And it isn't like similar machinery doesn't already exist in ANSI X3.4 under
the general rubric of "control character", BTW.)
In fact the problem the IETF has had with plain text is the exact opposite of
this: We've seen widespread usage where plain text was taken to mean "only the
graphic symbols matter and the rest is trash and should be ignored and yes,
this means you have to reformat everything to fit your display, and yes, when
you then send code or tables through as plain text this reformatting makes it
look like shit".
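
To make that failure concrete, here is a minimal Python sketch of a display
routine that treats line breaks as trash and refills the text to its own
width. The function name and the sample table are made up for illustration;
this is not taken from any particular product.

```python
import textwrap

# Made-up sample data: a two-column table sent as plain text.
table = (
    "item      qty\n"
    "apples    3\n"
    "pears     12\n"
)

def reflow_ignoring_line_breaks(text: str, width: int = 40) -> str:
    # Collapse every run of whitespace, line breaks included, to one space,
    # then refill the result to the display width.
    return textwrap.fill(" ".join(text.split()), width=width)

reflowed = reflow_ignoring_line_breaks(table)
print(reflowed)  # item qty apples 3 pears 12
```

The graphic symbols all survive, but the rows and columns do not.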
In other words, while you may believe that the IETF definition of "character"
included "control character" all along, a fair number of other people
effectively did not and worse, acted on this belief, and worse still, their
actions made it into some widely used products. And the result has been serious
trouble and serious interoperability problems -- so much so that I had to
tighten up the prose in the last go-round on MIME to make it clear that _some_
presentation information is present in plain text, when it is there it has to
be acted on, and when it isn't nothing should be done. But I didn't fix the
definition of "charset" to match this, so we now have a standard that says one
thing in one place and another in another place, which isn't acceptable and is
going to have to change.
In other words, I wish you'd stop waving the "font bogey" around, as I don't
think it has any real relevance in the IETF.
Ned