RE: CHARSET considerations

To: [email protected], [email protected], [email protected]
Subject: RE: CHARSET considerations
From: Rick Troth <[email protected]>
Date: Tue, 18 May 1993 15:31:15 -0500 (CDT)
Cc: [email protected]
In-reply-to: Message of Fri, 14 May 93 17:29:24 -0400 from <[email protected]>
Resent-message-id: <[email protected]>

On Fri, 14 May 93 17:29:24 -0400 Steve said:
>In <[email protected]>, Rick wrote:
>>         Any user of Pine 3.05 (and as far as I can tell 3.07 or 2.x)
>> can shoot themself in the foot  (head if you prefer)  by setting
>> character-set = Zeldas_private_codepage.
>
>This is almost certainly a bad idea,   ...

        Although I used this to defend my action of having used an
illegitimate CHARSET,  I do  NOT  think that all  "user can shoot
themself in the foot"  features are bad.   Specifically,  I feel
(quite strongly)  that the user should be able to specify any old
charset and have display at least attempted at the other end.

        The long term solution is,  of course,  to map between
"character sets"  (which the user should have control over)  and
"charsets"  (which the user should leave alone).

        My only request of Pine from all this noise is that Pine
NOT LABEL  messages of  Content-Type:  text/plain.
(this may be counter to RFC 1341;  is it?)

>> Should the Pine developers remove this feature?

        No.

>                  charset is an octet-based encoding used during
>message transfer; it need bear no relation to the composing or
>viewing character sets.

        Right.   I maintain that CHARSET specification should be
omitted when feasible.   This is because there are such things as
gateways which translate the SMTP octet-stream into anything.

        There are two goals:  1)  to be able to specify new and/or
extended character sets  (and mark-ups and other extensions to plain text)
and  2)  to use  "plain text"  (in mail)  as a transport medium.
For the former,  use  Base64  encoding when needed.   For the latter,
don't label the text  "ASCII"  or any other codepoint mapping if there's
any way on earth that it might get translated by a gateway.
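
To illustrate the first goal, here is a minimal Python sketch of how
Base64 lets 8-bit text cross a path that is only safe for 7-bit octets.
The sample string and the helper names (to_seven_bit, from_seven_bit)
are assumptions made up for this example; nothing here describes what
Pine or any particular mailer actually does.

```python
import base64

def to_seven_bit(text: str, charset: str) -> bytes:
    """Encode text in the named charset, then Base64 it for transport."""
    return base64.b64encode(text.encode(charset))

def from_seven_bit(payload: bytes, charset: str) -> str:
    """Reverse the transport encoding and decode with the same charset."""
    return base64.b64decode(payload).decode(charset)

sample = "café"                      # hypothetical text with one 8-bit character
payload = to_seven_bit(sample, "iso-8859-1")

# Every octet of the Base64 payload fits in 7 bits, so a transport that
# only passes 7-bit data cannot damage the high-bit character.
seven_bit_clean = all(octet < 0x80 for octet in payload)
restored = from_seven_bit(payload, "iso-8859-1")
```

The payload still only means something if the receiver knows which
charset was encoded inside it, which is why the label matters for this
goal and not for the other.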

        I don't think this is making sense and I can't find the words.
Steve apparently has:   charset -vs- character_set.

        Plain text  is defined differently from system to system.
On UNIX,  plain text is ASCII (now ISO-8859-1) with lines delimited by
NL (actually LF).   On NT,  plain text is 16 bits wide  (so I hear).
That ain't ASCII,  though we could ignore the high-order 8 bits for much
of plain text processing,  and that's fine by me.   (memory is cheap)
On VM/CMS,  plain text is EBCDIC (now CodePage 1047) and records are
handled by the filesystem out-of-band of the data,  so NL (and LF and CR)
aren't sacred characters.   Now ... "mail is plain-text,  not ASCII".
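
A small Python sketch of the point: the same plain text is a different
octet string on an ASCII host and on an EBCDIC host, and a gateway that
translates between them leaves the text intact while changing every
octet.  Assumptions: Python's standard library has no CodePage 1047
codec, so cp037 (another EBCDIC code page) stands in for it, and
gateway_ascii_to_ebcdic is a hypothetical name, not a real gateway.

```python
text = "Hello"

ascii_octets = text.encode("ascii")    # plain text as a UNIX host stores it
ebcdic_octets = text.encode("cp037")   # stand-in for CodePage 1047 on VM/CMS

def gateway_ascii_to_ebcdic(octets: bytes) -> bytes:
    """Hypothetical gateway: re-express ASCII octets as EBCDIC octets."""
    return octets.decode("ascii").encode("cp037")

translated = gateway_ascii_to_ebcdic(ascii_octets)

# After translation the octets are no longer ASCII, but decoding them
# with the local code page recovers exactly the same plain text.
still_plain_text = translated.decode("cp037") == text

# Line structure differs too: UNIX marks line ends in-band with LF,
# while a record-oriented filesystem keeps each line as its own record.
unix_file = "first line\nsecond line\n"
cms_records = unix_file.splitlines()   # no LF, CR or NL octets in the data
```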

>                         In the most general case, a message will
>be composed using some native character set, translated
>automatically to a MIME-registered charset, and translated at the
>other end into a native display character set.

        Right!   99 times out of 100 you don't care,  but there's that
1% of the time when you've called it  US-ASCII  and it's  NOT anymore,
although it  *is*  still legitimate  "plain text".
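
The general case Steve describes can be sketched in a few lines of
Python.  The three code pages are assumptions picked for illustration
(cp037 for the composing host, ISO-8859-1 as the registered transfer
charset, cp437 for the displaying host), and translate is a hypothetical
helper, not part of any mailer.

```python
def translate(octets: bytes, source: str, target: str) -> bytes:
    """Re-encode octets from one character encoding to another."""
    return octets.decode(source).encode(target)

message = "café"                                   # hypothetical message text

composed = message.encode("cp037")                 # sender's native character set
on_the_wire = translate(composed, "cp037", "iso-8859-1")      # labelled charset
displayed = translate(on_the_wire, "iso-8859-1", "cp437")     # receiver's native set

# All three octet strings differ, yet each decodes to the same text
# when read with the encoding that applies at that stage.
same_text = displayed.decode("cp437") == message
```

The charset label is only true for on_the_wire.  If some later hop
translates the octets again without updating the label, the label is
wrong even though the text is still legitimate plain text.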

>           (You'll notice that I reinforce this distinction in my
>own head and in this message by using the terms "character set"
>and "charset" noninterchangeably.)

        Thanks.   That helps.

>The charset situation is much like the canonical CRLF situation:
>the fact that the canonical representation is identical to some
>but not all of the available local representations guarantees
>misunderstandings.

        Right!   And this thinking,  carried into MIME  (thus this
should be kicked BACK TO the IETF-822 list,  but I refrain),  shows up
in the use of  CHARSET=ISO-8859-1  rather than  CHARACTER_SET=Latin-1.
If you specify  "Latin-1",  then you can  (must;  I'm arguing for a
definition here,  not an explanation)  assume that  SMTP  will carry it
as ISO-8859-1,  BUT THE RECEIVING  (or sending)  HOST MIGHT NOT.
(and yes,  sad but true,  many SMTPs will strip the high bit)
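
What high-bit stripping does to ISO-8859-1 text can be shown with a short
Python sketch.  The function name strip_high_bit and the sample word are
assumptions for the example; the masking itself is just what a 7-bit
transport does to each octet.

```python
def strip_high_bit(octets: bytes) -> bytes:
    """Clear bit 7 of every octet, as a 7-bit-only transport would."""
    return bytes(octet & 0x7F for octet in octets)

original = "café"                        # hypothetical Latin-1 text
wire = original.encode("iso-8859-1")     # e-acute is the single octet 0xE9
damaged = strip_high_bit(wire)

# 0xE9 with the high bit cleared is 0x69, the ASCII letter "i", so the
# receiver sees a plausible but wrong word and no error is reported.
print(damaged.decode("ascii"))           # prints: cafi
```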

>To be sure, automated selection of and translation to a registered
>MIME charset is a non-trivial task,   ...

        Yes.   Which is why I want  routers, gateways,  and all  MTAs
(mail transfer agents)  to stay out of it.   That's why I ask that
(today,  1993)  we  NOT LABEL  true plain text as  US-ASCII/ISO-8859-1.
Just leave it alone and let it default at the receiving end.

>                                    and mailers which are trying
>to adopt MIME right away cannot be faulted for deferring
>development of such functionality for a while.

        And let me reiterate that I'm not mad at the Pine developers
(nor the MIME developers;  not mad at anyone,  just trying to push a
point that I think is important and has been missed).   I'm very pleased
with Pine.   It can almost replace RiceMAIL.

        Steve,  it's obvious from your distinction between character set
(set of characters)  and  charset  (encoding of characters)  that you
understand this issue.   Thanks for making up and using those labels!

>					Steve Summit
>					[email protected]

--
Rick Troth <[email protected]>,  Rice University,  Information Systems

Follow-Ups:
- RE: CHARSET considerations
  - From: Masataka Ohta <[email protected]>
- RE: CHARSET considerations
  - From: Harald Tveit Alvestrand <[email protected]>
- RE: CHARSET considerations
  - From: John C Klensin <[email protected]>
