RE: Charset policy - Post Munich
I think Ned is completely correct here. The workshop report thought long
and hard about requiring language tagging and mandatory UTF-8 and
realized that this is the only way to make things work with the stupid
machines we have now :^)
Chris
> -----Original Message-----
> From: Ned Freed [SMTP:Ned.Freed@innosoft.com]
> Sent: Sunday, August 31, 1997 1:07 PM
> To: ietf-charsets@innosoft.com
> Subject: Re: Charset policy - Post Munich
>
> > > 3.1. What charset to use
> > >
> > > All protocols MUST identify, for all character data, which
> > > charset is in use.
> > >
> > > Protocols MUST be able to use the UTF-8 charset, which consists
> > > of the ISO 10646 coded character set combined with the UTF-8
> > > character encoding scheme, as defined in [10646] Annex R
> > > (published in Amendment 2), for all text.
> > >
> > > They MAY specify how to use other charsets or other character
> > > encoding schemes for ISO 10646, such as UTF-16, but lack of an
> > > ability to use UTF-8 needs clear and solid justification in the
> > > protocol specification document before being entered into or
> > > advanced upon the standards track.
>
> > The above two paragraphs contradict each other. You can't have
> > a MUST and then a MAYbe not on the same point. Either make the
> > first a SHOULD, or make a MUST for ISO 10646/Unicode, and then
> > a SHOULD for UTF-8.
>
> I fail to see a contradiction here. A protocol must be able to handle
> UTF-8 if it handles character data. A protocol may elect to handle
> other charsets as well, possibly including one derived from other
> transformation formats of Unicode.
>
> What I do see here is poor ordering of what is being proposed. I
> suggest that it instead say:
>
>    Protocols MUST be able to use the UTF-8 charset, which consists of
>    the ISO 10646 coded character set combined with the UTF-8
>    character encoding scheme, as defined in [10646] Annex R
>    (published in Amendment 2), for all text. Any exceptions
>    must be fully justifiable and the justification must be given in
>    the protocol specification. A protocol which neither supports
>    UTF-8 nor justifies its use of some other charset MUST NOT be
>    entered on the standards track.
>
>    Protocols MAY also specify how to use other charsets or other
>    character encoding schemes for ISO 10646, such as UTF-16. As
>    always, any protocol that elects to support more than one charset
>    MUST provide a field to label which charset is being used.
>
> In any case, I'm somewhat opposed to weakening the UTF-8 support
> requirement to a SHOULD. It really is a MUST and needs to be stated
> as such. We can always make exceptions to a MUST on a case by case
> basis if need be. I doubt very much that there will be that many of
> them.
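
The labeling rule in the proposed wording can be illustrated with a minimal Python sketch. The `LabeledText` field and the two helper functions are hypothetical names invented for this illustration; they do not come from the draft or from any existing protocol. The sketch assumes UTF-8 is the charset every implementation supports, and that the charset label always travels with the bytes so the receiver never has to guess.

```python
from dataclasses import dataclass

# Assumption for this sketch: UTF-8 is the charset every peer can handle.
REQUIRED_CHARSET = "utf-8"


@dataclass(frozen=True)
class LabeledText:
    """Hypothetical wire field: the text bytes plus an explicit charset label."""
    charset: str
    data: bytes


def encode_text(text: str, charset: str = REQUIRED_CHARSET) -> LabeledText:
    # The label is always present, even when the required charset is used.
    return LabeledText(charset=charset, data=text.encode(charset))


def decode_text(field: LabeledText) -> str:
    # The receiver decodes according to the label, never by inspecting bytes.
    return field.data.decode(field.charset)


utf8_field = encode_text("Grüße aus München")
utf16_field = encode_text("Grüße aus München", charset="utf-16")
```

Both fields decode to the same string even though their byte sequences differ, which is why a protocol that allows more than one charset needs the label.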
>
> > > In most cases, machines cannot deduce the language of a
> > > transmitted text by themselves;
>
> > This is not true. There is enough evidence that for any given
> > set of languages, it is possible to devise or generate software
> > that identifies the language with accuracy converging to 100%
> > as the length of the text increases, and as the amount of
> > effort (e.g. table/dictionary size,...) increases. And once
> > this effort is done, the gap between what humans can find out
> > and what machines can find out is small.
>
> Everything you say may be true, but it doesn't disprove Harald's
> statement. Yes, you may be able to build a machine that deduces
> language with precision approaching 100% as the amount of text
> increases. However, you have not demonstrated that:
>
> (0) Enough text is always going to be available to make this possible.
> (1) The 100% point is actually reached. (Convergence to 100% is not
>     the same thing, and in some cases 100% is the only acceptable
>     answer.)
> (2) The set of languages we use is always closed.
> (3) Machines in the real world will universally be retrofitted to have
>     this capability.
>
> (0), (2), and (3) are in fact demonstrably false. As such, I claim
> Harald's statement, which you should note didn't say that machine
> recognition isn't possible, but only that most machines aren't
> capable of it right now, is correct.
>
> Moreover, the point here, that machine recognition of the language
> being used cannot be relied upon, is a damned important one that
> should not be left out. I really want to forestall finding
> domain-->language tag tables in some product somewhere. As such, to
> forestall further argument I suggest that the paragraph be reworded
> to say that at the present time most machines lack the facilities to
> deduce language from content.
>
> > Please note that language information as such is not needed
> > for the end user; humans have no problem identifying the
> > languages they know and separating them from those they
> > don't know.
>
> This point, on the other hand, is demonstrably false since I have a
> specific counterexample of my own to offer. I routinely deal with
> customers in over 50 countries, quite a few of which either use
> multiple languages or else don't have domain names that let me deduce
> country and hence probable language. And I occasionally receive
> messages from these places written in a language other than English,
> French or Spanish, hence outside my admittedly limited linguistic
> skills and the limited dictionary set I keep handy.
>
> And moreover, I sometimes cannot figure out what language is being
> used. (A lot of the ones I get look like German to me but aren't. Hey,
> what can I say, my education in this regard was terrible.) And this
> actually matters to me, since depending on the language I'll take the
> message to different people in the office or else forward it to
> various people I know for translation. A language tag would certainly
> help me in these cases, although I regret to say that such tags are
> rarely used in practice.
>
> The basic problem here is that you're assuming communication is
> between people who know each other. This is usually but not always the
> case, and when it isn't these tags may actually be useful to a human
> reader.
>
> I therefore do not support the addition of this text, as it will
> inevitably lead to cases where language tags will be omitted when they
> could have been useful.
>
> > Please note that languages are not as clearcut a concept as
> > character sets. There are mixtures of languages, language
> > variants, words that move from one language to another,
> > and text parts that are not in any particular language.
>
> This is a good point and one that needs to be made.
>
> > > 4.2. Requirement for language tagging
> > >
> > > Protocols that transfer text MUST provide for carrying
> > > information about the language of that text.
>
> > This is most probably too strong.
>
> > What about:
>
> > Protocols that transfer text MUST provide for carrying language
> > information to the extent and in the granularity that this is
> > necessary and appropriate for the operations that the text in
> > the protocol is generally intended and used for.
>
> This, on the other hand, is too wishy-washy. We need these tags and we
> need for them to be used a lot more than they currently are. What we
> do not need is to have lots of debates about whether or not a given
> protocol needs such a field. It is far better to have fields we end up
> not using than to need fields we do not have.
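
For mail, one existing place to carry such a field is the Content-Language header defined alongside the RFC 1766 language tag. The following is a minimal sketch using Python's standard `email` package; the message text and the choice of the tag `de` are made up for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()
# Label the charset of the body explicitly.
msg.set_content("Guten Tag, ich habe eine Frage zu Ihrem Produkt.",
                charset="utf-8")
# Set the language tag after set_content(), because set_content()
# clears any existing Content-* headers before writing the body.
msg["Content-Language"] = "de"

language_tag = msg["Content-Language"]
body_charset = msg.get_content_charset()
```

A recipient who cannot read the body can still route the message to someone who can by looking at `language_tag`, which is the use case described in the reply above.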
>
> > > Protocols SHOULD also provide for carrying information about
> > > the language of names.
>
> > Do you seriously want to suggest that we devise some kind of
> > language-tag syntax for URLs, Email addresses, host names, and
> > so on?
>
> Here I agree that the present document goes too far. Name languages
> are _incredibly_ tricky stuff -- if you think words move around a bit,
> you should see, say, the Korean-American phone book for the greater LA
> area!
>
> I think this needs to be dropped entirely.
>
> > > 4.3. How to identify a language
> > >
> > > The RFC 1766 language tag is at the moment the most flexible
> > > tool available for identifying a language; protocols SHOULD use
> > > this, or provide clear and solid justification for doing
> > > otherwise in the document.
> > >
> > > In particular, claiming that a language can be deduced from the
> > > charset in use is erroneous and will not be accepted.
>
> > Correct. But isn't this all too obvious, given things like
> > iso-8859-1? I don't think you need this in any way to be able
> > to reject such claims should they ever come up.
>
> Well, it may be true that everyone knows you cannot deduce language
> from iso-8859-1. But what about iso-2022-jp?
>
> The point here is that claims of a _limited_ ability to deduce
> language from _some_ charsets have in fact been made, and we need
> language that says such claims are unacceptable no matter what.
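
A short Python sketch shows why the charset label says little about the language. The sample sentences are invented for the illustration. It relies on two facts: iso-8859-1 covers many Western European languages, and iso-2022-jp can switch between ASCII and JIS X 0208, which contains Cyrillic and Greek letters in addition to kana and kanji.

```python
# Three languages, one charset label: iso-8859-1.
latin1_samples = {
    "fr": "Où est la bibliothèque ?",
    "de": "Wo ist die Bücherei?",
    "es": "¿Dónde está la biblioteca?",
}
latin1_bytes = {tag: text.encode("iso-8859-1")
                for tag, text in latin1_samples.items()}

# The same holds for iso-2022-jp: Japanese is the usual payload,
# but English and Russian text encode without error as well.
iso2022jp_samples = {
    "ja": "日本語のテキスト",
    "en": "Plain English text",
    "ru": "Привет",
}
iso2022jp_bytes = {tag: text.encode("iso2022_jp")
                   for tag, text in iso2022jp_samples.items()}

# Decoding by charset label recovers the text, not the language.
roundtrip = {tag: data.decode("iso2022_jp")
             for tag, data in iso2022jp_bytes.items()}
```

In both cases the receiver learns how to turn bytes into characters and nothing more; the language has to come from a separate tag.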
>
> > > 4.4. Considerations for negotiation
>
> > Please say "language negotiation".
>
> Agreed.
>
> > > Protocols where users have text presented to them in response
> > > to user actions MUST provide for multiple languages.
>
> > This is too sweeping. Some people could think that it means that
> > a protocol must provide at least two languages, or that every
> > implementation has to provide multiple languages.
>
> > Please say something like:
>
> > Protocols where users have text presented to them in response
> > to user actions MUST provide the means by which implementors
> > can satisfy the language needs of the users.
>
> I have no problem with this.
>
> > > In some cases, a negotiation where the client proposes a set of
> > > languages and the server replies with one is appropriate; in
> > > other cases, supplying information in all available languages is
> > > a better solution; most sites will either have very few languages
> > > installed or be willing to pay the overhead of sending error
> > > messages in many languages at once.
>
> > I don't agree. There may be only few sites that have many
> > languages available, but those may be contacted by users
> > with special language needs that can't afford the bandwidth
> > (even if the server side providing these many languages has
> > no problem with the bandwidth).
>
> So what? Harald didn't say that implementations have to provide
> responses in multiple languages, merely that providing responses in
> multiple languages is a viable approach. And it is viable -- indeed, I
> have customers that require it.
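
The two approaches under discussion, negotiating a single language or sending every available language at once, can be sketched in a few lines of Python. The function names, the `ERRORS` table, and the bracketed output format are all hypothetical; the draft does not prescribe any of them. Tags are compared case-insensitively and only exact matches count, which is a simplification.

```python
from typing import Dict, Iterable, List, Optional


def negotiate_language(client_prefs: List[str],
                       server_langs: Iterable[str]) -> Optional[str]:
    """Return the first client preference the server can satisfy, else None."""
    available = {tag.lower() for tag in server_langs}
    for tag in client_prefs:
        if tag.lower() in available:
            return tag.lower()
    return None  # the caller decides what to do when nothing matches


def all_languages(messages: Dict[str, str]) -> str:
    """Alternative approach: send the text in every language, each one tagged."""
    return "\n".join(f"[{tag}] {text}"
                     for tag, text in sorted(messages.items()))


# Hypothetical error text installed on a server in three languages.
ERRORS = {
    "en": "Mailbox not found",
    "fr": "Boîte aux lettres introuvable",
    "de": "Postfach nicht gefunden",
}

chosen = negotiate_language(["sv", "FR", "en"], ERRORS)
no_match = negotiate_language(["ja"], ERRORS)
everything = all_languages(ERRORS)
```

Negotiation costs a round of preference exchange but keeps the response small; sending everything avoids negotiation at the price of bandwidth, which is the trade-off argued over above.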
>
> > Also, there is an increasing tendency for products to ship
> > with all language versions integrated. For a NS or MS server,
> > you won't buy a specific language version anymore very soon
> > in the future.
>
> I fail to see the point here.
>
> > > Negotiation is useful in the case where one side of the
> > > protocol exchange is able to present text in multiple languages
> > > to the other side, and the other side has a preference for one of
> > > these; the most common example is the text part of error
> > > responses, or Web pages that are available in multiple languages.
>
> > The "one side is able" is somewhat dangerous here. A WG may
> > just come and tell you: Our servers all just do English,
> > they are not able to do anything else, so this doesn't apply.
>
> The reality is that implementations are going to do this whether we
> like it or not. We can require what we like of implementations in
> terms of support of multiple languages and we'll just be ignored.
>
> In other words, there's a real danger here, but it isn't something we
> can do much of anything about, and as such this clause is almost
> entirely harmless.
>
> > > 4.5. Default Language
>
> > > When human-readable text must be presented in a context where
> > > the sender has no knowledge of the recipient's language
> > > preferences (such as login failures or E-mailed warnings, or
> > > prior to language negotiation), text SHOULD be presented in
> > > Default Language.
>
> > > The Default Language is English, since this is the language
> > > which most people will be able to get adequate help in
> > > interpreting when working with computers.
>
> > It may be a good idea to replace "most people" by "the greatest
> > number of people". This is a sensitive spot, and "most people" is
> > saying something about their absolute percentage, whereas we just
> > need to say that it is better than any other language we could pick.
>
> Agreed.
>
> > > Note that negotiating English is NOT the same as Default
> > > Language; Default Language is an emergency measure in otherwise
> > > unmanageable situations. It may be appropriate for application
> > > designers to make sure that messages in Default Language are
> > > understandable to people with a limited understanding of the
> > > English language.
>
> > The following is implicit here, but has led to prolonged discussions
> > on some lists:
>
> > What I think the text above says is that it's not permitted to
> > say: "If the client doesn't negotiate language, this defaults to
> > English (or whatever other "default" language)."
>
> > If this is the case, it would be better to explicitly state:
>
> > Protocols MUST NOT define a default language to avoid language
> > negotiation; language MUST be explicitly negotiated for all
> > languages.
>
> > I think it's better to make this clear, if this is what is desired,
> > and something else otherwise, than to have more such discussions.
>
> Agreed.
>
> > > 5. Locale
>
> > > In some cases, and especially with text where the user is
> > > expected to do processing on the text, locale information may be
> > > usefully attached to the text; this would identify the sender's
> > > opinion about appropriate rules to follow when processing the
> > > document, which the recipient may choose to agree with or ignore.
> > >
> > > This document does not require the communication of locale
> > > information on all text, but encourages its inclusion when
> > > appropriate.
>
> > The above is not very clearcut, but there is probably nothing
> > better in sight.
>
> Agreed.
>
> > Please add something like the following:
>
> > 6. Documentation
>
> > Protocols MUST appropriately document the decisions they have
> > taken with respect to charsets, language information, and other
> > aspects related to internationalization and multilinguality.
> > A format such as that currently used for Security Issues is
> > (highly) recommended.
>
> I would add that they must document their rationale as well as the
> decisions.
>
> > Another thing, which should probably go into section 2 or so,
> > and which seems needed as a response to some of the questions
> > in the plenary in Munich, is a clarification of which protocol
> > in a protocol stack is responsible for charset and language
> > information. I'm not sure that I have found the best way
> > to express this, but it could read as follows:
>
> > Note that in a protocol stack, it is the responsibility of
> > the highest layer that uses the text to appropriately label
> > it. As an example, it is the responsibility of the standard
> > for mail messages to assure things get correctly labeled in
> > mail messages, even if those are sent over SMTP. It is the
> > responsibility of SMTP to correctly label text which is
> > exchanged as part of the SMTP protocol and is intended for
> > end-user consumption, even if SMTP is run over TCP/IP.
> > It would be the responsibility of IP to label text correctly
> > if it ever would consider using text in its protocol elements
> > (as opposed to transporting text in its payload).
>
> I agree that this is an important point. I also think this is as
> good an attempt as I've seen to describe the requirements in this
> area.
>
> Ned