
Re: Charset policy - Post Munich



> >    3.1.  What charset to use
> >
> >    All protocols MUST identify, for all character data, which charset
> >    is in use.
> >
> >    Protocols MUST be able to use the UTF-8 charset, which consists of
> >    the ISO 10646 coded character set combined with the UTF-8
> >    character encoding scheme, as defined in [10646] Annex R
> >    (published in Amendment 2), for all text.
> >
> >    They MAY specify how to use other charsets or other character
> >    encoding schemes for ISO 10646, such as UTF-16, but lack of an
> >    ability to use UTF-8 needs clear and solid justification in the
> >    protocol specification document before being entered into or
> >    advanced upon the standards track.

> The above two paragraphs contradict each other. You can't have
> a MUST and then a MAYbe not on the same point. Either make the
> first a SHOULD, or make a MUST for ISO 10646/Unicode, and then
> a SHOULD for UTF-8.

I fail to see a contradiction here. A protocol must be able to handle UTF-8
if it handles character data. A protocol may elect to handle other charsets as
well, possibly including ones derived from other transformation formats of
Unicode.

What I do see here is poor ordering of what is being proposed. I suggest
that it instead say:

    Protocols MUST be able to use the UTF-8 charset, which consists of
    the ISO 10646 coded character set combined with the UTF-8
    character encoding scheme, as defined in [10646] Annex R
    (published in Amendment 2), for all text. Any exceptions
    must be fully justifiable and the justification must be given in the
    protocol specification. A protocol which neither supports UTF-8 nor
    justifies its use of some other charset MUST NOT be entered on the
    standards track.

    Protocols MAY also specify how to use other charsets or other character
    encoding schemes for ISO 10646, such as UTF-16. As always, any protocol
    that elects to support more than one charset MUST provide a field to
    label which charset is being used.
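
A minimal sketch of what the labeling requirement means for a receiver,
assuming a hypothetical protocol whose messages carry an optional charset
field. The function name and the choice to treat a missing label as UTF-8
are illustrative assumptions, not part of the proposed text:

```python
from typing import Optional

# UTF-8 is the one charset every protocol is required to support.
REQUIRED_CHARSET = "utf-8"


def decode_labeled(payload: bytes, charset_label: Optional[str] = None) -> str:
    """Decode payload according to its charset label.

    Assumption for this sketch: a message with no label is UTF-8.
    """
    charset = (charset_label or REQUIRED_CHARSET).strip().lower()
    try:
        return payload.decode(charset)
    except LookupError:
        # The label names a charset this implementation does not know.
        raise ValueError(f"unsupported charset label: {charset_label!r}")
```

The label travels with the data, so the receiver never has to guess which
of the supported charsets is in use.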

In any case, I'm somewhat opposed to weakening the UTF-8 support requirement to
a SHOULD. It really is a MUST and needs to be stated as such. We can always
make exceptions to a MUST on a case by case basis if need be. I doubt very much
that there will be that many of them.

> >    In most cases, machines cannot deduce the language of a
> >    transmitted text by themselves;

> This is not true. There is enough evidence that for any given
> set of languages, it is possible to devise or generate software
> that identifies the language with accuracy converging to 100%
> as the length of the text increases, and as the amount of
> effort (e.g. table/dictionary size,...) increases. And once
> this effort is done, the gap between what humans can find out
> and what machines can find out is small.

Everything you say may be true, but it doesn't disprove Harald's statement.
Yes, you may be able to build a machine that deduces language with precision
approaching 100% as the amount of text increases. However, you have not
demonstrated that:

(0) Enough text is always going to be available to make this possible.
(1) The 100% point is actually reached. (Convergence to 100% is not
    the same thing, and in some cases 100% is the only acceptable answer.)
(2) The set of languages we use is always closed.
(3) Machines in the real world will universally be retrofitted to have this
    capability.

(0), (2), and (3) are in fact demonstrably false. As such, I claim Harald's
statement, which you should note didn't say that machine recognition isn't
possible, but only that most machines aren't capable of it right now, is
correct.

Moreover, the point here, that machine recognition of the language being used
cannot be relied upon, is a damned important one that should not be left out. I
really want to forestall finding domain-->language tag tables in some product
somewhere. As such, to forestall further argument I suggest that the paragraph
be reworded to say that at the present time most machines lack the facilities
to deduce language from content.
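
To make points (0) and (2) concrete, here is a toy sketch of a
stopword-counting guesser. The word lists, the threshold and the function
name are all invented for illustration; a real recognizer would be far more
elaborate, but it would face the same limits on short or out-of-set input:

```python
from typing import Optional

# Hypothetical, deliberately tiny stopword profiles (a closed set of languages).
PROFILES = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "la", "et", "est", "de"},
    "es": {"el", "la", "y", "es", "de"},
}


def guess_language(text: str, min_hits: int = 2) -> Optional[str]:
    """Return a language code, or None when the evidence is too thin or tied."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words) for lang, stop in PROFILES.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, runner_up) = ranked[0], ranked[1]
    if best_score < min_hits or best_score == runner_up:
        return None  # too little text, a tie, or a language outside the set
    return best
```

With these profiles a full English sentence is recognized, but "OK" gives
no answer, "la casa de papel" ties between French and Spanish, and a Danish
sentence gives no answer because Danish is not in the set.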

>    Please note that language information as such is not needed
>    for the end user; humans have no problem identifying the
>    languages they know and separating them from those they
>    don't know.

This point, on the other hand, is demonstrably false since I have a specific
counterexample of my own to offer. I routinely deal with customers in over 50
countries, quite a few of which either use multiple languages or else don't
have domain names that let me deduce country and hence probable language. And I
occasionally receive messages from these places written in a language other
than English, French or Spanish, hence outside my admittedly limited linguistic
skills and the limited set of dictionaries I keep handy.

And moreover, I sometimes cannot figure out what language is being used. (A lot
of the ones I get look like German to me but aren't. Hey, what can I say, my
education in this regard was terrible.) And this actually matters to me, since
depending on the language I'll take the message to different people in the
office or else forward it to various people I know for translation. A language
tag would certainly help me in these cases, although I regret to say that such
tags are rarely used in practice.

The basic problem here is that you're assuming communication is between people
who know each other. This is usually but not always the case, and when it isn't
these tags may actually be useful to a human reader.

I therefore do not support the addition of this text, as it will inevitably
lead to cases where language tags will be omitted when they could have been
useful.

>    Please note that languages are not as clearcut a concept as
>    character sets. There are mixtures of languages, language
>    variants, words that move from one language to another,
>    and text parts that are not in any particular language.

This is a good point and one that needs to be made.

> >    4.2.  Requirement for language tagging
> >
> >    Protocols that transfer text MUST provide for carrying information
> >    about the language of that text.

> This is most probably too strong.

> What about:

> Protocols that transfer text MUST provide for carrying language
> information to the extent and in the granularity that this is
> necessary and appropriate for the operations that the text in
> the protocol is generally intended and used for.

This, on the other hand, is too wishy-washy. We need these tags and we need
for them to be used a lot more than they currently are. What we do not
need is to have lots of debates about whether or not a given protocol
needs such a field. It is far better to have fields we end up not using
than to need fields we do not have.
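
As a rough sketch of how little such a field costs, here is a text item
that always carries a language tag. The class name is hypothetical; the
tag syntax assumed is that of RFC 1766, which section 4.3 of the draft
recommends (a primary tag plus optional subtags, each 1 to 8 letters):

```python
import re
from dataclasses import dataclass

# RFC 1766 syntax: 1 to 8 letters, then zero or more "-" plus 1 to 8 letters.
_RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


@dataclass(frozen=True)
class TaggedText:
    """A piece of protocol text together with its language tag."""

    text: str
    language: str  # for example "en" or "fr-CA"

    def __post_init__(self) -> None:
        if not _RFC1766_TAG.fullmatch(self.language):
            raise ValueError(f"not an RFC 1766 language tag: {self.language!r}")
```

A sender that has nothing useful to say about the language can still leave
the decision to a higher layer; a protocol without the field cannot carry
the tag at all.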

> >    Protocols SHOULD also provide for carrying information about the
> >    language of names.

> Do you seriously want to suggest that we devise some kind of
> language-tag syntax for URLs, Email addresses, host names, and
> so on?

Here I agree that the present document goes too far. Name languages
are _incredibly_ tricky stuff -- if you think words move around a bit, you
should see, say, the Korean-American phone book for the greater LA area!

I think this needs to be dropped entirely.

> >    4.3.  How to identify a language
> >
> >    The RFC 1766 language tag is at the moment the most flexible tool
> >    available for identifying a language; protocols SHOULD use this,
> >    or provide clear and solid justification for doing otherwise in
> >    the document.
> >
> >    In particular, claiming that a language can be deduced from the
> >    charset in use is erroneous and will not be accepted.

> Correct. But isn't this all too obvious, given things like
> iso-8859-1? I don't think you need this in any way to be able
> to reject such claims should they ever come up.

Well, it may be true that everyone knows you cannot deduce language
from iso-8859-1. But what about iso-2022-jp?

The point here is that claims of a _limited_ ability to deduce language
from _some_ charsets have in fact been made, and we need language that
says such claims are unacceptable no matter what.
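
A quick way to see why even the limited claim fails, using Python's
standard codecs. The sample sentences and the helper name are mine, chosen
only for illustration:

```python
def round_trips(text: str, charset: str) -> bool:
    """True if text can be encoded in charset and decoded back unchanged."""
    try:
        return text.encode(charset).decode(charset) == text
    except UnicodeError:
        return False


# The same charset carries text in several different languages.
SAMPLES = {
    "iso8859_1": ["Où est la gare ?", "Wo ist der Bahnhof?", "¿Dónde está la estación?"],
    "iso2022_jp": ["駅はどこですか", "Where is the station?", "Где вокзал?"],
}

for charset, texts in SAMPLES.items():
    for text in texts:
        print(charset, round_trips(text, charset), text)
```

iso-2022-jp text is often Japanese, but the charset also covers ASCII and
the Cyrillic letters of JIS X 0208, so the label alone settles nothing
about the language.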

> >    4.4.  Considerations for negotiation

> Please say "language negotiation".

Agreed.

> >    Protocols where users have text presented to them in response to
> >    user actions MUST provide for multiple languages.

> This is too sweeping. Some people could think that it means that
> a protocol must provide at least two languages, or that every
> implementation has to provide multiple languages.

> Please say something like:

>    Protocols where users have text presented to them in response
>    to user actions MUST provide the means by which implementors
>    can satisfy the language needs of the users.

I have no problem with this.

> >    In some cases, a negotiation where the client proposes a set of
> >    languages and the server replies with one is appropriate; in other
> >    cases, supplying information in all available languages is a
> >    better solution; most sites will either have very few languages
> >    installed or be willing to pay the overhead of sending error
> >    messages in many languages at once.

> I don't agree. There may be only few sites that have many
> languages available, but those may be contacted by users
> with special language needs that can't afford the bandwidth
> (even if the server side providing these many languages has
> no problem with the bandwidth).

So what? Harald didn't say that implementations have to provide responses
in multiple languages, merely that providing responses in multiple languages
is a viable approach. And it is viable -- indeed, I have customers that
require it.
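
The two approaches can sit side by side. The sketch below assumes a
hypothetical server with a small table of error texts: it answers in the
first client-preferred language it has, and otherwise sends every language
it has. The table contents and function names are invented for
illustration:

```python
from typing import Iterable, List, Optional, Sequence

# Hypothetical error text, keyed by language tag.
MESSAGES = {
    "en": "Mailbox not found",
    "fr": "Boîte aux lettres introuvable",
    "es": "Buzón no encontrado",
}


def negotiate(client_prefs: Sequence[str], available: Iterable[str]) -> Optional[str]:
    """Return the first client preference the server can satisfy, if any."""
    by_lower = {tag.lower(): tag for tag in available}
    for pref in client_prefs:
        match = by_lower.get(pref.lower())
        if match is not None:
            return match
    return None


def error_text(client_prefs: Sequence[str]) -> List[str]:
    """One negotiated message, or all available languages when nothing matches."""
    chosen = negotiate(client_prefs, MESSAGES)
    if chosen is not None:
        return [MESSAGES[chosen]]
    return [MESSAGES[tag] for tag in sorted(MESSAGES)]
```

For example, error_text(["de", "FR"]) yields only the French text, while
error_text(["ja"]) yields all three.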

> Also, there is an increasing tendency for products to ship
> with all language versions integrated. For a NS or MS server,
> you won't buy a specific language version anymore very soon
> in the future.

I fail to see the point here.

> >    Negotiation is useful in the case where one side of the protocol
> >    exchange is able to present text in multiple languages to the
> >    other side, and the other side has a preference for one of these;
> >    the most common example is the text part of error responses, or
> >    Web pages that are available in multiple languages.

> The "one side is able" is somewhat dangerous here. A WG may
> just come and tell you: Our servers all just do English,
> they are not able to do anything else, so this doesn't apply.

The reality is that implementations are going to do this whether we
like it or not. We can require what we like of implementations in
terms of support of multiple languages and we'll just be ignored.

In other words, there's a real danger here, but it isn't something we can do
much of anything about, and as such this clause is almost entirely harmless.

> >    4.5.  Default Language

> >    When human-readable text must be presented in a context where the
> >    sender has no knowledge of the recipient's language preferences
> >    (such as login failures or E-mailed warnings, or prior to language
> >    negotiation), text SHOULD be presented in Default Language.

> >    The Default Language is English, since this is the language which
> >    most people will be able to get adequate help in interpreting when
> >    working with computers.

> It may be a good idea to replace "most people" by "the greatest number
> of people". This is a sensitive spot, and "most people" is saying
> something about their absolute percentage, whereas we just need to
> say that it is better than any other language we could pick.

Agreed.

> >    Note that negotiating English is NOT the same as Default Language;
> >    Default Language is an emergency measure in otherwise unmanageable
> >    situations. It may be appropriate for application designers to
> >    make sure that messages in Default Language are understandable to
> >    people with a limited understanding of the English language.

> The following is implicit here, but has led to prolonged discussions
> on some lists:

> What I think the text above says is that it's not permitted to
> say: "If the client doesn't negotiate language, this defaults to
> English (or whatever other "default" language)."

> If this is the case, it would be better to explicitly state:

>    Protocols MUST NOT define a default language to avoid language
>    negotiation; language MUST be explicitly negotiated for all
>    languages.

> I think it's better to make this clear, if this is what is desired,
> and something else otherwise, than to have more such discussions.

Agreed.

> >    5.  Locale

> >    In some cases, and especially with text where the user is expected
> >    to do processing on the text, locale information may be usefully
> >    attached to the text; this would identify the sender's opinion
> >    about appropriate rules to follow when processing the document,
> >    which the recipient may choose to agree with or ignore.
> >
> >    This document does not require the communication of locale
> >    information on all text, but encourages its inclusion when
> >    appropriate.

> The above is not very clearcut, but there is probably nothing
> better in sight.

Agreed.

> Please add something like the following:

>    6. Documentation

>    Protocols MUST appropriately document the decisions they have
>    taken with respect to charsets, language information, and other
>    aspects related to internationalization and multilinguality.
>    A format such as that currently used for Security Issues is
>    (highly) recommended.

I would add that they must document their rationale as well as the
decisions.

> Another thing, which should probably go into section 2 or so,
> and which seems needed as a response to some of the questions
> in the plenary in Munich, is a clarification of which protocol
> in a protocol stack is responsible for charset and language
> information. I'm not sure that I have found the best way
> to express this, but it could read as follows:

>    Note that in a protocol stack, it is the responsibility of
>    the highest layer that uses the text to appropriately label
>    it. As an example, it is the responsibility of the standard
>    for mail messages to assure things get correctly labeled in
>    mail messages, even if those are sent over SMTP. It is the
>    responsibility of SMTP to correctly label text which is
>    exchanged as part of the SMTP protocol and is intended for
>    end-user consumption, even if SMTP is run over TCP/IP.
>    It would be the responsibility of IP to label text correctly
>    if it ever would consider using text in its protocol elements
>    (as opposed to transporting text in its payload).

I agree that this is an important point. I also think this is as 
good an attempt as I've seen to describe the requirements in this area.

				Ned