Re: Charset policy - Post Munich
Hello Harald,
Many thanks for your excellent work.
> Please check this for consistency with previous comments and comments
> made in Munich.
>
> I'll edit based on comments from this list, send out as I-D, wait
> a week or two, and then think about Last Call.
Looks very reasonable. I was already planning to write to you
because I seemed to remember that in Munich, you said something
to the effect that this would go out to last call immediately.
> - Ned's charset registry (draft-freed-charset-reg-02.txt)
> - Francois' updated UTF-8 (draft-yergeau-utf8-rev-00.txt)
I'll see Francois next week in San Jose. Anything we should
discuss?
> 3.1. What charset to use
>
> All protocols MUST identify, for all character data, which charset
> is in use.
>
> Protocols MUST be able to use the UTF-8 charset, which consists of
> the ISO 10646 coded character set combined with the UTF-8
> character encoding scheme, as defined in [10646] Annex R
> (published in Amendment 2), for all text.
>
> They MAY specify how to use other charsets or other character
> encoding schemes for ISO 10646, such as UTF-16, but lack of an
> ability to use UTF-8 needs clear and solid justification in the
> protocol specification document before being entered into or
> advanced upon the standards track.
The above two paragraphs contradict each other. You can't have
a MUST and then a MAYbe not on the same point. Either make the
first a SHOULD, or make a MUST for ISO 10646/Unicode, and then
a SHOULD for UTF-8.
> 4. Languages
>
>
> 4.1. The need for language information
>
> All human-readable text has a language.
>
> Many operations, including high quality formatting, text-to-speech
> synthesis, searching, hyphenation, spellchecking and so on need
> access to information about the language of a piece of text. [WC
> 3.1.1.4].
I would suggest replacing "need access" with "benefit from".
This better expresses the fact that a lot of these things
are also possible without explicit language information,
and that even the presence of language information doesn't
make these things perfect.
> In most cases, machines cannot deduce the language of a
> transmitted text by themselves;
This is not true. There is enough evidence that for any given
set of languages, it is possible to devise or generate software
that identifies the language with accuracy converging to 100%
as the length of the text increases, and as the amount of
effort (e.g. table/dictionary size,...) increases. And once
this effort is done, the gap between what humans can find out
and what machines can find out is small.
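As a toy illustration of this point (not any particular existing software; the word lists and the name `guess_language` are made up for the example, and a realistic identifier would need far larger tables), counting common function words already separates a few languages once the text is long enough:

```python
from typing import Dict, FrozenSet

# Tiny illustrative word lists (assumption); real tables would be much larger.
COMMON_WORDS: Dict[str, FrozenSet[str]] = {
    "en": frozenset({"the", "and", "is", "of", "to", "not"}),
    "no": frozenset({"og", "er", "ikke", "det", "en", "jeg"}),
    "de": frozenset({"und", "ist", "nicht", "das", "der", "ich"}),
}


def guess_language(text: str) -> str:
    """Return the language whose common words occur most often in text."""
    words = text.lower().split()
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    # With a tie, the first language in COMMON_WORDS wins; short texts are
    # therefore unreliable, which matches the point about text length.
    return max(scores, key=lambda lang: scores[lang])


english = guess_language("the cat is not on the table")
norwegian = guess_language("jeg er ikke en katt")
german = guess_language("das ist nicht der hund und ich")
```

Accuracy here depends directly on the size of the tables and the length of the input, which is the trade-off described above.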
> the protocol must specify how to
> transfer the language information if it is to be available at all.
> The interaction between language and processing is complex; for
> instance, if I compare "name-of-thing(lang=en)" to
> "name-of-thing(lang=no)" for equality, I will generally expect a match,
> while the word "ask(no)" is a kind of tree, and is hardly useful
> as a command verb.
Good point!
Please add the following:
Please note that language information as such is not needed
for the end user; humans have no problem identifying the
languages they know and separating them from those they
don't know.
Please note that languages are not as clearcut a concept as
character sets. There are mixtures of languages, language
variants, words that move from one language to another,
and text parts that are not in any particular language.
> 4.2. Requirement for language tagging
>
> Protocols that transfer text MUST provide for carrying information
> about the language of that text.
This is most probably too strong.
What about:
Protocols that transfer text MUST provide for carrying language
information to the extent and in the granularity that this is
necessary and appropriate for the operations that the text in
the protocol is generally intended and used for.
> Protocols SHOULD also provide for carrying information about the
> language of names.
Do you seriously want to suggest that we devise some kind of
language-tag syntax for URLs, Email addresses, host names, and
so on?
My objection is not that the syntax for these things is already
hopelessly on the edge; let's just assume we could have a new
start.
As you have said above, you want to ignore language when
comparing names for equality. This makes a lot of sense.
Also, names often appear on paper. Noting the language there
is a strong burden, without much benefit. If names are carried
as part of other text, then using that mechanism for giving
language information should be perfectly appropriate. For
names in isolation, language information doesn't make sense.
> Note that this does NOT mean that such information must always be
> present; the requirement is that if the sender of information
> wishes to send information about the language of a text, the
> protocol provides a well-defined way to carry this information.
Good point.
> 4.3. How to identify a language
>
> The RFC 1766 language tag is at the moment the most flexible tool
> available for identifying a language; protocols SHOULD use this,
> or provide clear and solid justification for doing otherwise in
> the document.
>
> In particular, claiming that a language can be deduced from the
> charset in use is erroneous and will not be accepted.
Correct. But isn't this all too obvious, given things like
iso-8859-1? I don't think you need this in any way to be able
to reject such claims should they ever come up.
> 4.4. Considerations for negotiation
Please say "language negotiation". I get the impression, also
at other points, that Norwegian relies more on implicit things
than (American) English :-).
> Protocols where users have text presented to them in response to
> user actions MUST provide for multiple languages.
This is too sweeping. Some people could think that it means that
a protocol must provide at least two languages, or that every
implementation has to provide multiple languages.
Please say something like:
Protocols where users have text presented to them in response
to user actions MUST provide the means by which implementors
can satisfy the language needs of the users.
> In some cases, a negotiation where the client proposes a set of
> languages and the server replies with one is appropriate; in other
> cases, supplying information in all available languages is a
> better solution; most sites will either have very few languages
> installed or be willing to pay the overhead of sending error
> messages in many languages at once.
I don't agree. There may be only few sites that have many
languages available, but those may be contacted by users
with special language needs that can't afford the bandwidth
(even if the server side providing these many languages has
no problem with the bandwidth).
Also, there is an increasing tendency for products to ship
with all language versions integrated. For an NS or MS server,
you won't buy a specific language version anymore in the very
near future.
> Negotiation is useful in the case where one side of the protocol
> exchange is able to present text in multiple languages to the
> other side, and the other side has a preference for one of these;
> the most common example is the text part of error responses, or
> Web pages that are available in multiple languages.
The "one side is able" is somewhat dangerous here. A WG may
just come and tell you: Our servers all just do English,
they are not able to do anything else, so this doesn't apply.
> 4.5. Default Language
>
> When human-readable text must be presented in a context where the
> sender has no knowledge of the recipient's language preferences
> (such as login failures or E-mailed warnings, or prior to language
> negotiation), text SHOULD be presented in Default Language.
>
> The Default Language is English, since this is the language which
> most people will be able to get adequate help in interpreting when
> working with computers.
It may be a good idea to replace "most people" by "the greatest number
of people". This is a sensitive spot, and "most people" is saying
something about their absolute percentage, whereas we just need to
say that it is better than any other language we could pick.
> Note that negotiating English is NOT the same as Default Language;
> Default Language is an emergency measure in otherwise unmanageable
> situations. It may be appropriate for application designers to
> make sure that messages in Default Language are understandable to
> people with a limited understanding of the English language.
The following is implicit here, but has led to prolonged discussions
on some lists:
What I think the text above says is that it's not permitted to
say: "If the client doesn't negotiate language, this defaults to
English (or whatever other "default" language)."
If this is the case, it would be better to explicitly state:
Protocols MUST NOT define a default language to avoid language
negotiation; language MUST be explicitly negotiated for all
languages.
I think it's better to make this clear, if this is what is desired,
and something else otherwise, than to have more such discussions.
> 5. Locale
> In some cases, and especially with text where the user is expected
> to do processing on the text, locale information may be usefully
> attached to the text; this would identify the sender's opinion
> about appropriate rules to follow when processing the document,
> which the recipient may choose to agree with or ignore.
>
> This document does not require the communication of locale
> information on all text, but encourages its inclusion when
> appropriate.
The above is not very clearcut, but there is probably nothing
better in sight.
Please add something like the following:
6. Documentation
Protocols MUST appropriately document the decisions they have
taken with respect to charsets, language information, and other
aspects related to internationalization and multilinguality.
A format such as that currently used for Security Issues is
(highly) recommended.
Another thing, which should probably go into section 2 or so,
and which seems needed as a response to some of the questions
in the plenary in Munich, is a clarification of which protocol
in a protocol stack is responsible for charset and language
information. I'm not sure that I have found the best way
to express this, but it could read as follows:
Note that in a protocol stack, it is the responsibility of
the highest layer that uses the text to appropriately label
it. As an example, it is the responsibility of the standard
for mail messages to assure things get correctly labeled in
mail messages, even if those are sent over SMTP. It is the
responsibility of SMTP to correctly label text which is
exchanged as part of the SMTP protocol and is intended for
end-user consumption, even if SMTP is run over TCP/IP.
It would be the responsibility of IP to label text correctly
if it ever would consider using text in its protocol elements
(as opposed to transporting text in its payload).
Regards, Martin.
draft Charset policy June 97
IETF Policy on Character Sets and Languages
Fri Aug 29 10:41:03 MET DST 1997
Harald Tveit Alvestrand
UNINETT
Harald.T.Alvestrand@uninett.no
Status of this Memo
This draft document is being circulated for comment.
Please send comments to the author, or to the mailing list
<ietf-charsets@innosoft.com>
The following text is required by the Internet-draft rules:
This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its
Areas, and its Working Groups. Note that other groups may also
distribute working documents as Internet Drafts.
Internet Drafts are draft documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use
Internet Drafts as reference material or to cite them other than
as a "working draft" or "work in progress."
Please check the I-D abstract listing contained in each Internet
Draft directory to learn the current status of this or any other
Internet Draft.
The file name of this version is
draft-alvestrand-charset-policy-01.txt
Alvestrand Expires Dec 97 [Page 1]
1. Introduction
The Internet is international.
With the international Internet follows an absolute requirement to
interchange data in a multiplicity of languages, which in turn
utilize a bewildering number of characters.
This document is (INTENDED TO BE) the current policies being
applied by the Internet Engineering Steering Group towards the
standardization efforts in the Internet Engineering Task Force in
order to help Internet protocols fulfil these requirements.
The document is very much based upon the recommendations of the
IAB Character Set Workshop of February 29-March 1, 1996, which is
documented in RFC 2130 [WR]. This document attempts to be concise,
explicit and clear; people wanting more background are encouraged
to read RFC 2130.
The document uses the terms "MUST", "SHOULD" and "MAY", and their
negatives, in the way described in [RFC 2119]. In this case, "the
specification" as used by RFC 2119 refers to the processing of
protocols being submitted to the IETF standards process.
2. Where to do internationalization
Internationalization is for humans. This means that protocols are
not subject to internationalization; text strings are. Where
protocols may masquerade as text strings, such as in many IETF
application layer protocols, protocols MUST specify which parts
are protocol and which are text. [WR 2.2.1.1]
Names are a problem, because people feel strongly about them, many
of them are mostly for local usage, and all of them tend to leak
out of the local context at times. RFC 1958 [ARCH] recommends
US-ASCII for all globally visible names.
This document does not mandate a policy on name
internationalization, but requires that all protocols describe
whether names are internationalized or US-ASCII.
3. Definition of Terms
This document uses the term "charset" to mean a set of rules for
mapping from a sequence of octets to a sequence of characters,
such as the combination of a coded character set and a character
encoding scheme; this is also what is used as an identifier in
MIME "charset=" parameters, and registered in the IANA charset
registry [REG].
For a definition of the term "coded character set", refer to the
workshop report.
A "name" is an identifier such as a person's name, a hostname, a
domainname, a filename or an E-mail address; it is often treated
as an identifier rather than as a piece of text, and is often used
in protocols as an identifier for entities, without surrounding
text.
3.1. What charset to use
All protocols MUST identify, for all character data, which charset
is in use.
Protocols MUST be able to use the UTF-8 charset, which consists of
the ISO 10646 coded character set combined with the UTF-8
character encoding scheme, as defined in [10646] Annex R
(published in Amendment 2), for all text.
They MAY specify how to use other charsets or other character
encoding schemes for ISO 10646, such as UTF-16, but lack of an
ability to use UTF-8 needs clear and solid justification in the
protocol specification document before being entered into or
advanced upon the standards track.
For existing protocols or protocols that move data from existing
datastores, support of other charsets, or even using a default
other than UTF-8, may be a requirement. This is acceptable, but
UTF-8 support MUST be possible.
When using charsets other than UTF-8, these MUST be registered in
the IANA charset registry, if necessary by registering them when
the protocol is published.
(Note: ISO 10646 calls the UTF-8 CES a "Transfer Format" rather
than a "character encoding scheme", but it fits the charset report
definition of a character encoding scheme).
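As a rough illustration of the labeling rule in this section (a sketch only, not part of the policy; the names `decode_labeled` and `MissingCharsetLabel` are hypothetical), a receiver might refuse unlabeled character data, always accept UTF-8, and treat other charsets as optional extras:

```python
from typing import Optional


class MissingCharsetLabel(ValueError):
    """Raised when text arrives without a charset label (hypothetical type)."""


def decode_labeled(octets: bytes, charset: Optional[str]) -> str:
    """Decode octets using the charset label supplied by the protocol.

    The label is mandatory: the receiver never guesses the charset.
    """
    if not charset:
        raise MissingCharsetLabel("character data must identify its charset")
    # bytes.decode raises LookupError for a charset this implementation does
    # not know, and UnicodeDecodeError for octets invalid in that charset.
    return octets.decode(charset)


# The same text sent once as UTF-8 and once in another labeled charset.
utf8_octets = "blåbærsyltetøy".encode("utf-8")
latin1_octets = "blåbærsyltetøy".encode("iso-8859-1")

text_from_utf8 = decode_labeled(utf8_octets, "utf-8")
text_from_latin1 = decode_labeled(latin1_octets, "iso-8859-1")
```

Both calls recover the same text because each payload carries its own label; without the label the two octet sequences could not be told apart reliably.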
3.2. How to decide a charset
In some cases, like HTTP, there is direct or semi-direct
communication between the producer and the consumer of data
containing text. In such cases, it may make sense to negotiate a
charset before sending data.
In other cases, like E-mail or stored data, there is no such
communication, and the best one can do is to make sure the charset
is clearly identified with the stored data, and to choose a charset
that is as widely known as possible.
Note that a charset is an absolute; text that is encoded in a
charset cannot be rendered comprehensibly without supporting that
charset.
(This also applies to English; charsets like EBCDIC do NOT have
ASCII as a proper subset)
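The EBCDIC remark can be checked with a short sketch. It assumes Python's built-in `cp037` codec as a representative EBCDIC code page; other EBCDIC variants differ in detail.

```python
text = "Hello"

ascii_octets = text.encode("ascii")
utf8_octets = text.encode("utf-8")
ebcdic_octets = text.encode("cp037")  # cp037 is one common EBCDIC code page

# UTF-8 encodes ASCII characters with the same octets as ASCII does.
same_as_ascii_in_utf8 = (utf8_octets == ascii_octets)

# EBCDIC assigns different octets to the same letters.
same_as_ascii_in_ebcdic = (ebcdic_octets == ascii_octets)

# Reading EBCDIC octets as if they were ISO 8859-1 does not give the text back.
misread = ebcdic_octets.decode("iso-8859-1")
```

Even plain English text is therefore unreadable if the receiver assumes the wrong charset.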
Negotiating a charset may be regarded as an interim mechanism that
is to be supported until UTF-8 support is prevalent; however, the
timeframe of "interim" may be at least 50 years, so there is every
reason to think of it as permanent in practice.
4. Languages
4.1. The need for language information
All human-readable text has a language.
Many operations, including high quality formatting, text-to-speech
synthesis, searching, hyphenation, spellchecking and so on need
access to information about the language of a piece of text. [WC
3.1.1.4].
Humans have some tolerance for foreign languages, but are
generally very unhappy with being presented text in a language
they do not understand; this is why negotiation of language is
needed.
In most cases, machines cannot deduce the language of a
transmitted text by themselves; the protocol must specify how to
transfer the language information if it is to be available at all.
The interaction between language and processing is complex; for
instance, if I compare "name-of-thing(lang=en)" to
"name-of-thing(lang=no)" for equality, I will generally expect a match,
while the word "ask(no)" is a kind of tree, and is hardly useful
as a command verb.
4.2. Requirement for language tagging
Protocols that transfer text MUST provide for carrying information
about the language of that text.
Protocols SHOULD also provide for carrying information about the
language of names.
Note that this does NOT mean that such information must always be
present; the requirement is that if the sender of information
wishes to send information about the language of a text, the
protocol provides a well-defined way to carry this information.
4.3. How to identify a language
The RFC 1766 language tag is at the moment the most flexible tool
available for identifying a language; protocols SHOULD use this,
or provide clear and solid justification for doing otherwise in
the document.
In particular, claiming that a language can be deduced from the
charset in use is erroneous and will not be accepted.
Note also that a language is distinct from a POSIX locale; a POSIX
locale identifies a set of cultural conventions, which may imply a
language (the POSIX or "C" locale of course does not), while a
language tag as described in RFC 1766 identifies only a language.
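For illustration, the tag syntax of RFC 1766 (a primary tag of 1 to 8 letters, optionally followed by hyphen-separated subtags of 1 to 8 letters, compared case-insensitively) can be checked with a few lines. This is a sketch with hypothetical function names; it checks syntax only, not whether a tag is actually registered.

```python
import re

# RFC 1766 syntax: a primary tag of 1 to 8 letters, followed by zero or more
# subtags of 1 to 8 letters each, separated by hyphens.
_LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def is_wellformed_language_tag(tag: str) -> bool:
    """Check syntax only; this does not check that the tag is registered."""
    return _LANGUAGE_TAG.fullmatch(tag) is not None


def primary_tag(tag: str) -> str:
    """Return the lowercased primary tag; tags are case-insensitive."""
    return tag.split("-", 1)[0].lower()
```

A POSIX locale name such as no_NO.iso-8859-1 is not a well-formed language tag under this check, which reflects the distinction drawn above.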
4.4. Considerations for negotiation
Protocols where users have text presented to them in response to
user actions MUST provide for multiple languages.
In some cases, a negotiation where the client proposes a set of
languages and the server replies with one is appropriate; in other
cases, supplying information in all available languages is a
better solution; most sites will either have very few languages
installed or be willing to pay the overhead of sending error
messages in many languages at once.
Negotiation is useful in the case where one side of the protocol
exchange is able to present text in multiple languages to the
other side, and the other side has a preference for one of these;
the most common example is the text part of error responses, or
Web pages that are available in multiple languages.
Negotiating a language should be regarded as a permanent
requirement of the protocol that will not go away at any time in
the future.
In many cases, it should be possible to include it as part of the
connection establishment, together with authentication and other
preferences negotiation.
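One possible shape for the first style of negotiation described in this section, where the client proposes an ordered set of languages and the server replies with one, is sketched below. The function name and the matching rule (exact match first, then a match on the primary tag) are assumptions made for the example; this document does not prescribe a matching algorithm.

```python
from typing import Optional, Sequence


def negotiate_language(
    client_preferences: Sequence[str],
    server_languages: Sequence[str],
) -> Optional[str]:
    """Return the server language that best fits the client's ordered list.

    Tags are compared case-insensitively. An exact match wins; otherwise a
    match on the primary tag (the part before the first hyphen) is accepted.
    Returns None when nothing matches, leaving the fallback to the caller.
    """
    for wanted in client_preferences:
        wanted_lower = wanted.lower()
        for offered in server_languages:
            if offered.lower() == wanted_lower:
                return offered
        wanted_primary = wanted_lower.split("-", 1)[0]
        for offered in server_languages:
            if offered.lower().split("-", 1)[0] == wanted_primary:
                return offered
    return None


available = ["en", "no", "de-CH"]
first_choice = negotiate_language(["fr", "NO"], available)
regional = negotiate_language(["de-AT"], available)
no_match = negotiate_language(["ja"], available)
```

Returning None rather than silently choosing English keeps a failed negotiation visibly separate from a negotiated result; what to present in that case is a separate decision for the caller.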
4.5. Default Language
When human-readable text must be presented in a context where the
sender has no knowledge of the recipient's language preferences
(such as login failures or E-mailed warnings, or prior to language
negotiation), text SHOULD be presented in Default Language.
The Default Language is English, since this is the language which
most people will be able to get adequate help in interpreting when
working with computers.
Note that negotiating English is NOT the same as Default Language;
Default Language is an emergency measure in otherwise unmanageable
situations. It may be appropriate for application designers to
make sure that messages in Default Language are understandable to
people with a limited understanding of the English language.
5. Locale
The POSIX standard [POSIX] defines a concept called a "locale",
which includes a lot of information about collating order for
sorting, date format, currency format and so on.
In some cases, and especially with text where the user is expected
to do processing on the text, locale information may be usefully
attached to the text; this would identify the sender's opinion
about appropriate rules to follow when processing the document,
which the recipient may choose to agree with or ignore.
This document does not require the communication of locale
information on all text, but encourages its inclusion when
appropriate.
Note that language and character set information will often be
present as parts of a locale tag (such as no_NO.iso-8859-1; the
language is before the underscore and the character set is after
the dot); care must be taken to define precisely which
specification of character set and language applies to any one
text item.
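The structure of such a locale tag can be made explicit with a small sketch. It assumes the simple form language[_territory][.charset] shown above and ignores any other components a platform may allow; the names `LocaleParts` and `split_locale_tag` are hypothetical.

```python
from typing import NamedTuple, Optional


class LocaleParts(NamedTuple):
    language: str
    territory: Optional[str]
    charset: Optional[str]


def split_locale_tag(tag: str) -> LocaleParts:
    """Split a tag of the form language[_territory][.charset]."""
    base, dot, charset = tag.partition(".")
    language, underscore, territory = base.partition("_")
    return LocaleParts(
        language=language,
        territory=territory if underscore else None,
        charset=charset if dot else None,
    )


parts = split_locale_tag("no_NO.iso-8859-1")
# For the "POSIX" locale the first field is a locale name, not a language.
posix = split_locale_tag("POSIX")
```

If a text item also carries an explicit charset label or language tag, the protocol still has to state which of the two sources takes precedence.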
The default locale is the "POSIX" locale.
6. Security considerations
Apart from the fact that security warnings in a foreign language
may cause inappropriate behaviour from the user, and the fact that
multilingual systems usually have problems with consistency
between language variants, no relevant security considerations
have been identified.
7. References
[10646]
ISO/IEC, Information Technology - Universal Multiple-Octet
Coded Character Set (UCS) - Part 1: Architecture and Basic
Multilingual Plane, May 1993, with amendments
[RFC 2119]
S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", 03/26/1997 - RFC 2119
[WR] C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R.
Atkinson, M. Crispin, P. Svanberg, "The Report of the IAB
Character Set Workshop held 29 February - 1 March, 1996",
04/21/1997, RFC 2130
[ARCH]
B. Carpenter, "Architectural Principles of the Internet",
06/06/1996, RFC 1958
[POSIX]
ISO/IEC 9945-2:1993 Information technology -- Portable
Operating System Interface (POSIX) -- Part 2: Shell and
Utilities
[REG]
N. Freed, J. Postel: IANA Charset Registration Procedures,
Work In Progress (draft-freed-charset-reg-02.txt)
[UTF-8]
F. Yergeau: UTF-8, a transformation format of Unicode and
ISO 10646, Work In Progress (draft-yergeau-utf8-rev-00.txt,
obsoletes RFC 2044)
8. Author's address
Harald Tveit Alvestrand
UNINETT
P.O.Box 6883 Elgeseter
N-7002 TRONDHEIM
NORWAY
+47 73 59 70 94
Harald.T.Alvestrand@uninett.no