Suggested character set policy for the IETF
internet-drafts: Please publish the attached document.
Mailing lists: I've redirected replies to the single list
ietf-languages@uninett.no - please respect this.
You can subscribe to this list by sending mail to majordomo@uninett.no
with the words
subscribe ietf-languages
in the BODY of the message.
Regards,
Harald T. Alvestrand
draft Charset policy June 97
IETF Policy on Character Sets and Languages
Sun Jun 15 14:23:36 MET DST 1997
Harald Tveit Alvestrand
UNINETT
Harald.T.Alvestrand@uninett.no
Status of this Memo
This draft document is being circulated for comment.
Please send comments to the author.
The following text is required by the Internet-draft rules:
This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its
Areas, and its Working Groups. Note that other groups may also
distribute working documents as Internet Drafts.
Internet Drafts are draft documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use
Internet Drafts as reference material or to cite them other than
as a "working draft" or "work in progress."
Please check the I-D abstract listing contained in each Internet
Draft directory to learn the current status of this or any other
Internet Draft.
The file name of this version is draft-ietf-charset-policy-00.txt
Alvestrand Expires Dec 97 [Page 1]
1. Introduction
The Internet is international.
With the international Internet follows an absolute requirement to
interchange data in a multiplicity of languages, which in turn
utilize a bewildering number of characters or other character-like
representation mechanisms.
This document is intended to state the current policies applied
by the Internet Engineering Steering Group to the standardization
efforts of the Internet Engineering Task Force, in order to help
Internet protocols fulfil these requirements.
The document is very much based upon the recommendations of the
IAB Character Set Workshop of February 29-March 1, 1996, which is
documented in RFC 2130 [WR]. This document attempts to be concise,
explicit and clear; people wanting more background are encouraged
to read RFC 2130.
The document uses the terms "MUST", "SHOULD" and "MAY", and their
negatives, in the way described in [RFC 2119]. In this case, "the
specification" as used by RFC 2119 refers to the processing of
protocols being submitted to the IETF standards process.
2. Where to do internationalization
Internationalization is for humans. This means that protocols are
not subject to internationalization; text strings are. Where
protocols may masquerade as text strings, such as in many IETF
application layer protocols, protocols MUST specify which parts
are protocol and which are text. [WR 2.2.1.1]
Names are a problem, because people feel strongly about them, many
of them are mostly for local usage, and all of them tend to leak
out of the local context at times. RFC 1958 [ARCH] recommends
US-ASCII for all globally visible names.
This document does not mandate a policy on name
internationalization, but requires that all protocols describe
whether names are internationalized or US-ASCII.
3. Character sets
For a definition of the term "character set", refer to the
workshop report. Like MIME, this document uses it to mean the
combination of a coded character set and a character encoding
scheme.
3.1. What character set to use
All protocols MUST identify, for all character data, which
character set is in use.
Protocols MUST be able to use the ISO 10646 coded character set,
with the UTF-8 character encoding scheme, for all text. (This is
called "UTF-8" in the rest of this document.)
They MAY specify how to use other character sets or other
character encoding schemes, such as UTF-16, but lack of an ability
to use UTF-8 needs clear and solid justification in the protocol
specification document before being entered into or advanced upon
the standards track.
For existing protocols or protocols that move data from existing
datastores, support of other character sets, or even using a
default other than UTF-8, may be a requirement. This is
acceptable, but UTF-8 support MUST be possible.
When character sets other than UTF-8 are used, they MUST be
registered in the IANA character set registry, if necessary by
registering them when the protocol is published.
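The two requirements above (identify the character set, and be able
to use UTF-8) can be illustrated with a minimal sketch. The function
names encode_labelled and decode_labelled are hypothetical, and the
label is simply carried next to the octets; a real protocol would
define its own field for it.

```python
def encode_labelled(text, charset="UTF-8"):
    """Return (label, octets) so the character set travels with the data.

    Hypothetical helper: UTF-8 is the default, other names are passed
    through unchanged as the label.
    """
    octets = text.encode(charset)  # LookupError if the name is unknown
    return charset, octets


def decode_labelled(label, octets):
    """Decode octets using the character set named by the label."""
    return octets.decode(label)


# LATIN SMALL LETTER A WITH RING ABOVE, written as an escape
label, octets = encode_labelled("\u00e5")
assert decode_labelled(label, octets) == "\u00e5"

# The same octets read under the wrong label give different text,
# which is why the label is not optional.
assert decode_labelled("ISO-8859-1", octets) != "\u00e5"
```

The last assertion shows the failure mode that the identification
requirement guards against: the octets are valid in both character
sets, so nothing but the label tells the receiver which text was
meant.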
3.2. How to decide a character set
In some cases, like HTTP, there is direct or semi-direct
communication between the producer and the consumer of a character
set. In this case, it may make sense to negotiate a character set
before sending data.
In other cases, like E-mail or stored data, there is no such
communication, and the best one can do is to make sure the
character set is clearly identified with the stored data, and to
choose a character set that is as widely known as possible.
Note that a character set is an absolute; for almost all languages
except English and a few others written in Latin-based scripts,
text cannot be rendered comprehensibly unless the right character
set is supported.
Negotiating a character set may be regarded as an interim
mechanism that is to be supported until UTF-8 support is
prevalent; however, the timeframe of "interim" may be at least 50
years, so there is every reason to think of it as permanent in
practice.
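A negotiation of the kind described above might be sketched as
follows. This is an illustration only: the function name
negotiate_charset, the ordered preference list and the fallback to
UTF-8 when nothing matches are assumptions of this sketch, not
something this document specifies.

```python
def negotiate_charset(client_accepts, server_supports, fallback="UTF-8"):
    """Pick the first client-preferred character set the server offers.

    client_accepts is ordered, most preferred first. Names are
    compared without regard to case. If nothing matches, the sketch
    falls back to UTF-8 (an assumption made here for illustration).
    """
    supported = {name.lower(): name for name in server_supports}
    for name in client_accepts:
        match = supported.get(name.lower())
        if match is not None:
            return match
    return fallback


chosen = negotiate_charset(["iso-8859-1", "utf-8"], ["UTF-8", "ISO-8859-1"])
assert chosen == "ISO-8859-1"
```

For stored data or E-mail no such exchange is possible, so only the
labelling step applies there.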
4. Languages
4.1. The need for language information
All human-readable text has a language.
Many operations, including high quality formatting, text-to-speech
synthesis, searching, sorting, spellchecking and so on, need access
to information about the language of a piece of text.
[WR 3.1.1.4]
Humans have some tolerance for foreign languages, but are
generally dissatisfied with being presented text in a language
they do not understand; this is why negotiation of language is
needed.
In most cases, machines cannot deduce the language by themselves;
the protocol must specify how to transfer the language information
if it is to be available at all.
(Some items, like domain names and other names, may in some cases
be very useful without this information.)
The interaction between language and processing is complex; for
instance, if I compare "hosta(lang=en)" to "hosta(lang=no)" I will
generally expect a match, whereas "aasmund" sorts after "attaboy"
according to Norwegian rules, but before it according to English
rules. (The "aa" is sorted together with "latin letter a with ring
above", which is at the end of the Norwegian alphabet.)
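The sorting example can be reproduced with a toy sketch. The two key
functions below are hypothetical and deliberately simplified: they
handle only lower-case letters, and the Norwegian key models just the
one rule mentioned above ("aa" sorts as a with ring above, after z).
Real collation rules are considerably more involved.

```python
ENGLISH = "abcdefghijklmnopqrstuvwxyz"
# Norwegian alphabet: a-z, then ae ligature, o with stroke, a with ring
NORWEGIAN = ENGLISH + "\u00e6\u00f8\u00e5"


def english_key(word):
    """Sort key under plain a-z ordering (toy rule, letters a-z only)."""
    return [ENGLISH.index(ch) for ch in word.lower()]


def norwegian_key(word):
    """Sort key that treats "aa" as a with ring above (toy rule)."""
    folded = word.lower().replace("aa", "\u00e5")
    return [NORWEGIAN.index(ch) for ch in folded]


words = ["attaboy", "aasmund"]
assert sorted(words, key=english_key) == ["aasmund", "attaboy"]
assert sorted(words, key=norwegian_key) == ["attaboy", "aasmund"]
```

The same two strings come out in opposite orders, which is why a
sorting operation needs to know the language of the text it is given.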
4.2. How to identify a language
The RFC 1766 language tag is at the moment the most flexible tool
available for identifying a language; protocols SHOULD use this,
or provide clear and solid justification for doing otherwise in
the document.
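As an illustration, the RFC 1766 tag syntax (a primary tag of one to
eight letters, followed by zero or more hyphen-separated subtags of
one to eight letters each) can be checked with a short sketch. The
function name parse_language_tag is hypothetical, and lower-casing the
result is a choice made here because tags are compared without regard
to case.

```python
import re

# Primary tag and subtags: 1 to 8 ASCII letters, separated by "-"
_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def parse_language_tag(tag):
    """Split a language tag into (primary, [subtags]), lower-cased.

    Raises ValueError if the tag does not fit the syntax above.
    """
    if not _TAG.fullmatch(tag):
        raise ValueError("not a well-formed language tag: %r" % tag)
    primary, *subtags = tag.lower().split("-")
    return primary, subtags


assert parse_language_tag("en-US") == ("en", ["us"])
assert parse_language_tag("no") == ("no", [])
```

Note that this only checks the form of a tag; whether the tag names a
registered language is a separate question.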
4.3. Considerations for negotiation
Protocols that transfer human-readable text MUST provide for
multiple languages.
In some cases, a negotiation where the client proposes a set of
languages and the server replies with one is appropriate; in other
cases, supplying information in all available languages is a
better solution; most sites will either have very few languages
installed or be willing to pay the overhead of sending error
messages in many languages at once.
Negotiation is useful in the case where one side of the protocol
exchange is able to present text in multiple languages to the
other side, and the other side has a preference for one of these;
the most common example is the text part of error responses, or
Web pages that are available in multiple languages.
Negotiating a language should be regarded as a permanent
requirement of the protocol that will not go away at any time in
the future.
In most cases, it should be possible to include it as part of the
connection establishment, together with authentication and other
preferences negotiation.
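The first style of negotiation, where the client proposes a set of
languages and the server replies with one, might look like the sketch
below. The function name negotiate_language and the matching rule
(exact tag first, then the primary tag alone) are assumptions of this
sketch. Returning None leaves the caller free to send the text in all
available languages instead.

```python
def negotiate_language(client_prefers, server_has):
    """Return the server's tag for the first preference it can satisfy.

    client_prefers is ordered, most preferred first. A request for
    "en-GB" is also satisfied by a server that only has "en"; this
    fallback to the primary tag is an assumption of the sketch.
    Returns None when nothing matches.
    """
    available = {tag.lower(): tag for tag in server_has}
    for wanted in client_prefers:
        key = wanted.lower()
        if key in available:
            return available[key]
        primary = key.split("-")[0]
        if primary in available:
            return available[primary]
    return None


assert negotiate_language(["no", "en"], ["en", "fr"]) == "en"
assert negotiate_language(["ja"], ["en", "fr"]) is None
```

In a real protocol this exchange would typically be carried in the
same connection set-up step as authentication and other preferences.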
4.4. Default Language
When human-readable text must be presented in a context where the
sender has no knowledge of the recipient's language preferences
(such as login failures or E-mailed warnings, or prior to language
negotiation), text SHOULD be presented in the Default Language.
The Default Language is English, since this is the language which
most people will be able to get adequate help in interpreting when
working with computers.
Note that negotiating English is NOT the same as Default Language;
Default Language is an emergency measure in otherwise unmanageable
situations.
5. Locale
POSIX defines a concept called a "locale", which includes a lot of
information about collating order, date format, currency format
and so on.
In some cases, and especially with text where the user is expected
to do processing on the text, locale information may be usefully
attached to the text.
This document does not require the communication of locale
information on all text, but encourages its inclusion when
appropriate.
Note that the language and character set will often be present as
parts of a locale tag (such as no_NO.iso-8859-1; the language is
before the _ and the character set is after the dot); care must be
taken to define precisely which specification of character set and
language applies to any one text item.
The default locale is the POSIX locale.
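The structure of a locale tag such as no_NO.iso-8859-1 can be taken
apart with a small sketch. The function name split_locale_tag is
hypothetical, and the sketch handles only the
language_TERRITORY.charset form shown above; other parts that some
systems allow in locale names are ignored.

```python
def split_locale_tag(tag):
    """Split "language_TERRITORY.charset" into its three parts.

    Missing parts are returned as None. Only the form discussed in
    the text is handled.
    """
    name, _, charset = tag.partition(".")       # charset follows the dot
    language, _, territory = name.partition("_")  # language precedes the _
    return language, territory or None, charset or None


assert split_locale_tag("no_NO.iso-8859-1") == ("no", "NO", "iso-8859-1")
```

If a text item also carries an explicit character set label or
language tag, the protocol still has to say which of the two sources
takes precedence; the sketch does not address that.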
6. Security considerations
Apart from the fact that security warnings in a foreign language
may cause inappropriate behaviour from the user, and the fact that
multilingual systems usually have problems with consistency
between language variants, no relevant security considerations
have been identified.
7. References
[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", 03/26/1997, RFC 2119
[WR] C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R.
Atkinson, M. Crispin, P. Svanberg, "The Report of the IAB
Character Set Workshop held 29 February - 1 March, 1996",
04/21/1997, RFC 2130
[ARCH] B. Carpenter, "Architectural Principles of the Internet",
06/06/1996, RFC 1958
8. Author's address
Harald Tveit Alvestrand
UNINETT
P.O.Box 6883 Elgeseter
N-7002 TRONDHEIM
NORWAY
+47 73 59 70 94
Harald.T.Alvestrand@uninett.no