
Charset name length(s)



On Sun December 5 2004 13:36, McDonald Ira wrote:
> Hi,
> 
> Relative to Bruce's suggestion that the 40 character restriction
> in names applies only to MIBs:
> 
> (1) MIBs in both SMIv1 and SMIv2 have always supported the ASN.1
>     standard maximum of 63 characters for identifiers
> 
> (2) But, due to underlying linker restrictions, _many_ MIB compilers
>     truncate identifiers at 31 characters (or arbitrarily rewrite
>     them after about 25 characters)
> 
> (3) So 40 characters isn't a helpful restriction for MIB names.

[I'm copying the ietf-822 list as the issue(s) discussed
affect MIME and the Internet Message Format; responses
to the charset-specific part should remain on ietf-charsets.
I'm also copying the ietf and ietf-languages lists where
a related discussion about language tags is taking place.]

To date, I have merely pointed out that the registration
for MIME names imposes no upper bound, but that the MIB
requirements do indicate a limit for the cs* aliases.  I
have not stated whether I thought that there should be an
explicit limit in general. It is now time to speak up on
that matter.

I am prompted to do so by considerations arising from a
proposal to replace RFC 3066, which defines language tags
and their registration procedure.  Charset names and
language tags are connected by way of RFC 2231, which
amended RFC 2047's definition of "encoded-word" to include
provision for a language tag.  An encoded-word has the
form (my representation, not the official one; for the
latter consult RFC 2231 and errata):

  =?<charset>*<language-tag>?<encoding>?<text>?=

The text part must be at least 4 octets in order to accommodate
B encoding restrictions. Encodings are currently represented
by a single octet, and as encodings are intended to be limited
in number, let's assume that that will suffice indefinitely.
RFC 2047 limits an encoded-word to 75 octets in total, of which
the fixed delimiters consume 7.  That leaves a maximum of 63
octets for the total length of the charset name and the
language-tag.  RFC 2978 (charset name registration) provides a
procedure for review, so while the charset name could
theoretically be unbounded in length, the
review process is expected to catch cases which would prove
problematical for encoded-words -- in fact, so far as I can
determine, the longest charset name suitable for use in an
encoded-word (i.e. charsets suitable for text/plain, considering
the preferred MIME name where specified, otherwise the primary
name) has a length of 45 octets.
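
The 63-octet figure can be reproduced with a little arithmetic.
The following is only a sketch: the 75-octet ceiling is the
encoded-word limit from RFC 2047 (not restated above), and the
names MAX_ENCODED_WORD, DELIMITERS and name_budget are invented
here for illustration.

```python
# Octet budget for the simplified form
#   =?<charset>*<language-tag>?<encoding>?<text>?=
MAX_ENCODED_WORD = 75  # RFC 2047 ceiling on a whole encoded-word
MIN_TEXT = 4           # shortest B-encoded text, as assumed in the message
ENCODING_TAG = 1       # encodings are currently a single octet ("B" or "Q")

# Fixed delimiters: "=?", "*", "?", "?", "?="
DELIMITERS = len("=?") + len("*") + len("?") + len("?") + len("?=")


def name_budget(text_len=MIN_TEXT, encoding_len=ENCODING_TAG):
    """Octets left over for charset name plus language tag."""
    return MAX_ENCODED_WORD - DELIMITERS - encoding_len - text_len


print(name_budget())  # 63
```

A longer minimum text (say, for some future encoding) shrinks
the budget octet for octet, e.g. name_budget(text_len=8).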

RFC 2231 also provides for charset specification in extended
parameters used with Content-Type and Content-Disposition
fields; these are not required to be charsets suitable for
text/plain, and the permitted combined length of charset and
language tag is much greater than in an encoded-word
(but still finite).

Under RFC 3066, there is a similar registration and review
procedure, and while again there is the theoretical
possibility of a very long language tag, the longest such
registered tag has a length of 11 octets.

Combined, the longest charset and longest language tag
total 56 octets, which is less than the 63 octet limit
imposed by encoded-word syntax.

Unregistered, private-use charset and/or language-tags
could of course be longer; that does not concern me.
Private-use requires coordination between communicating
parties, and it is a matter for those parties to agree
on private-use tags that fit within the relevant limits.

There is a draft proposal for a replacement of RFC 3066
which would decouple non-private-use language tag use
from the review/registration procedure and which would
provide for non-private-use language tags of unbounded
length.  That not only represents a problem for encoded-word
use, but it is also a problem for Internet Message
Format header (message- and MIME-part) fields which use
language tags, such as RFC 3282's Content-Language and
Accept-Language.  A "New Last Call" has been issued
for the draft proposal on the ietf-announce list:
http://www1.ietf.org/mail-archive/web/ietf-announce/current/msg00755.html

RFC 2047 gives rationale for the encoded-word limit,
and the Message Format limit can be found in RFCs 2821
and 2822.  Given the large deployed base of software
implementing those core Internet protocols, I do not
foresee an opportunity to increase the encoded-word
length limit at this time. Consequently, the maximum
total for registered charset and language tags remains
at no more than 63 octets (and it is conceivable that
future encodings might require a longer text portion).
I suggest that charset names and aliases be limited to
the current maximum of 45 octets, and that language-tags
for use in encoded-words and extended parameters be
limited to 16 octets (an increase of 45% over the
longest registered language tag).  That leaves but 2
octets of expansion room for encoding tags and/or
encoding-driven restrictions on the encoded text.
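
As an illustration of how the suggested limits might be applied,
here is a rough sketch of a checker.  It is not taken from any
existing software: the regular expression is a loose rendering
of the simplified form shown earlier (not the official RFC 2231
grammar), and check_encoded_word and the constant names are
hypothetical.

```python
import re

MAX_ENCODED_WORD = 75  # RFC 2047 ceiling on a whole encoded-word
NAME_BUDGET = 63       # octets available to charset plus language tag
MAX_CHARSET = 45       # suggested limit for charset names and aliases
MAX_LANGUAGE = 16      # suggested limit for language tags

# Room left for encoding tags or longer minimum text: 2 octets
HEADROOM = NAME_BUDGET - MAX_CHARSET - MAX_LANGUAGE

# Loose pattern for =?<charset>*<language-tag>?<encoding>?<text>?=
# (the "*<language-tag>" part is optional)
_EW = re.compile(r"=\?([^?*]+)(?:\*([^?]+))?\?([^?]+)\?([^?]*)\?=")


def check_encoded_word(word):
    """Return a list of problems; an empty list means none were found."""
    m = _EW.fullmatch(word)
    if m is None:
        return ["not an encoded-word"]
    charset, language, _encoding, _text = m.groups()
    problems = []
    if len(word) > MAX_ENCODED_WORD:
        problems.append("encoded-word exceeds 75 octets")
    if len(charset) > MAX_CHARSET:
        problems.append("charset name exceeds 45 octets")
    if language is not None and len(language) > MAX_LANGUAGE:
        problems.append("language tag exceeds 16 octets")
    return problems


print(HEADROOM)
print(check_encoded_word("=?US-ASCII*en?Q?Keith_Moore?="))
```

The checker counts characters rather than octets, which is the
same thing here because encoded-words are restricted to ASCII.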

Ideally, a lower limit for MIME charset names would
be used; aside from a couple of pathological cases, most
MIME-compatible charset names registered are 17 octets
or fewer in length; many have shorter aliases.  However,
establishing a limit lower than the longest currently-
registered name would require extraordinary action. It
might be possible to assign MIME-preferred-name aliases
to the excessively-long registered charset names, for
example.  However, the overall maximum (regardless of
whether the charset is compatible with MIME text/plain)
should probably be held at 45 octets.  As for the MIB-
specific aliases, I'll leave specific recommendations up
to others, but 45 octets is certainly capable of
accommodating the current MIB-specific limit of 40 octets.