I have a few notes, then a suggested proposal.
1. I ran a check, doing the following things:
- uppercasing each string
- removing all characters except A-Z and 0-9
- removing all leading zeros (zeros not preceded by a digit)
I then checked for collisions, where two different names (or aliases for different names) become identical after these steps. There are only 2 collisions:
Collision between: iso-ir-91 (JIS_C6229-1984-a) and iso-ir-9-1 (NATS-DANO)
Collision between: iso-ir-92 (JIS_C6229-1984-b) and iso-ir-9-2 (NATS-DANO-ADD)
Both of these (looking at http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm) are very old code pages, not in common use, so they can be grandfathered in.
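For concreteness, the check in note 1 could be sketched in Python roughly as follows. This is an illustration, not the script that was actually run; the function names fold and find_collisions, and the registry layout (a dict from each name to its list of aliases), are assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def fold(label: str) -> str:
    """Uppercase, keep only A-Z and 0-9, then drop zeros not preceded by a digit."""
    kept = re.sub(r"[^A-Z0-9]", "", label.upper())
    return re.sub(r"(?<![0-9])0+", "", kept)


def find_collisions(registry: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return pairs of different names whose name or aliases fold to the same key."""
    owners = defaultdict(set)
    for name, aliases in registry.items():
        for label in [name, *aliases]:
            owners[fold(label)].add(name)
    pairs = set()
    for names in owners.values():
        ordered = sorted(names)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return sorted(pairs)


# Tiny illustrative registry; only the names mentioned in this note are used.
sample = {
    "JIS_C6229-1984-a": ["iso-ir-91"],
    "NATS-DANO": ["iso-ir-9-1"],
    "IBM037": ["IBM-037"],
}
print(find_collisions(sample))
```

Note that punctuation is stripped before the zeros are examined, so "IBM-037" and "IBM037" fold to the same key, while the zero in "ISO-8859-10" survives because it follows a digit.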
2. The characters actually in use are A-Z, 0-9, plus:
'_', example: ANSI_X3.110-1983
'.', example: ANSI_X3.110-1983
'-', example: ANSI_X3.110-1983
':', example: ISO_5427:1981
'+', example: PC-MULTILINGUAL-850+EURO
'(', example: NF_Z_62-010_(1973)
')', example: NF_Z_62-010_(1973)
Notice that the last two are in violation of http://www.ietf.org/rfc/rfc2978.txt, and should be removed!
3. As Markus said, there should be a limit on the length of names. http://www.iana.org/assignments/character-sets gives one, but there are violations in that very file:
Name >40: Extended_UNIX_Code_Fixed_Width_for_Japanese
Name >40: Extended_UNIX_Code_Packed_Format_for_Japanese
Maximum Length 45.
So the name limit should be extended to accommodate those.
4. The file in http://www.iana.org/assignments/character-sets is rather clumsy to parse.
a. One has to key off of "-------------" at the start of line to know when to start parsing, and "REFERENCES" to know when to stop. (And if these are not invariants, then parsers may have to change over time!).
b. The exact format of the file is not described.
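To show how fragile this is, here is roughly what a parser has to do today, assuming only the two markers mentioned in 4a. The function name data_region and the sample lines are made up for the illustration; the real file's layout is not described anywhere, which is exactly the problem.

```python
from typing import Iterable, List


def data_region(lines: Iterable[str]) -> List[str]:
    """Return the lines after the first dashed rule and before 'REFERENCES'."""
    region = []
    started = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not started:
            # Everything up to and including the dashed rule is header text.
            started = line.startswith("-------------")
            continue
        if line.startswith("REFERENCES"):
            break
        region.append(line)
    return region


# Hypothetical stand-in for the file contents.
sample = [
    "Some header text",
    "-------------",
    "Name: Example-1",
    "Alias: EX1",
    "REFERENCES",
    "[REF1] ...",
]
print(data_region(sample))
```

If either marker is ever reworded, this silently returns nothing or runs into the reference list.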
5. So I'd like to sum up the results of this discussion with a concrete proposal. In http://www.iana.org/assignments/character-sets,
A. Replace the text:
The character set names may be up to 40 characters taken from the
printable characters of US-ASCII. However, no distinction is made
between use of upper and lower case letters.
with the new text:
Constraints on Registered Names and Aliases
The character set names may be up to 45 characters taken from the printable characters of US-ASCII. As per RFC 2978 no distinction is made between use of upper and lower case letters. While more punctuation characters are permitted by RFC 2978, only the following should be used:
0x2B '+' PLUS SIGN
0x2D '-' HYPHEN-MINUS
0x2E '.' FULL STOP
0x3A ':' COLON
0x5F '_' LOW LINE
In addition, two strings are considered to conflict if they are identical after uppercasing them, then removing all characters except A-Z and 0-9, and then removing all leading zeros (zeros not preceded by a digit). No new names or aliases will be accepted for registration that conflict with existing names or aliases, except where they conflict only with the name or aliases of the same entry. For example, "IBM-037" is acceptable as an alias for "IBM037", but "roman08" is not acceptable as an alias for "macintosh" because it would conflict with "roman8", which is an existing alias for "hp-roman8".
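The constraints in part A could be checked mechanically at registration time. The following Python sketch is only an illustration of the proposed rules; check_label, the registry layout (a dict from each name to its list of aliases) and the wording of the messages are assumptions, not part of the proposal.

```python
import re
from typing import Dict, List

MAX_LEN = 45
# Letters, digits, and the five punctuation characters listed above.
ALLOWED = re.compile(r"[A-Za-z0-9+\-.:_]+")


def fold(label: str) -> str:
    """Uppercase, keep only A-Z and 0-9, then drop zeros not preceded by a digit."""
    kept = re.sub(r"[^A-Z0-9]", "", label.upper())
    return re.sub(r"(?<![0-9])0+", "", kept)


def check_label(label: str, owner: str, registry: Dict[str, List[str]]) -> List[str]:
    """Return the reasons `label` cannot be registered for `owner` (empty if none)."""
    problems = []
    if len(label) > MAX_LEN:
        problems.append(f"longer than {MAX_LEN} characters")
    if not ALLOWED.fullmatch(label):
        problems.append("contains characters outside the recommended set")
    key = fold(label)
    for name, aliases in registry.items():
        if name == owner:
            continue  # matches within the same entry are permitted
        for existing in [name, *aliases]:
            if fold(existing) == key:
                problems.append(f"conflicts with {existing!r} registered under {name!r}")
    return problems


# Illustrative registry built from the examples in the text.
registry = {"IBM037": [], "macintosh": [], "hp-roman8": ["roman8"]}
print(check_label("IBM-037", "IBM037", registry))
print(check_label("roman08", "macintosh", registry))
```

The first call reports no problems, since "IBM-037" only matches the name it is being added to; the second reports the conflict with "roman8".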
B. Start the data with "@START_DATA" and end it with "@END_DATA". Add documentation in the header:
This file is designed to be machine-readable. The data starts with the line "@START_DATA" and ends with the line "@END_DATA". Each line of data is of the form:
<tag> ":" <space> value1 <space>+ value2 <space>+ value3
or is a continuation line, starting with <space>. The values are interpreted according to the tags, as follows:
Tag       Values
Name:     value1 is the name
          value2 is either blank, "(preferred MIME name)" or "[" <reference> "]"
          value3 is either blank, or "[" <reference> "]"
Alias:    value1 is the alias
          value2 is either blank, or "(preferred MIME name)"
MIBenum:  value1 is a number, described above
Source:   value1 is descriptive text. This is the only entry that can have continuation lines.
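With the format in part B, a reader becomes short and stable. Below is a minimal Python sketch of one, under a few assumptions that the text above does not settle: blank lines inside the data are skipped, value1 of a Name or Alias line contains no spaces, and parse_data and the sample entry are invented for the illustration.

```python
import re
from typing import Iterable, List, Tuple

LINE = re.compile(r"(Name|Alias|MIBenum|Source): (.*)")
# A value is a "(...)" note, a "[...]" reference, or a run of non-space characters.
VALUE = re.compile(r"\([^)]*\)|\[[^\]]*\]|\S+")


def parse_data(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Return (tag, values) pairs found between @START_DATA and @END_DATA."""
    entries: List[Tuple[str, List[str]]] = []
    in_data = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "@START_DATA":
            in_data = True
            continue
        if line == "@END_DATA":
            break
        if not in_data or not line.strip():
            continue
        if line.startswith(" "):
            # Continuation line: only a Source entry may be continued.
            if not entries or entries[-1][0] != "Source":
                raise ValueError(f"unexpected continuation line: {line!r}")
            tag, values = entries[-1]
            entries[-1] = (tag, [values[0] + " " + line.strip()])
            continue
        match = LINE.fullmatch(line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        tag, rest = match.groups()
        values = [rest.strip()] if tag == "Source" else VALUE.findall(rest)
        entries.append((tag, values))
    return entries


# Made-up entry, used only to exercise the parser.
sample = """header text
@START_DATA
Name: Example-1  [REF1]
MIBenum: 9999
Source: Descriptive text that
  continues here.
Alias: EX1 (preferred MIME name)
@END_DATA
trailer
""".splitlines()
for entry in parse_data(sample):
    print(entry)
```

Text before "@START_DATA" and after "@END_DATA" is ignored, so the header and reference list can change freely without breaking the parser.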
C. Remove the alias: NF_Z_62-010_(1973)
Mark
___
mark.davis@us.ibm.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799