
Re: ignore dashes etc. (was Registration of new charset GB18030 (fwd))



[changing subject line because there were 3 threads for the old one]

I would like to add some details to this for ICU.

As Mark said, the ICU code ignores dashes and case differences in encoding names.
In addition, it also ignores underscores and spaces.

The idea behind ignoring spaces was that users may type a name containing spaces, like "GB 18030", into a UI.
Dashes and underscores seem to be commonly ignored.
Spaces might be controversial.
We do not use spaces in our charset names; we only ignore them when matching.

I second the various proposals to make the IANA charset matching rules more lenient.

To make a complete proposal:

I propose a recommendation that charset names be matched while ignoring
the following:
- letter case differences (A=a, B=b, ... for A-Z and a-z)
- dashes '-'
- underscores '_'
- spaces ' '

For example, the following all match "gb18030":
     "GB 18030" "gB-18030" "Gb_18030" "_ -g b-1_8 0-3_0 -_"
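
To illustrate the proposed rule, here is a minimal sketch in Python. It is not the ICU code; the function names normalize_charset_name and charset_names_match are made up for this example. It assumes only ASCII case folding (A-Z to a-z) and drops only the three characters listed above.

```python
IGNORED = "-_ "  # dash, underscore, space, as in the proposal


def normalize_charset_name(name: str) -> str:
    """Drop ignored characters and fold ASCII A-Z to a-z.

    Hypothetical helper for illustration; not an ICU API.
    """
    out = []
    for ch in name:
        if ch in IGNORED:
            continue  # skip dashes, underscores and spaces
        if "A" <= ch <= "Z":
            ch = chr(ord(ch) + 32)  # ASCII-only lowercase
        out.append(ch)
    return "".join(out)


def charset_names_match(a: str, b: str) -> bool:
    """Compare two charset names under the lenient rule."""
    return normalize_charset_name(a) == normalize_charset_name(b)


if __name__ == "__main__":
    for candidate in ["GB 18030", "gB-18030", "Gb_18030", "_ -g b-1_8 0-3_0 -_"]:
        print(repr(candidate), charset_names_match(candidate, "gb18030"))
```

Folding only A-Z keeps the comparison independent of locale, which matters because charset names are plain ASCII labels.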

I can live without the spaces in this recommendation, although I think ignoring them could be useful and does no harm.
Spaces are not allowed in IANA charset names, so they can only occur in user-supplied names.

markus

Lars Marius Garshol wrote:

> * Martin Duerst
> | It may be possible to add a rule to the IANA registry that there
> | should be no registrations that only differ in hyphens or
> | underscores.
> 
> I think that would be a good idea. ...