
Re: some IANA registrations look like repertoires not charsets?



ned.freed@mrochek.com wrote:

> Assuming the intent really was to register repertoires seems like a 
> stretch to me.


I believe that is possible. I am trying to figure out what the intent was; I am not saying that we must assume right away that these names are not charsets.
The reference to ISO 10646 collections and IBM GCSGIDs, however, _suggests_ that these are just repertoires.


>> Without any specified encoding scheme, they would not qualify as 
>> charsets.
> 
> It isn't particularly relevant to the matter at hand, but the fact of the
> matter is that a charset doesn't require an encoding scheme. The
> requirement is instead that there be a mapping from octets to characters.
> Whether this is implemented by means of a CCS/CES pair or something else
> is up to the


An encoding scheme is nothing but an algorithm for going from bytes to characters. The statements "a charset doesn't require an encoding scheme" and "there be a mapping from octets to characters" are therefore contradictory.
Without an encoding scheme, there is no way to decode a byte stream.
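
To make this concrete, here is a small sketch in Python using only its standard codecs (the variable names are mine and purely illustrative). It takes one coded character set, Unicode/ISO 10646, and applies two different encoding schemes to the same code points. The resulting octets differ, and reading them back under another charset's mapping yields different characters:

```python
# One coded character set (Unicode / ISO 10646), two encoding schemes.
text = "Gr\u00fc\u00dfe"  # the word "Gruesse" written with u-umlaut and sharp s

# CCS level: abstract characters mapped to integers (code points).
code_points = [ord(ch) for ch in text]

# CES level: the same code points serialized to octets in two ways.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")

# Interpreting the UTF-8 octets with a different mapping (here ISO 8859-1)
# still "works" mechanically, but produces different characters.
wrong = utf8_bytes.decode("latin-1")

print(code_points)
print(utf8_bytes, utf16_bytes)
print(repr(wrong))
```

The repertoire and the code points are identical in both cases; only the scheme for turning them into octets differs, and that scheme is exactly what a receiver needs to know.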


> registration. Charsets like iso-2022-jp certainly don't consist of a single
> CCS/CES pair.


We all know that a number of charsets combine one CES with multiple CCSes. Without that CES you would not have a charset, though.
We could argue whether there is one CES with sub-CESes or a CES with CEFs (a little like debating the ISO/OSI model vs. the TCP/IP stack), but at a minimum you need that one lowest-level CES to dissect the byte stream into meaningful units.
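
As a rough illustration of one CES driving several CCSes, the following Python sketch encodes a mixed string with the standard library's "iso2022_jp" codec and looks for the escape sequences that switch between ASCII and JIS X 0208. The constant names are my own, and I am assuming the codec emits ESC $ B for JIS X 0208-1983, which is what I have observed:

```python
# A string mixing ASCII letters with two kanji (U+65E5, U+672C).
mixed = "A\u65e5\u672cB"

encoded = mixed.encode("iso2022_jp")

# Escape sequences of the ISO 2022 framework used by iso-2022-jp.
ESC_TO_JISX0208 = b"\x1b$B"  # designate JIS X 0208-1983
ESC_TO_ASCII = b"\x1b(B"     # designate ASCII

# The byte stream carries its own CCS switches.
has_switch_in = ESC_TO_JISX0208 in encoded
has_switch_out = ESC_TO_ASCII in encoded

# Only a decoder that knows the escape-sequence scheme can recover the text.
decoded = encoded.decode("iso2022_jp")

print(encoded, has_switch_in, has_switch_out, decoded == mixed)
```

Strip the escape-sequence scheme and the remaining octets are ambiguous: the two bytes after the switch could equally be read as two ASCII characters.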

Importantly, an IBM GCSGID does not even specify a CCS because it does not map abstract characters to any kind of codes.
(An ISO 10646 collection is a CCS, however.)

It is of course possible that the IANA character-sets list is supposed to list not only things that are "charsets" but also CCSes, CEFs, and repertoires.
If so, then please add clarifying text to the top of the list document, and an appropriate classification to at least the non-charset entries.


> More likely it was assumed the encoding was implied by the registration.


That would be good and valid, and I am trying to ascertain what encoding, if any, was implied.


> In any case, past attempts to clean up the registry haven't been 
> successful.
> And given that actual use of any of this junk is unlikely to exist, it
> hasn't proved to be sufficiently problematic to force the issue.


That is a sad statement. It puts a big disclaimer onto the IANA charset list that diminishes its value, in my opinion.

Best regards,

markus