
RE: A spec for showing language in MIME headers



David writes:

> Languages do evolve.  Mainly in idioms but also pronunciations &c. 
> Therefore a system for tagging such things should most definitely have a
> place for placing markings as to the "version" of the language.  Calling it
> "old" versus "middle" probably is not sufficient, depending on what level of
> detail you want to support. ...

The rejected ISO draft for three-letter codes included some
historical language codes:

   dum   Dutch, Middle (ca. 1050-1350)
   egy   Egyptian (Ancient)
   enm   English, Middle (1100-1500)
   ang   English, Old (ca. 450-1100)
   frm   French, Middle (ca. 1400-1600)
   fro   French, Old (842-ca. 1400)
   gmh   German, Middle High (ca. 1050-1500)
   goh   German, Old High (ca. 750-1050)
   grc   Greek, Ancient (to 1453)
   sga   Irish, Old (to 900)
   mga   Irish, Middle (900-1200)
   lat   Latin
   non   Norse, Old
   peo   Persian, Old (ca. 600-400 B.C.)
   phn   Phoenician
   pro   Provençal, Old (to 1500)
   san   Sanskrit
   ota   Turkish, Ottoman (1500-1928)   

If somebody feels a strong need to make such distinctions, it 
should be possible to register language variant codes or, in 
some cases, new language codes with IANA.
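
As a rough illustration only, the historical codes listed above 
could be held in a small lookup table. This is a sketch in Python; 
the names HISTORICAL_CODES and describe are my own and are not 
part of the ISO draft or of any registry.

```python
# Codes and periods copied from the list in this message.
# None means the list gives no date range for that code.
HISTORICAL_CODES = {
    "dum": ("Dutch, Middle", "ca. 1050-1350"),
    "egy": ("Egyptian (Ancient)", None),
    "enm": ("English, Middle", "1100-1500"),
    "ang": ("English, Old", "ca. 450-1100"),
    "frm": ("French, Middle", "ca. 1400-1600"),
    "fro": ("French, Old", "842-ca. 1400"),
    "gmh": ("German, Middle High", "ca. 1050-1500"),
    "goh": ("German, Old High", "ca. 750-1050"),
    "grc": ("Greek, Ancient", "to 1453"),
    "sga": ("Irish, Old", "to 900"),
    "mga": ("Irish, Middle", "900-1200"),
    "lat": ("Latin", None),
    "non": ("Norse, Old", None),
    "peo": ("Persian, Old", "ca. 600-400 B.C."),
    "phn": ("Phoenician", None),
    "pro": ("Provençal, Old", "to 1500"),
    "san": ("Sanskrit", None),
    "ota": ("Turkish, Ottoman", "1500-1928"),
}


def describe(code):
    """Return a readable label for a three-letter code (case-insensitive)."""
    name, period = HISTORICAL_CODES[code.lower()]
    return f"{name} ({period})" if period else name
```

A code outside the table raises KeyError, which is where a 
separately registered variant or new code would have to be added.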

> To my knowledge choosing the right glyphs is driven by the character set.

Not always ...

> So what we need is a sufficient quantity of character sets so we can discuss
> old high germanic names in one paragraph, old english in the next, and
> russian after that.  Where does the need for marking the languages come from?

No, you will not find any coded character set capable of 
distinguishing between an Old High German "A", an Old English 
"A", and a modern "A". This is _not_ a result of the imperfect 
level of development of coded character sets, however.

Your mistake is a confusion of text representation levels:

1) In all existing coded character sets only the content of a 
   text is encoded. This is what you get if you use _plain text_:
   Only those distinctions necessary to make the text legible are
   coded.

2) To also keep such _rich text properties_ of text as 
   italicization, boldness, smaller or bigger character size, 
   language-correct choice of glyphs, correct hyphenation 
   behavior, you can't remain on the basic plain text level, but 
   must enter a higher rich text level.

3) In most existing rich text formats these text properties are 
   represented by some kind of mark-up of the plain text. This 
   is certainly the case for the SGML-based TEI encoding system 
   developed to meet the needs of linguists.

4) There are also very sound technical reasons for not including 
   a bit for each binary rich text property in the bit sequence 
   representing a character in a coded character set. These 
   properties, including language, very seldom vary from one 
   character to the next. They are constant for a chunk of the 
   text, sometimes of considerable length.
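
The separation of levels described in points 1 to 4 can be 
sketched roughly as follows, in Python. The plain text holds only 
character codes, and the language is recorded once per run of 
text rather than once per character. The names LangRun, build and 
language_at are hypothetical, and "eng" for modern English is my 
assumption, since it is not in the list above. Real mark-up such 
as TEI is of course far richer than this.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LangRun:
    start: int  # index of the first character in the run
    end: int    # index one past the last character in the run
    lang: str   # language code that applies to the whole run


def build(segments: List[Tuple[str, str]]) -> Tuple[str, List[LangRun]]:
    """Split (lang, text) segments into plain text plus language runs."""
    parts: List[str] = []
    runs: List[LangRun] = []
    pos = 0
    for lang, chunk in segments:
        parts.append(chunk)
        runs.append(LangRun(pos, pos + len(chunk), lang))
        pos += len(chunk)
    return "".join(parts), runs


def language_at(runs: List[LangRun], index: int) -> Optional[str]:
    """Return the language of the character at index, or None if untagged."""
    for run in runs:
        if run.start <= index < run.end:
            return run.lang
    return None


# Three "A"s: identical character codes, distinguished only by the runs.
plain, runs = build([("goh", "A"), ("ang", "A"), ("eng", "A")])
```

Here plain contains the same character three times, and only the 
runs say which "A" is Old High German, which is Old English and 
which is modern. A long paragraph in one language costs a single 
run, not one tag per character.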

> >I _would_ support making the country code into something that should be used
> >only if it is absolutely necessary to disambiguate different usages of the
> >same language.  e.g. French and French-Canadian which have different
> >capitalisation rules I believe. ...
> - - -
> Hmmm...  I don't see this.
> 
> Isn't capitalization done within the text?  `a' is a different character
> code than `A' after all...

Not capitalization perhaps, but hyphenation rules may be 
different, and they are important when text is displayed with a 
different window width or font than that used originally.

/Olle