RE: A spec for showing language in MIME headers
David writes:
> Languages do evolve. Mainly in idioms but also pronunciations &c.
> Therefore a system for tagging such things should most definitely have a
> place for placing markings as to the "version" of the language. Calling it
> "old" versus "middle" probably is not sufficient, depending on what level of
> detail you want to support. ...
The rejected ISO draft for three-letter codes included some
historical language codes:
dum Dutch, Middle (ca. 1050-1350)
egy Egyptian (Ancient)
enm English, Middle (1100-1500)
ang English, Old (ca. 450-1100)
frm French, Middle (ca. 1400-1600)
fro French, Old (842-ca. 1400)
gmh German, Middle High (ca. 1050-1500)
goh German, Old High (ca. 750-1050)
grc Greek, Ancient (to 1453)
sga Irish, Old (to 900)
mga Irish, Middle (900-1200)
lat Latin
non Norse, Old
peo Persian, Old (ca. 600-400 B.C.)
phn Phoenician
pro Provençal, Old (to 1500)
san Sanskrit
ota Turkish, Ottoman (1500-1928)
If somebody feels a strong need for making such distinctions,
it should be possible to register language variant codes or,
in some cases, new language codes with IANA.
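For anyone who wants to experiment with these codes, the list above
can be kept as a simple lookup table. This is only a sketch: the
names HISTORICAL_CODES and describe are my own inventions for
illustration, and the entries are just the ones quoted in this
message, not a complete or official registry.

```python
# Historical language codes as quoted in this message (rejected ISO draft).
# Maps a three-letter code to (name, period); period is None where the
# draft list gave no dates. HISTORICAL_CODES and describe are hypothetical
# names chosen for this sketch.
HISTORICAL_CODES = {
    "dum": ("Dutch, Middle", "ca. 1050-1350"),
    "egy": ("Egyptian (Ancient)", None),
    "enm": ("English, Middle", "1100-1500"),
    "ang": ("English, Old", "ca. 450-1100"),
    "frm": ("French, Middle", "ca. 1400-1600"),
    "fro": ("French, Old", "842-ca. 1400"),
    "gmh": ("German, Middle High", "ca. 1050-1500"),
    "goh": ("German, Old High", "ca. 750-1050"),
    "grc": ("Greek, Ancient", "to 1453"),
    "sga": ("Irish, Old", "to 900"),
    "mga": ("Irish, Middle", "900-1200"),
    "lat": ("Latin", None),
    "non": ("Norse, Old", None),
    "peo": ("Persian, Old", "ca. 600-400 B.C."),
    "phn": ("Phoenician", None),
    "pro": ("Provençal, Old", "to 1500"),
    "san": ("Sanskrit", None),
    "ota": ("Turkish, Ottoman", "1500-1928"),
}


def describe(code):
    """Return a readable label for a code, matched case-insensitively."""
    name, period = HISTORICAL_CODES[code.lower()]
    return f"{name} ({period})" if period else name
```

An unknown code raises KeyError, which seems the right behavior for
a table that makes no claim to completeness.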
> To my knowledge choosing the right glyphs is driven by the character set.
Not always ...
> So what we need is a sufficient quantity of character sets so we can discuss
> old high germanic names in one paragraph, old english in the next, and
> russian after that. Where does the need for marking the languages come from?
No, you will not find any coded character set capable of
distinguishing between an Old High German "A", an Old English
"A", and a modern "A". This is _not_ a result of the imperfect
level of development of coded character sets, however.
Your mistake is a confusion of text representation levels:
1) In all existing coded character sets only the content of a
text is encoded. This is what you get if you use _plain text_:
Only those distinctions necessary to make the text legible are
coded.
2) To also keep such _rich text properties_ of text as
italicization, boldness, smaller or bigger character size,
language-correct choice of glyphs, correct hyphenation
behavior, you can't remain on the basic plain text level, but
must enter a higher rich text level.
3) In most existing rich text formats these text properties are
represented by some kind of mark-up of the plain text. This
is certainly the case for the SGML-based TEI encoding system
developed to meet the needs of linguists.
4) There are also very sound technical reasons for not including
a bit for each binary rich text property in the bit sequence
representing a character in a coded character set. These
properties, including language, very seldom vary from one
character to the next. They are constant for a chunk of the text,
sometimes of considerable length.
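Points 1) to 4) can be sketched in a few lines of code. The plain
text carries the same character code for every "A", and the language
is recorded once per chunk as a separate run over that text. This is
a minimal sketch under my own assumptions: LangRun, tag_runs and
lang_at are hypothetical names, and the tags "goh", "ang" and "en"
are used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LangRun:
    """One chunk of text sharing a language (hypothetical structure)."""
    start: int  # index of the first character in the run
    end: int    # index one past the last character in the run
    lang: str   # language tag applying to the whole run


def tag_runs(chunks):
    """Build the plain text plus its language runs from (lang, text) pairs."""
    parts, runs, pos = [], [], 0
    for lang, chunk in chunks:
        runs.append(LangRun(pos, pos + len(chunk), lang))
        parts.append(chunk)
        pos += len(chunk)
    return "".join(parts), runs


def lang_at(runs, index):
    """Look up the language of the character at a given index."""
    for run in runs:
        if run.start <= index < run.end:
            return run.lang
    raise IndexError(index)


# Three chunks, three languages, one and the same character code for "A".
text, runs = tag_runs([("goh", "A "), ("ang", "A "), ("en", "A")])
```

Here text[0], text[2] and text[4] are the identical coded character,
while lang_at(runs, 0), lang_at(runs, 2) and lang_at(runs, 4) differ.
The cost is one record per chunk rather than extra bits on every
character, which is the technical argument made in point 4).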
> >I _would_ support making the country code into something that should be used
> >only if it is absolutely necessary to disambiguate different usages of the
> >same language. e.g. French and French-Canadian which have different
> >capitalisation rules I believe. ...
> - - -
> Hmmm... I don't see this.
>
> Isn't capitalization done within the text? `a' is a different character
> code than `A' after all...
Not capitalization perhaps, but hyphenation rules may be
different, and they are important when text is displayed with a
different window width or font than that used originally.
/Olle