RE: Thoughts about characters transmission
Although I have no intention of starting a religious war, I would like
to point out some technical difficulties with Otha's proposed IUTF.
Please refer to my earlier posting about NET-TEXT for an alternative
proposal.
> BTW, I now think that, if we are to use almost raw UTF2 as interim encoding
> without enough consideration to many languages with non-European characters,
> we should not use two octet UTF2 sequence beginning from T1. That is,
> represent all non-ASCII characters with three octet form of UTF2. Then,
> the two octet sequences are reserved for the future international
> assignment. Isn't it fair?
UTF-2 (as used by Plan 9; I don't have an X/Open reference) requires that
the *shortest* sequence be used (although programs may not check this),
so encoding every non-ASCII character in three octets would make your
coding incompatible with UTF-2.
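
To illustrate the shortest-sequence rule, here is a minimal Python sketch. It assumes the UTF-2 bit layout as I understand it from Plan 9 (7, 11, 16, 21, 26 and 31 payload bits for one to six octets); the names `shortest_length`, `encode` and `is_shortest` are made up for this example and come from no specification.

```python
# Exclusive upper bounds on code values for 1..6 octets, assuming the
# UTF-2 layout with 7, 11, 16, 21, 26 and 31 payload bits.
LIMITS = [1 << 7, 1 << 11, 1 << 16, 1 << 21, 1 << 26, 1 << 31]

def shortest_length(code):
    """Return the smallest number of octets that can hold `code`."""
    for n, limit in enumerate(LIMITS, start=1):
        if code < limit:
            return n
    raise ValueError("code needs more than 31 bits")

def encode(code, length=None):
    """Encode `code`; passing `length` forces a possibly overlong form."""
    n = shortest_length(code) if length is None else length
    if code < 0 or code >= LIMITS[n - 1]:
        raise ValueError("code does not fit in the requested length")
    if n == 1:
        return bytes([code])
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (code & 0x3F))  # trailing octet 10xxxxxx
        code >>= 6
    lead = ((0xFF << (8 - n)) & 0xFF) | code  # n leading one bits
    return bytes([lead] + tail[::-1])

def is_shortest(code, length):
    """True if `length` octets is the form UTF-2 requires for `code`."""
    return length == shortest_length(code)
```

For code 0xE9, `encode(0xE9)` gives the two-octet form C3 A9, while `encode(0xE9, 3)` gives E0 83 A9. The latter is what the three-octet-only rule would emit, and it is not the shortest form.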
I find your UTF-2 table
> C0:0~32,127
> A :33~126
> Tx:128~191
> T1:192~223
> T2:224~239
> T3:240~247
> T4:248~251
> T5:252~253
> Ty:254~255(unused)
a bit strange. Consider that T5 = 1111110x leaves one free bit and the
five following Tx bytes carry six bits each, so only 1 + 30 = 31 bits are
available: there is no way to represent codes >= 2^31 (or maybe these
don't occur in ISO 10646; please enlighten me if this is the case).
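
The bit budget can be recomputed from the quoted table. This is only a sketch: it assumes each lead byte of class Tn is followed by n Tx bytes of six payload bits each, and the names `TABLE` and `payload_bits` are invented for the example.

```python
# (class, first byte value, last byte value, number of trailing Tx bytes),
# with the ranges taken from the quoted UTF-2 table.
TABLE = [
    ("T1", 192, 223, 1),
    ("T2", 224, 239, 2),
    ("T3", 240, 247, 3),
    ("T4", 248, 251, 4),
    ("T5", 252, 253, 5),
]

def payload_bits(lo, hi, trailing):
    """Free bits in the lead byte plus six bits per trailing Tx byte."""
    lead_values = hi - lo + 1                 # always a power of two here
    lead_bits = lead_values.bit_length() - 1  # log2 of that power of two
    return lead_bits + 6 * trailing

BITS = {name: payload_bits(lo, hi, n) for name, lo, hi, n in TABLE}
MAX_CODE = (1 << max(BITS.values())) - 1
```

Under these assumptions T5 yields 31 bits, so the largest representable code is 2^31 - 1.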
Your IUTF table was
> C0:0~32,127
> A':33~46,48~126
> C1:128~159
> Tx:128~191
> T1:192~223
> T2:224~239(=S2+S3+S4+S6+S7)
> S2:224~229
> S3:230~235
> S4:236~237
> S6:238
> S7:239
> U1:240~255
I don't see the reason for introducing A'; could you explain please?
You proposed the extra sequences
> T1 A' 2976
> T2 A' 1488
> U1 A' 1488
> U1 Tx 1024
> T1 T2 512
> T1 U1 512
> U1 T2 256
> S2 Tx A' 35712
> S3 Tx A' Tx >2^21
> S4 Tx A' Tx Tx >2^25
> S6 Tx A' Tx Tx Tx Tx >2^36
> S7 Tx A' Tx Tx Tx Tx Tx >2^42
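
The quoted counts follow from the class sizes in the IUTF table. A minimal Python sketch that recomputes them (the names `RANGES`, `size` and `count` are invented for this check):

```python
from math import prod

# Byte ranges for each class, copied from the quoted IUTF table.
RANGES = {
    "A'": [(33, 46), (48, 126)],
    "Tx": [(128, 191)],
    "T1": [(192, 223)],
    "T2": [(224, 239)],
    "U1": [(240, 255)],
    "S2": [(224, 229)],
    "S3": [(230, 235)],
    "S4": [(236, 237)],
    "S6": [(238, 238)],
    "S7": [(239, 239)],
}

def size(cls):
    """Number of byte values in a class."""
    return sum(hi - lo + 1 for lo, hi in RANGES[cls])

def count(pattern):
    """Number of distinct byte sequences matching a pattern like "T1 A'"."""
    return prod(size(cls) for cls in pattern.split())
```

With these ranges A' has 93 values, so `count("T1 A'")` is 32 * 93 = 2976 and `count("S2 Tx A'")` is 6 * 64 * 93 = 35712, matching the figures quoted above.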
These sequences destroy the resynchronisation property. Consider what
happens if you enter the stream at an internal non-Tx byte: how would
you know that it was internal? E.g. consider
    T1 A'            and   T1 T2 A'
The "intended" parsing is
    [T1 A']          and   [T1 T2] [A']
but you could also parse them as
    ... T1] [A']     and   ... T1] [T2 A']
                     and   ... T1 T2] [A']
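
One way to see the problem is to list the byte classes that can occur both as the first octet of a sequence and as a later octet. This is a rough sketch: it assumes A (or A') bytes also stand alone as ordinary ASCII characters, and the names `UTF2`, `IUTF_EXTRA` and `ambiguous_classes` are invented here.

```python
# Plain UTF-2 multi-octet sequences: only Tx ever follows a lead byte.
UTF2 = ["T1 Tx", "T2 Tx Tx", "T3 Tx Tx Tx",
        "T4 Tx Tx Tx Tx", "T5 Tx Tx Tx Tx Tx"]

# The extra IUTF sequences quoted above.
IUTF_EXTRA = ["T1 A'", "T2 A'", "U1 A'", "U1 Tx", "T1 T2", "T1 U1",
              "U1 T2", "S2 Tx A'", "S3 Tx A' Tx", "S4 Tx A' Tx Tx",
              "S6 Tx A' Tx Tx Tx Tx", "S7 Tx A' Tx Tx Tx Tx Tx"]

def ambiguous_classes(sequences, standalone):
    """Classes usable both as a first octet and as a later octet."""
    leads = set(standalone)  # single-octet characters start a sequence too
    inner = set()
    for seq in sequences:
        first, *rest = seq.split()
        leads.add(first)
        inner.update(rest)
    return leads & inner
```

For plain UTF-2, `ambiguous_classes(UTF2, ["A"])` is empty, so any non-Tx byte marks a character boundary. For the extra sequences, `ambiguous_classes(IUTF_EXTRA, ["A'"])` contains A', T2 and U1, and a reader entering the stream at one of those bytes cannot tell whether it starts a character.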
> Hash tables could be used for the fast translation from
> ICODE to IUTF for such shorthand notations.
This seems a bit too complex for the purpose.
--
Luc Rooijakkers Internet: lwj@cs.kun.nl
SPC Group, the Netherlands UUCP: uunet!cs.kun.nl!lwj