RE: Thoughts about characters transmission
Although I have no intention of starting a religious war, I would like
to point out some technical difficulties with Otha's proposed IUTF.
Please refer to my earlier posting about NET-TEXT for an alternative
proposal.
> BTW, I now think that, if we are to use almost raw UTF2 as interim encoding
> without enough consideration to many languages with non-European characters,
> we should not use two octet UTF2 sequence beginning from T1. That is,
> represent all non-ASCII characters with three octet form of UTF2. Then,
> the two octet sequences are reserved for the future international
> assignment. Isn't it fair?
UTF-2 (as used by Plan 9; I don't have an X/Open reference) requires that
the *shortest* sequence be used (although programs may not check this),
so your proposal would make the coding incompatible with UTF-2.
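
To illustrate the shortest-form point, here is a minimal sketch in Python.
It assumes the payload sizes implied by the lead-byte ranges in the quoted
table (7, 11, 16, 21, 26 and 31 bits for one to six octets); the function
names are mine and purely illustrative, not taken from any UTF-2 library.

```python
# Payload bits carried by a UTF-2 sequence of n octets (assumed from the
# lead-byte ranges A/C0, T1..T5 in the quoted table).
PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21, 5: 26, 6: 31}


def shortest_length(code):
    """Return the smallest octet count able to hold `code`."""
    for n in range(1, 7):
        if code < (1 << PAYLOAD_BITS[n]):
            return n
    raise ValueError("code does not fit in 31 bits")


def encode(code, n):
    """Encode `code` in exactly n octets, even if a shorter form exists."""
    if code < 0 or code >= (1 << PAYLOAD_BITS[n]):
        raise ValueError("code does not fit in the requested length")
    if n == 1:
        return bytes([code])
    lead_mark = (0xFF << (8 - n)) & 0xFF  # n one-bits followed by a zero
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (code & 0x3F))  # each Tx byte carries 6 bits
        code >>= 6
    return bytes([lead_mark | code] + tail[::-1])


def decode(seq):
    """Decode one sequence without checking that it is the shortest form."""
    n = len(seq)
    if n == 1:
        return seq[0]
    code = seq[0] & (0x7F >> n)  # strip the length marker from the lead byte
    for b in seq[1:]:
        code = (code << 6) | (b & 0x3F)
    return code


def is_shortest_form(seq):
    """True if `seq` uses the minimum number of octets for its code."""
    return len(seq) == shortest_length(decode(seq))
```

With these definitions, encode(0xE9, 2) is a shortest-form sequence, while
encode(0xE9, 3) decodes to the same code but fails is_shortest_form(), which
is exactly the three-octet representation suggested above.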
I find your UTF-2 table
>         C0:0~32,127
>         A :33~126
>         Tx:128~191
>         T1:192~223
>         T2:224~239
>         T3:240~247
>         T4:248~251
>         T5:252~253
>         Ty:254~255(unused)
a bit strange. Consider that T5 = 1111110x contributes only 1 bit and the
five following Tx bytes contribute 6 bits each, 31 bits in total: there is
no way to represent codes >= 2^31 (or maybe these don't occur
in ISO 10646; please enlighten me if this is the case).
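
The bit counts can be checked mechanically. The following Python sketch
derives the payload of each multi-octet form from the lead-byte ranges in
the table above, assuming each Tx byte (128~191) carries 6 bits; the names
LEAD_CLASSES and payload_bits are my own.

```python
# Lead-byte classes from the quoted table: (first value, last value,
# number of Tx bytes that follow).
LEAD_CLASSES = {
    "T1": (192, 223, 1),
    "T2": (224, 239, 2),
    "T3": (240, 247, 3),
    "T4": (248, 251, 4),
    "T5": (252, 253, 5),
}


def payload_bits(name):
    """Total payload bits of a sequence starting with lead class `name`."""
    lo, hi, n_tail = LEAD_CLASSES[name]
    lead_bits = (hi - lo + 1).bit_length() - 1  # log2 of the range size
    return lead_bits + 6 * n_tail


for name in LEAD_CLASSES:
    print(name, payload_bits(name))
```

For T5 this gives 1 + 5*6 = 31 bits, so the largest representable code
is 2^31 - 1.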
Your IUTF table was
>         C0:0~32,127
>         A':33~46,48~126
>         C1:128~159
>         Tx:128~191
>         T1:192~223
>         T2:224~239(=S2+S3+S4+S6+S7)
>         S2:224~229
>         S3:230~235
>         S4:236~237
>         S6:238
>         S7:239
>         U1:240~255
I don't see the reason for introducing A'; could you explain please?
You proposed the extra sequences
>         T1 A'                   2976
>         T2 A'                   1488
>         U1 A'                   1488
>         U1 Tx                   1024
>         T1 T2                   512
>         T1 U1                   512
>         U1 T2                   256
>         S2 Tx A'                35712
>         S3 Tx A' Tx             >2^21
>         S4 Tx A' Tx Tx          >2^25
>         S6 Tx A' Tx Tx Tx Tx    >2^36
>         S7 Tx A' Tx Tx Tx Tx Tx >2^42
These sequences destroy the resynchronisation property. Consider what
happens if you land on a non-Tx byte in the middle of a sequence: how
would you know that it was internal? E.g. consider
	T1 A'		and		T1 T2 A'
The "intended" parsing is
	[T1 A']		and		[T1 T2] [A']
but you could also parse them as
	... T1] [A']	and		... T1] [T2 A']
			and		... T1 T2] [A']
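
The ambiguity can be stated as a simple test: resynchronisation works only
if no byte class can appear both as the first byte of a sequence and as a
later byte. The Python sketch below applies that test to plain UTF-2 and to
the two-octet IUTF sequences quoted above. It assumes that an A' byte also
stands alone as an ordinary character (as in the parsings shown); the
function name is mine.

```python
# Sequences written as tuples of byte classes.
UTF2 = [
    ("C0",), ("A",),
    ("T1", "Tx"),
    ("T2", "Tx", "Tx"),
    ("T3", "Tx", "Tx", "Tx"),
    ("T4", "Tx", "Tx", "Tx", "Tx"),
    ("T5", "Tx", "Tx", "Tx", "Tx", "Tx"),
]

# Two-octet sequences from the proposal, plus a standalone A' (assumption).
IUTF_EXTRA = [
    ("A'",),
    ("T1", "A'"), ("T2", "A'"), ("U1", "A'"), ("U1", "Tx"),
    ("T1", "T2"), ("T1", "U1"), ("U1", "T2"),
]


def ambiguous_classes(sequences):
    """Classes that occur both as a first byte and as a non-first byte."""
    starts = {seq[0] for seq in sequences}
    inner = {cls for seq in sequences for cls in seq[1:]}
    return sorted(starts & inner)


print(ambiguous_classes(UTF2))        # no class is both start and internal
print(ambiguous_classes(IUTF_EXTRA))  # A', T2 and U1 are both
```

In plain UTF-2 the result is empty, so a reader dropped at an arbitrary
octet only has to skip Tx bytes to find a boundary. With the extra
sequences, A', T2 and U1 can each be either a first or a later byte, which
is why the parsings above cannot be told apart.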
> Hash  tables  could be used for the fast translation from
> ICODE to IUTF for such shorthand notations.
This seems a bit too complex for the purpose.
--
Luc Rooijakkers                                 Internet: lwj@cs.kun.nl
SPC Group, the Netherlands                      UUCP: uunet!cs.kun.nl!lwj