RE: Thoughts about characters transmission



I have no intention of starting a religious war, but I would like
to point out some technical difficulties with Otha's proposed IUTF.
Please refer to my earlier posting about NET-TEXT for an alternative
proposal.

> BTW, I now think that, if we are to use almost raw UTF2 as interim encoding
> without enough consideration to many languages with non-European characters,
> we should not use two octet UTF2 sequence beginning from T1. That is,
> represent all non-ASCII characters with three octet form of UTF2. Then,
> the two octet sequences are reserved for the future international
> assignment. Isn't it fair?

UTF-2 (as used by Plan 9; I don't have an X/Open reference) requires that
the *shortest* sequence be used for each code (although programs may not
check it). Always using the three octet form would therefore make your
coding incompatible with UTF-2.
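
To make the shortest-sequence rule concrete, here is a minimal Python
sketch. It assumes that UTF-2 uses the FSS-UTF bit layout (7, 11, 16, 21,
26 and 31 payload bits for lengths 1 to 6), which matches the lead-byte
ranges quoted below. The names encode, decode and is_shortest are mine,
not taken from any specification.

```python
# Exclusive upper bound of the code range for sequence lengths 1..6,
# assuming the FSS-UTF layout (an assumption, see the text above).
LIMITS = [1 << 7, 1 << 11, 1 << 16, 1 << 21, 1 << 26, 1 << 31]
# Lead-byte marker bits for sequence lengths 1..6.
LEAD = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode(code, length=None):
    """Encode one code; use the shortest length unless one is forced."""
    if length is None:
        length = next(
            (n for n, lim in enumerate(LIMITS, 1) if 0 <= code < lim), None
        )
        if length is None:
            raise ValueError("code out of range")
    if not 0 <= code < LIMITS[length - 1]:
        raise ValueError("code does not fit in the requested length")
    if length == 1:
        return bytes([code])
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))  # six payload bits per Tx byte
        code >>= 6
    return bytes([LEAD[length - 1] | code] + tail[::-1])


def decode(seq):
    """Decode one complete sequence without checking for shortest form."""
    length = len(seq)
    if length == 1:
        return seq[0]
    code = seq[0] & (0xFF >> (length + 1))  # payload bits of the lead byte
    for b in seq[1:]:
        code = (code << 6) | (b & 0x3F)
    return code


def is_shortest(seq):
    """True if seq is the shortest encoding of the code it represents."""
    return encode(decode(seq)) == bytes(seq)


two = encode(0xE9)              # shortest form, two octets
three = encode(0xE9, length=3)  # forced three octet form, as proposed
print(two.hex(), is_shortest(two))
print(three.hex(), is_shortest(three))
```

Both sequences decode to the same code, but only the two octet one
passes the shortest-form check, so a strict UTF-2 reader would be
entitled to reject the three octet form.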

I find your UTF-2 table

>         C0:0~32,127
>         A :33~126
>         Tx:128~191
>         T1:192~223
>         T2:224~239
>         T3:240~247
>         T4:248~251
>         T5:252~253
>         Ty:254~255(unused)

a bit strange. T5 = 1111110x contributes one bit and the five following
Tx bytes contribute six bits each, so only 31 bits are available: there
is no way to represent codes >= 2^31 (or maybe these don't occur
in ISO 10646; please enlighten me if this is the case).
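
The bit count can be checked mechanically from the quoted table. The
sketch below assumes that each lead class Tn is followed by n Tx bytes
and that every Tx byte carries six payload bits; the helper name
payload_bits is hypothetical.

```python
# (class name, first lead byte, last lead byte, number of Tx bytes),
# taken from the quoted UTF-2 table.
CLASSES = [
    ("T1", 192, 223, 1),
    ("T2", 224, 239, 2),
    ("T3", 240, 247, 3),
    ("T4", 248, 251, 4),
    ("T5", 252, 253, 5),
]


def payload_bits(lo, hi, tails):
    """Payload bits of a sequence: lead-byte bits plus 6 per Tx byte."""
    lead_values = hi - lo + 1               # always a power of two here
    lead_bits = lead_values.bit_length() - 1
    return lead_bits + 6 * tails


for name, lo, hi, tails in CLASSES:
    bits = payload_bits(lo, hi, tails)
    print(f"{name}: {bits} bits, codes below 2^{bits}")
```

The longest form (T5) comes out at 31 bits, so the largest
representable code is 2^31 - 1.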

Your IUTF table was

>         C0:0~32,127
>         A':33~46,48~126
>         C1:128~159
>         Tx:128~191
>         T1:192~223
>         T2:224~239(=S2+S3+S4+S6+S7)
>         S2:224~229
>         S3:230~235
>         S4:236~237
>         S6:238
>         S7:239
>         U1:240~255

I don't see the reason for introducing A'; could you explain please?

You proposed the extra sequences

>         T1 A'                   2976
>         T2 A'                   1488
>         U1 A'                   1488
>         U1 Tx                   1024
>         T1 T2                   512
>         T1 U1                   512
>         U1 T2                   256
>         S2 Tx A'                35712
>         S3 Tx A' Tx             >2^21
>         S4 Tx A' Tx Tx          >2^25
>         S6 Tx A' Tx Tx Tx Tx    >2^36
>         S7 Tx A' Tx Tx Tx Tx Tx >2^42

These sequences destroy the resynchronisation property. Consider what
happens if you land on an internal non-Tx byte: how would you know that
it was internal? E.g. consider

	T1 A'		and		T1 T2 A'

The "intended" parsing is

	[T1 A']		and		[T1 T2] [A']

but you could also parse them as

	... T1] [A']	and		... T1] [T2 A']
			and		... T1 T2] [A']
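
The same effect can be shown with a small Python sketch. It is a
variant of the example above that uses only two-byte sequences from the
quoted list (T1 U1, T2 A', U1 T2) and ignores the longer S2..S7 forms
for simplicity. The byte values and the names classify and parse are my
own choices for illustration.

```python
def classify(b):
    """Map a byte to its class in the quoted IUTF table."""
    if 33 <= b <= 126 and b != 47:
        return "A'"
    if 128 <= b <= 191:
        return "Tx"
    if 192 <= b <= 223:
        return "T1"
    if 224 <= b <= 239:
        return "T2"
    if 240 <= b <= 255:
        return "U1"
    return "other"  # C0 and the byte excluded from A'


# The proposed two-byte sequences (longer S2..S7 forms are left out).
PAIRS = {
    ("T1", "A'"), ("T2", "A'"), ("U1", "A'"), ("U1", "Tx"),
    ("T1", "T2"), ("T1", "U1"), ("U1", "T2"),
}


def parse(data, start):
    """Greedily split data[start:] into pairs or single bytes."""
    classes = [classify(b) for b in data]
    out = []
    i = start
    while i < len(data):
        if i + 1 < len(data) and (classes[i], classes[i + 1]) in PAIRS:
            out.append((classes[i], classes[i + 1]))
            i += 2
        else:
            out.append((classes[i],))
            i += 1
    return out


stream = bytes([0xC4, 0xF0, 0xE0, 0x41])  # classes: T1 U1 T2 A'
print(parse(stream, 0))  # reader that started at the true beginning
print(parse(stream, 1))  # reader that joined one byte late
```

A reader that starts at offset 0 sees [T1 U1] [T2 A'], while a reader
that joins at offset 1 sees [U1 T2] [A']. Both parsings consist only of
valid sequences, so the late reader has no way to notice that it is out
of step. In plain UTF-2 every non-Tx byte starts a sequence, so this
cannot happen.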

> Hash  tables  could be used for the fast translation from
> ICODE to IUTF for such shorthand notations.

This seems a bit too complex for the purpose.

--
Luc Rooijakkers                                 Internet: lwj@cs.kun.nl
SPC Group, the Netherlands                      UUCP: uunet!cs.kun.nl!lwj