Re: shift_jis / windows-31J





On 2010/11/17 8:12, NARUSE, Yui wrote:
> (2010/11/17 8:00), Bjoern Hoehrmann wrote:
>> * Anne van Kesteren wrote:
>>> It would only break if the content they consumed contained code points
>>> that mapped to invalid characters and they relied on that, though.
>>> And the
>>> content was not relying on being mapped to the superset mapping of
>>> Windows-31J instead, which seems far more likely given the dominance of
>>> the Web and Windows.
>>
>> The prime example problem with shift_jis is the ambiguity of the octet
>> 0x5C which maps to a backslash for some and to the yen sign for others.
>> As far as I am aware, 0x5C is not invalid and this particular problem
>> is not a matter of supersets and subsets, you get 0x5C and you do not
>> know whether you should interpret it as yen sign or backslash. And it's
>> not going to change, systems built around one interpretation will use
>> that interpretation, systems built around the other interpretation will
>> stick with their interpretation as well. If you have two web services
>> that exchange data they may be running on Windows and on the Web, but
>> they may not be using the Windows/Browser/whatever interpretation.
>
> In practice, 0x5C in Shift-JIS is U+005C but yen sign glyph.

Well, but first, please note that it's Shift_JIS, not Shift-JIS (it's 
very easy to make that mistake).

As another example, with gcc on Cygwin, 0x5C in Shift_JIS is mapped to 
something other than U+005C. If you want an SJIS program to compile, you 
have to say something like

gcc -finput-charset=CP932

If you say gcc -finput-charset=Shift_JIS, all your \n and similar 
escapes produce errors.

That's different in Ruby, where we always map the 0x00-0x7F range 
straight to U+0000-U+007F, so that things work out on a syntax level. 
(I guess we wouldn't do that for ISO 646 related encodings, of course.) 
But there is still a difference between Shift_JIS (no private use 
characters) and Windows-31J (the Microsoft variant). Ruby is actually 
quite strict about this: if one string is labeled Shift_JIS and the 
other is labeled Windows-31J, they don't compare as equal even if their 
bytes are identical.
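
The Ruby behavior described above can be sketched roughly as follows. 
This is only an illustration, assuming Ruby 1.9 or later; the helper 
name `labeled` and the sample bytes are made up for the example and are 
not from any library.

```ruby
# Hypothetical helper (an assumption for this sketch): tag raw bytes with
# an encoding label without changing the bytes themselves.
def labeled(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name)
end

# 0x82 0xA0 is HIRAGANA LETTER A in both Shift_JIS and Windows-31J.
kana = "\x82\xA0".b
sjis = labeled(kana, "Shift_JIS")
w31j = labeled(kana, "Windows-31J")

puts sjis.bytes == w31j.bytes                      # true: identical bytes
puts sjis == w31j                                  # false: labels differ
puts sjis.encode("UTF-8") == w31j.encode("UTF-8")  # true: same character

# 0x5C is mapped straight to U+005C (backslash) under either label.
puts labeled("\x5C".b, "Shift_JIS").encode("UTF-8") == "\\"    # true
puts labeled("\x5C".b, "Windows-31J").encode("UTF-8") == "\\"  # true

# Strings that contain only ASCII bytes are treated as comparable,
# so here the differing labels do not matter.
puts labeled("abc", "Shift_JIS") == labeled("abc", "Windows-31J")  # true
```

So the strictness applies once a string contains non-ASCII bytes; 
transcoding both sides to a common encoding such as UTF-8 before 
comparing is one way around it.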

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp