This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: The C locale


2009/9/24 Corinna Vinschen <corinna-cygwin@cygwin.com>:
> On Sep 24 16:03, IWAMURO Motonori wrote:
>> 2009/9/22 Andy Koppe <andy.koppe@gmail.com>:
>> > Let's use the Windows "ANSI" codepage as the character set for the C
>> > locale, for both the conversion functions and filenames. This means
>> > CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
>> > ones, and so on.
>>
>> I oppose using the ANSI codepage for the C locale, because CP932
>> (the Japanese codepage) is hostile to UNIX-like tools.
>>
>> The reason is that CP932 byte sequences contain many ASCII
>> metacharacters, as the following pattern shows.
>>
>>   single character of CP932:
>> /[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/
>
> I don't understand.  Are you saying that a single character in CP932
> consists of 12 bytes?  As far as I can see, CP932 is S-JIS, which
> is just a simple double-byte character set.  What am I missing?

- CP932 (Shift_JIS) has both 1-byte and 2-byte characters.

- A 1-byte character is in the range 0x00-0x7F or 0xA0-0xDF.

- The first byte of a 2-byte character is in the range 0x81-0x9F or
  0xE0-0xFC.

- The second byte of a 2-byte character is in the range 0x40-0x7E or
  0x80-0xFC. This range includes "[", "\", "]", "^", "`", "{", "|",
  and "}".
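
The ranges above can be sketched in Python as follows. This is only
an illustration, not code from any Cygwin tool: the helper names
(is_lead, split_cp932, risky_chars) are made up for this example, the
lead-byte range is taken as 0x81-0x9F and 0xE0-0xFC to match the
pattern quoted earlier, and Python's built-in "cp932" codec is assumed
to be an adequate stand-in for the Windows codepage.

```python
# ASCII metacharacters that can appear as the second byte of a
# 2-byte CP932 character (the list given in the message above).
META = b"[\\]^`{|}"

def is_lead(b):
    """True if byte value b starts a 2-byte CP932 character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def split_cp932(data):
    """Split a CP932 byte string into 1-byte and 2-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        step = 2 if is_lead(data[i]) and i + 1 < len(data) else 1
        chars.append(data[i:i + step])
        i += step
    return chars

def risky_chars(text):
    """Return (character, metacharacter) pairs for every 2-byte
    character whose second byte is an ASCII metacharacter."""
    found = []
    for ch in split_cp932(text.encode("cp932")):
        if len(ch) == 2 and ch[1] in META:
            found.append((ch.decode("cp932"), chr(ch[1])))
    return found

print(risky_chars("項目表.xls"))
```

A tool that walks the bytes with something like split_cp932 never
mistakes a second byte for a metacharacter; a tool that scans byte by
byte does.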

The last fact causes many problems in tools that ignore the locale
when they process escaped strings, globs, or regular expressions:

- Files or directories cannot be opened.
- Filenames are corrupted.
- Files are lost.

For example:

Case 1: The CP932 byte sequence of "項目表.xls" is 8D 80 96 DA 95 *5C*
(== '\') 2E 78 6C 73. When a tool that ignores the locale processes
backslash escapes in this string, the 0x5C byte disappears.

Case 2: When I use the regexp /スポット/, I expect it to match strings
that contain "スポット". But tools that ignore the locale treat it as
/ス\x83|ット/, because the byte sequence of "スポット" is 83 58 83
*7C* (== '|') 83 62 83 67. As a result, unexpected strings are
matched.
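
Case 2 can be reproduced with Python's re module, using a bytes
pattern as a stand-in for a tool that ignores the locale. The word is
the katakana string whose CP932 bytes are 83 58 83 7C 83 62 83 67; the
two extra test strings are my own examples, not from a bug report.

```python
import re

word = "スポット"
pattern_bytes = word.encode("cp932")      # 83 58 83 7C 83 62 83 67

# Byte-wise regexp: 0x7C is read as '|', so the pattern becomes
# two alternatives, 83 58 83 and 83 62 83 67.
locale_blind = re.compile(pattern_bytes)

# Character-wise regexp: the pattern is four literal characters.
locale_aware = re.compile(word)

for text in ["スポット", "ロボット", "スタート"]:
    blind = locale_blind.search(text.encode("cp932"))
    aware = locale_aware.search(text)
    print(text, "blind:", bool(blind), "aware:", bool(aware))
```

Only the first string contains the word, but the byte-wise pattern
also matches the other two.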

Case 3: When I use the glob "データ0[0-9].dat", it is treated as
"デ\x81[\x83^0[0-9].dat". As a result, the expected files are not
matched.
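
Case 3 can be sketched with Python's fnmatch module, again using
bytes as a stand-in for a tool that ignores the locale. The pattern
used here is "データ0[0-9].dat": the second and third characters
follow from the bytes 81 5B and 83 5E shown in Case 3, while the first
character and the file name are assumptions for illustration.
blind_match is a hypothetical helper name.

```python
import fnmatch
import warnings

pattern = "データ0[0-9].dat"   # first character is an assumption
name = "データ01.dat"          # a file the user expects to match

def blind_match(name_bytes, pattern_bytes):
    """Match byte by byte, so a 0x5B second byte opens a bracket
    expression. Warnings about the odd bracket contents are ignored."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return fnmatch.fnmatchcase(name_bytes, pattern_bytes)

# Character-wise matching finds the file.
print(fnmatch.fnmatchcase(name, pattern))

# Byte-wise matching on the CP932 bytes does not.
print(blind_match(name.encode("cp932"), pattern.encode("cp932")))

# Instead, it accepts an unrelated byte sequence.
print(blind_match(b"\x83f\x815.dat", pattern.encode("cp932")))
```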
-- 
IWAMURO Motnori <http://vmi.jp/>


