This is the mail archive of the cygwin mailing list for the Cygwin project.


Re: The C locale

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin at cygwin dot com
Date: Thu, 24 Sep 2009 11:57:01 +0200
Subject: Re: The C locale
References: <416096c60909012329l2f25e735yc07145b8d6698cda@mail.gmail.com> <3f0ad08d0909020656v7d9fce6ft4afea63ed363b9a9@mail.gmail.com> <416096c60909071308qc5ff057sbe9cb1dbc270554f@mail.gmail.com> <20090908193456.GC17515@calimero.vinschen.de> <416096c60909081449r1fe024dbm7b82a3719be05e9e@mail.gmail.com> <20090921103758.GE20981@calimero.vinschen.de> <416096c60909211420g4ac8ea93l80fc1f00dcd5c0f3@mail.gmail.com> <3f0ad08d0909240003j435818e7h6f7cde2e26188f7e@mail.gmail.com> <20090924073441.GA30267@calimero.vinschen.de> <3f0ad08d0909240237s518de248jee409b731711404a@mail.gmail.com>
Reply-to: cygwin at cygwin dot com

On Sep 24 18:37, IWAMURO Motonori wrote:
> 2009/9/24 Corinna Vinschen <corinna-cygwin@cygwin.com>:
> > On Sep 24 16:03, IWAMURO Motonori wrote:
> >> 2009/9/22 Andy Koppe <andy.koppe@gmail.com>:
> >> > Let's use the Windows "ANSI" codepage as the character set for the C
> >> > locale, for both the conversion functions and filenames. This means
> >> > CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
> >> > ones, and so on.
> >>
> >> I oppose this approach (using the ANSI codepage for the C locale)
> >> because CP932 (the codepage for Japanese) is hostile to UNIX-like tools.
> >>
> >> The reason is that the second byte of a CP932 character can be one of
> >> many ASCII metacharacters, as follows.
> >>
> >>   single character of CP932:
> >> /[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/
> >
> > I don't understand.  Are you saying that a single character in CP932
> > consists of 12 bytes?  As far as I can see, CP932 is S-JIS, which
> > is just a simple double-byte character set.  What am I missing?
> 
> - CP932 (Shift_JIS) has 1-byte characters and 2-byte characters.
> 
> - The range of a 1-byte character is 0x00-0x7F and 0xA0-0xDF.
> 
> - The range of the first byte of a 2-byte character is 0x81-0x9F and
>   0xE0-0xFC.
> 
> - The range of the second byte of a 2-byte character is 0x40-0x7E and
>   0x80-0xFC.  This includes "[", "\", "]", "^", "`", "{", "|", "}".

Ok, thanks for your examples, they show neatly where the problem is.
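
To make the problem concrete, here is a minimal sketch in C.  It assumes
the lead-byte ranges from the regular expression quoted above; the
function names are made up for illustration and are not part of Cygwin
or newlib.  A byte-oriented scan counts every 0x5C byte as a backslash,
while a scan that knows about lead bytes skips the second byte of each
2-byte character.

```c
#include <assert.h>
#include <stddef.h>

/* First byte of a 2-byte CP932 character, using the ranges from the
   regular expression quoted in this thread (an assumption of this sketch). */
static int cp932_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive scan: every 0x5C byte is treated as a backslash, even when it is
   the second byte of a 2-byte character. */
size_t count_backslash_bytes(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if ((unsigned char)*s == 0x5C)
            n++;
    return n;
}

/* Lead-byte-aware scan: a 2-byte character is skipped as a unit, so only
   1-byte 0x5C characters are counted. */
size_t count_backslash_chars(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s != '\0') {
        if (cp932_is_lead((unsigned char)*s) && s[1] != '\0') {
            s += 2;             /* skip lead byte and trail byte together */
            continue;
        }
        if ((unsigned char)*s == 0x5C)
            n++;
        s++;
    }
    return n;
}
```

For the byte sequence 0x95 0x5C 'x' 0x5C 'y' (the first two bytes are a
single 2-byte character in Shift_JIS), the naive scan reports two
backslashes and the lead-byte-aware scan reports one.  Tools that are not
multibyte-aware behave like the naive scan.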

As you might know, the codepage 20932 (EUC-JP) is also not the same
as the UNIX EUC_JP implementation.  The JIS-X-0212 three byte codes
are folded into two-byte sequences as described in a comment in
strfuncs.cc:

  /* Unfortunately, the Windows eucJP codepage 20932 is not really 100%
     compatible to eucJP.  It's a cute approximation which makes it a
     doublebyte codepage.
     The JIS-X-0212 three byte codes (0x8f,0xa1-0xfe,0xa1-0xfe) are folded
     into two byte codes as follows: The 0x8f is stripped, the next byte is
     taken as is, the third byte is mapped into the lower 7-bit area by
     masking it with 0x7f.  So, for instance, the eucJP code 0x8f,0xdd,0xf8
     becomes 0xdd,0x78 in CP 20932.

     To be really eucJP compatible, we have to map the JIS-X-0212 characters
     between CP 20932 and eucJP ourselves. */
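
The folding described in that comment can be sketched as follows.  This
is only an illustration of the rule as stated, not the code in
strfuncs.cc; the function names are hypothetical, and the reverse
direction is an assumption inferred from the folding rule (a second byte
in the 7-bit range marks a folded JIS-X-0212 character).

```c
#include <assert.h>
#include <stddef.h>

/* Fold a three-byte eucJP JIS-X-0212 code (0x8f, 0xa1-0xfe, 0xa1-0xfe)
   into the two-byte CP 20932 form.  Returns 0 on success, -1 if the
   input is not a JIS-X-0212 sequence. */
int eucjp_x0212_to_cp20932(const unsigned char in[3], unsigned char out[2])
{
    assert(in != NULL && out != NULL);
    if (in[0] != 0x8f || in[1] < 0xa1 || in[1] > 0xfe
        || in[2] < 0xa1 || in[2] > 0xfe)
        return -1;
    out[0] = in[1];             /* 0x8f is stripped, next byte kept as is */
    out[1] = in[2] & 0x7f;      /* third byte masked into the 7-bit area  */
    return 0;
}

/* Assumed inverse: restore the 0x8f prefix and set the high bit of the
   second byte again.  Returns -1 if the pair is not a folded code. */
int cp20932_to_eucjp_x0212(const unsigned char in[2], unsigned char out[3])
{
    assert(in != NULL && out != NULL);
    if (in[0] < 0xa1 || in[0] > 0xfe || in[1] < 0x21 || in[1] > 0x7e)
        return -1;
    out[0] = 0x8f;
    out[1] = in[0];
    out[2] = in[1] | 0x80;
    return 0;
}
```

With the example from the comment, 0x8f,0xdd,0xf8 folds to 0xdd,0x78, and
the assumed inverse maps 0xdd,0x78 back to 0x8f,0xdd,0xf8.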

My question is this:  Is the S-JIS implementation on UNIX systems
also using a different implementation to avoid using characters
from the ASCII range?  If so, can't we change the __sjis_wctomb
and __sjis_mbtowc functions in the same manner as the __eucjp_wctomb
and __eucjp_mbtowc functions to get a safer implementation?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

