This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: command line argument parsing get extra ^X for Chinese characters when started from native win app


On Dec 24 15:36, Xuefer wrote:
> tested with
> $ uname -a
> CYGWIN_NT-6.1 mOo-PC 1.7.27(0.271/5/3) 2013-12-09 11:54 x86_64 Cygwin
> 
> run the following code in .bat file, the file should be in GBK
> encoding. as your system should be GBK encoding by default to parse
> the batch file correctly
> or copy paste the code to start->run
> ==[ to get actual wrong output ]
> c:\app\cygwin\bin\env LANG=zh_CN.UTF-8 PATH=/usr/bin bash -c "echo 中文;
> echo 中文 > a.txt; cat a.txt; xxd a.txt; echo please vim a.txt; sh"
> ===============
> 
> ==[  actual output ]
>  中 文
>  中 文
> 0000000: 18e4 b8ad 18e6 9687 0a                   .........
> please vim a.txt
> sh-4.1$
> ===============
> now when you do "vim a.txt", you see
> a.txt
> ^X中^X文

I'm sorry, but I have a hard time testing this.  I don't have a system
that allows switching the console to codepage 936, which would be
required to give this a try.  Also, the a.bat.txt file you attached to
your mail seems to be broken.  The characters in the `echo' commands
consist of four 0x3f bytes (ASCII '?'), which is probably not what you
wanted.  This doesn't look like valid GBK encoding.

I have a hunch what the problem might be, though.

When you start the batch file, no POSIX environment variable (such as
LANG or LC_CTYPE) is set yet to tell Cygwin which codeset you're using.
The first Cygwin process started here is `env'.  It's env that sets
LANG, but it does so only *after* its own command line has been read.
Env itself uses whatever was set in the environment before it started.
So when the command line is evaluated, the Cygwin locale is assumed to
be "C" or "POSIX", which is ASCII-only per POSIX.  In that case, every
non-ASCII char in the input is converted to a replacement byte sequence:
^X (== 0x18), followed by the UTF-8 encoding of the input character.
That's what you see: in your xxd dump, each three-byte UTF-8 sequence is
preceded by a 0x18 byte.
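
To illustrate the byte pattern described above, here is a minimal shell
sketch.  It assumes a POSIX shell with printf, tr and od available; the
helper name strip_ctrl_x is made up for this example.  It recreates the
bytes from the xxd dump in the report and shows that removing the 0x18
markers leaves plain UTF-8 behind.

```shell
#!/bin/sh
# Hypothetical helper: drop every ^X (0x18, octal 030) byte from stdin.
strip_ctrl_x() {
    tr -d '\030'
}

# Recreate the file contents from the report:
#   18 e4 b8 ad 18 e6 96 87 0a
printf '\030\344\270\255\030\346\226\207\n' > a.txt

# Dump the bytes that remain once the markers are removed.
# The bytes left are: e4 b8 ad e6 96 87 0a
strip_ctrl_x < a.txt | od -An -tx1
```

This only cleans up a file that was already written with the markers in
it.  It does not address the cause, which is the missing locale setting
at the time the command line is read.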

If my hunch is more or less correct, a workaround would be to make sure
the LANG or LC_CTYPE variable is set before calling the first Cygwin
process.  So, please change your bat file to something like this and try
again:

  set LC_CTYPE=zh_CN.UTF-8
  c:\app\cygwin\bin\env PATH=/usr/bin bash -c "echo 中文;
  echo 中文 > a.txt; cat a.txt; xxd a.txt; echo please vim a.txt; sh"
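
Whether the variable actually reached the first Cygwin process can be
checked from inside the shell it starts.  The following is a small
sketch, assuming a POSIX shell.  The helper effective_ctype is
hypothetical: it applies the standard POSIX precedence (LC_ALL, then
LC_CTYPE, then LANG) and is not how Cygwin itself is implemented.  The
final fallback to "C" mirrors the assumption described above.

```shell
#!/bin/sh
# Hypothetical helper: print the locale name that governs LC_CTYPE,
# following POSIX precedence.  Empty values count as unset.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# Print what the current process inherited from its parent.
effective_ctype
```

If this prints "C" when run as the first command after the batch file
starts bash, the setting did not make it into the environment.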


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


