This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Console codepage setting via chcp?


2009/9/26 Corinna Vinschen:
>> >> - System objects will always be translated using UTF-8. This includes
>> >> file names, user names, and initial environment variables (and
>> >> probably more I'm not aware of).
>> >[...]
>> The downside, of course, is that non-ASCII filenames created in a
>> non-UTF8 locale won't show up correctly in Windows, and vice versa.
>> But that's the same on Linux if the global setting is UTF-8 while the
>> terminal is set to something else. And the stock answer to any
>> complaints will be: Use UTF-8!
>>
>> In any case, the DCxx scheme will ensure that things work correctly
>> within any particular locale.
>>
>> And I guess the ^N scheme can go (or be disabled)?
>
> Probably not. I spent some more time thinking about the various
> scenarios (partly instead of sleeping) and it occurred to me that using
> UTF-8 exclusively is a nice dream.

So at least you enjoyed the few hours of sleep you did get then. ;)


> Still, what about your tar example given in
> http://cygwin.com/ml/cygwin-developers/2009-09/msg00043.html?

I suspect the interop between non-UTF8 Cygwin and native Windows is
more likely to draw complaints. In particular, rxvt users would be out
of luck in that respect, since UTF8 isn't going to be an option there.


> If we stick to UTF-8 exclusively we *have* to create the convmv-like
> tool which allows "broken" filenames to be converted from the
> \016\377\x notation to the UTF-8 \c2\x or \c3\x notation.

What's the \016\377\x notation? \016 is ^N, but the \377 isn't UTF-8,
so is that an additional scheme?

The way I understand it though, if filenames were always treated as
UTF8 by the system calls, then ^N would never be needed, because
invalid UTF8 is encoded as U+DCxx when converting to UTF16, while
UTF16-to-UTF8 is always valid (unless Windows filenames contain
invalid UTF16 in the first place ...).
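
The round trip described above can be illustrated with Python's "surrogateescape" error handler, which likewise maps each undecodable byte 0xNN to the lone surrogate U+DCNN. This is only an analogy under that assumption, not Cygwin's actual conversion code, and the filename "bäh" is made up for the example:

```python
# Illustration only: Python's "surrogateescape" handler, not Cygwin code.
# "bäh" in ISO-8859-1; the byte 0xE4 followed by 'h' is not valid UTF-8.
raw = b"b\xe4h"

# Decoding as UTF-8 maps the invalid byte 0xE4 to the lone surrogate U+DCE4.
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "b\udce4h"

# Encoding with the same handler restores the original bytes exactly,
# so nothing is lost on the way back to the application.
restored = name.encode("utf-8", errors="surrogateescape")
assert restored == raw

# In the other direction, well-formed text always encodes to valid UTF-8.
utf8_name = "bäh".encode("utf-8")
assert utf8_name == b"b\xc3\xa4h"
```

Because every invalid byte has its own U+DCxx stand-in, no separate escape prefix is needed to get the original bytes back.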

Therefore, I think the standard 'convmv' should be able to do the job.
I've had a quick look at it: it's a perl script, and seems to be
fairly straightforward to use, for example:

./convmv -f ISO-8859-1 -t UTF-8 bÃh
Starting a dry run without changes...
mv "./bÃh"      "./bÃÂh"

'LC_CTYPE=ISO-8859-1 tar ...' would still be nicer though.
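
For reference, the core of such a conversion is small. The following is a minimal sketch of a dry run, not convmv's actual logic; the function names and the sample filenames are hypothetical:

```python
def convert_name(raw_name: bytes, src: str = "iso-8859-1", dst: str = "utf-8") -> bytes:
    """Re-encode a filename's bytes from charset src to charset dst."""
    return raw_name.decode(src).encode(dst)


def dry_run(raw_names, src="iso-8859-1", dst="utf-8"):
    """Return (old, new) byte-name pairs for names that would change.

    Nothing is renamed; a real tool would call os.rename() on each pair.
    """
    plan = []
    for old in raw_names:
        new = convert_name(old, src, dst)
        if new != old:  # pure-ASCII names are identical in both charsets
            plan.append((old, new))
    return plan


# Hypothetical directory listing: one ISO-8859-1 name, one ASCII name.
plan = dry_run([b"b\xe4h", b"plain.txt"])
assert plan == [(b"b\xe4h", b"b\xc3\xa4h")]
```

Note that this sketch blindly re-encodes every name, so running it twice would double-encode; a real tool needs a check for names that are already valid in the target charset.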


> What's the right thing to do? I'm still unsure. With your proposal,
> it's at least the user's choice and if some interoperability issue occurs
> and the user complains, we can point to the FAQ: "Use UTF-8, dumbass!"

Yep.


> - System objects will always be *initially* translated using UTF-8. This
>   includes file names, user names, and initial environment variables.
> - By setting the locale environment variables you can switch the charset
>   used to translate filenames on a per-process basis.
>   This would be only a stop-gap measure, to allow re-use of old archives
>   or scripts. Those should be converted to UTF-8 ASAP. Expect complaints.
> - The "C" locale's charset will be UTF-8.
> - There'll be language-neutral "C.<charset>" locales.
> - The user's ANSI codepage will remain the default charset for
>   "language_TERRITORY" locales.
> - The console charset will be set according to LC_ALL/LC_CTYPE/LANG
>   at the time the application starts.
> - setlocale() will (probably) have no effects beyond what's expected in Linux.
>
> Please vote.

I vote for the proposal here, with added fence-sitting in the form of
a CYGWIN option called 'filename_charset' (or some such) taking
precedence over LC_ALL/LC_CTYPE/LANG.

With that, setting 'CYGWIN=fncset:UTF-8' would yield
http://cygwin.com/ml/cygwin-developers/2009-09/msg00050.html.
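
To make the precedence concrete, here is a rough sketch of how the filename charset might be resolved under this vote. It is purely hypothetical: the 'fncset' option does not exist, the "CP1252" default stands in for whatever the user's ANSI codepage is, and treating "POSIX" like "C" is an assumption on top of the proposal.

```python
def filename_charset(env, ansi_codepage="CP1252"):
    """Pick the charset used to translate filenames (hypothetical sketch)."""
    # 1. The proposed CYGWIN option wins over the locale variables.
    #    CYGWIN holds space-separated options; "fncset:<charset>" is made up.
    for opt in env.get("CYGWIN", "").split():
        key, _, value = opt.partition(":")
        if key == "fncset" and value:
            return value

    # 2. Otherwise the usual order applies: LC_ALL, then LC_CTYPE, then LANG.
    locale = ""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        locale = env.get(var, "")
        if locale:
            break
    if not locale:
        return "UTF-8"  # nothing set: the "C" locale, whose charset is UTF-8

    locale = locale.split("@", 1)[0]  # drop a modifier such as "@euro"
    name, dot, charset = locale.partition(".")
    if dot and charset:
        return charset  # explicit charset, e.g. "C.ISO-8859-1"
    if name in ("C", "POSIX"):
        return "UTF-8"
    return ansi_codepage  # bare "language_TERRITORY" locale


assert filename_charset({"CYGWIN": "fncset:UTF-8", "LANG": "de_DE.ISO-8859-1"}) == "UTF-8"
assert filename_charset({"LC_CTYPE": "C.ISO-8859-1", "LANG": "en_US.UTF-8"}) == "ISO-8859-1"
assert filename_charset({"LANG": "de_DE"}) == "CP1252"
```

Only the filename charset is overridden here; under this reading the console charset would still follow LC_ALL/LC_CTYPE/LANG.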

Andy

