This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: utf-8 and cygwin


How is the conversion to UCS-2 supposed to help?

All the programs that depend on cygwin use 8-bit strings: gcc, emacs, ls, python, ocaml, rsync, ssh, and so on.

That means that something 8-bit (not UCS-2) has to be passed back and forth between cygwin and those programs.

So even if you change cygwin to use UCS-2 internally, at some point it has to convert to an 8-bit format, whether that is UTF-8 or some other encoding.

Many, if not most, of those programs have a place where they interface with the OS (in this case cygwin), and there they allocate filename buffers sized by PATH_MAX. For them to handle a filename after it has been converted from UCS-2 to whatever 8-bit encoding you pass to them, PATH_MAX has to be set to 128k: if you get a 32k-character UCS-2 string out of NT and convert it to one of those 8-bit encodings, it can be MORE than 32k bytes long, and it won't fit in the buffers those programs were compiled with.
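To make the expansion concrete, here is a minimal sketch in Python (not cygwin code). The constant NT_MAX_UNITS and the helper utf8_len_of_utf16 are names I made up for illustration, and 32 * 1024 is an assumed round figure for the NT limit. It measures how many UTF-8 bytes a path of that many UTF-16 code units needs for three kinds of characters.

```python
# Sketch only: measure UTF-8 size of a path that fills the NT limit.
NT_MAX_UNITS = 32 * 1024  # assumed round figure for the NT path limit, in UTF-16 code units


def utf8_len_of_utf16(units: int, ch: str) -> int:
    """UTF-8 byte length of `ch` repeated until it fills `units` UTF-16 code units."""
    units_per_ch = len(ch.encode("utf-16-le")) // 2  # 1 for BMP, 2 for a surrogate pair
    s = ch * (units // units_per_ch)
    return len(s.encode("utf-8"))


ascii_bytes = utf8_len_of_utf16(NT_MAX_UNITS, "a")            # 1 byte per code unit
kana_bytes = utf8_len_of_utf16(NT_MAX_UNITS, "\u3042")        # hiragana A: 3 bytes per code unit
astral_bytes = utf8_len_of_utf16(NT_MAX_UNITS, "\U0001F600")  # surrogate pair: 4 bytes per 2 code units

print(ascii_bytes, kana_bytes, astral_bytes)
```

In this sketch the worst case is the Japanese one, at 3 bytes per code unit (96k bytes for 32k code units), so a 128k PATH_MAX is a safe upper bound rather than an exact one. Either way the converted name is far larger than a 32k buffer.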

As for it not making sense to use UTF-8: UTF-8 is the only viable encoding to pass between cygwin and its client programs if you want cygwin to be able to handle more than one language at once. I use Japanese most of the time, so most of my filenames would be fine being manipulated in iso-2022-jp, but as soon as I put one filename in Chinese or Korean on the disk, cygwin would become useless unless it had the ability to use UTF-8.
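The multi-language problem can be sketched in a few lines of Python. The filenames and the helper can_encode are hypothetical examples of mine; the codec names are the ones Python's standard library uses.

```python
# Sketch only: which filenames survive a given 8-bit encoding?
names = {
    "japanese": "日本語.txt",
    "korean": "한국어.txt",
}


def can_encode(name: str, codec: str) -> bool:
    """Return True if `name` can be represented in `codec` without loss."""
    try:
        name.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# For each name: (fits in iso-2022-jp, fits in UTF-8)
results = {
    lang: (can_encode(name, "iso2022_jp"), can_encode(name, "utf-8"))
    for lang, name in names.items()
}
print(results)
```

The Japanese name fits in both encodings, while the Korean name fits only in UTF-8, which is the situation described above: a single legacy encoding works until one file from another language shows up.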

With UTF-8 I could rsync or unison between any two computers (Linux <-> Windows <-> OS X) and it would handle all filenames in all languages. Without UTF-8 support that becomes impossible.

I'm sorry if I'm missing something fundamental, but I'm not dropping the topic yet. Given the scant information I have, and given that there have been no pointers to any previous discussion, I still don't understand how the proposed solution, as I currently picture it, actually solves the problems that need fixing.


----- Original Message ----
> From: Brian Dessent <brian@dessent.net>
> To: cygwin-developers@cygwin.com
> Sent: Saturday, December 29, 2007 9:46:34 AM
> Subject: Re: utf-8 and cygwin
> 
> 
> > #1 is that the NT/XP limit is 32000 UTF-16 wide characters. Expanded to
> > UTF-8 that makes the longest name 128k so if you really want this to work for
> > 32K character names PATH_MAX is going to have to be 128K.
> 
> It doesn't make any sense to use UTF-8 in Cygwin. Nowhere in the Win32
> api or the Native API does any function take or output UTF-8, so there
> would be a useless conversion before calling *any* system function. The
> whole point of this painful conversion is to use the same encoding
> throughout in Cygwin as the operating system, namely UCS-2.
> 
> I think Corinna might have more to say as she's been doing the bulk of
> the work but I believe she's on vacation.
> 
> Brian
>

