This is the mail archive of the cygwin-developers mailing list for the Cygwin project.
utf-8 and cygwin
- From: Gregg Tavares <unison at greggman dot com>
- To: cygwin-developers at cygwin dot com
- Date: Thu, 27 Dec 2007 16:47:24 -0800 (PST)
- Subject: utf-8 and cygwin
Hello, I'm new to the list and I hope I can be helpful.
I got here by trying to get rsync, and then unison, to sync my music files, which contain lots of Japanese filenames, between fc6 and XP.
I narrowed the problems down to 2 things I think.
#1) no utf-8 support in cygwin
I see that someone says they are working on it for a future release, but a search also brought up this patch:
http://www.okisoft.co.jp/esc/utf8-cygwin/
which already does that. I thought maybe whoever is working on future utf-8 support might want to look at that.
After trying that out, both unison and rsync started working without a recompile for short filenames, which brings up the second issue.
#2) The filename size limit in cygwin is too short.
Unfortunately, the names had to be pretty short. I think there are two issues: first, cygwin's maximum filename limit is too short; second, whatever it is set to, UTF-8 names will tend to exceed it. If I remember correctly, one Unicode character can take up to 4 bytes in UTF-8. That means, for example, that a typical Japanese MP3 filename stored as Album-Name/Song-Name can easily be larger than 255 bytes once expanded to UTF-8. The check for size overflow comes before the UTF-8 is converted to wide-character UTF-16. I believe a typical Japanese character takes 3 bytes in UTF-8 (4 only for characters outside the Basic Multilingual Plane), so a Japanese path of 120 Unicode characters could easily be 360 bytes of UTF-8, and up to 480 in the worst case.
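To make the arithmetic concrete, here is a rough sketch in Python (used purely for illustration, it is not Cygwin code). The helper names and the sample strings are made up for this example. It compares the UTF-8 byte count of a name with its UTF-16 code-unit count, and applies a 255-byte check to the UTF-8 form, the way a check done before conversion would.

```python
def utf8_len(name: str) -> int:
    """Number of bytes the name occupies when encoded as UTF-8."""
    return len(name.encode("utf-8"))


def utf16_units(name: str) -> int:
    """Number of 16-bit code units the name occupies in UTF-16."""
    return len(name.encode("utf-16-le")) // 2


def fits_byte_limit(name: str, limit: int = 255) -> bool:
    """Hypothetical check: does the UTF-8 form fit in `limit` bytes?"""
    return utf8_len(name) <= limit


# A made-up path of 120 hiragana characters (each is 3 bytes in UTF-8).
japanese_path = "あ" * 120

# U+20BB7 lies outside the Basic Multilingual Plane: 4 bytes in UTF-8,
# and a surrogate pair (2 code units) in UTF-16.
supplementary_char = "\U00020BB7"

print(utf8_len(japanese_path), utf16_units(japanese_path))
print(fits_byte_limit(japanese_path))
print(utf8_len(supplementary_char), utf16_units(supplementary_char))
```

The same 120-character path that fits comfortably in 255 UTF-16 code units fails a 255-byte check on its UTF-8 form.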
My problem, then, is that I'm not that familiar with the cygwin source or the issues involved in changing the filename limit.
For example MAXPATHLEN is defined in winsup/cygwin/include/sys/param.h as (260 -1)
and PATH_MAX is defined in winsup/cygwin/include/limits.h as 260
and NAME_MAX is also defined in winsup/cygwin/include/limits.h as 255
and there's even _POSIX_PATH_MAX 255
It seems that, in order to get UTF-8 to work, all of those have to grow by 4x, except maybe _POSIX_PATH_MAX, although even that, by its name, seems like it should be changed. Is that the MAX for POSIX, or is that the MIN for POSIX?
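For reference, a small sketch of what "4x" would mean for the values quoted above, again in Python for illustration only. The numbers are the ones listed in this message, the factor of 4 assumes the worst case of 4 UTF-8 bytes per Unicode character, and the names `QUOTED_LIMITS` and `scaled_limits` are made up for this example.

```python
# Limits as quoted in this message (not read from the Cygwin headers).
QUOTED_LIMITS = {
    "MAXPATHLEN": 260 - 1,
    "PATH_MAX": 260,
    "NAME_MAX": 255,
}

# Assumption: worst case of 4 UTF-8 bytes per Unicode character.
MAX_UTF8_BYTES_PER_CHAR = 4


def scaled_limits(limits, factor=MAX_UTF8_BYTES_PER_CHAR):
    """Return a new mapping with every limit multiplied by `factor`."""
    return {name: value * factor for name, value in limits.items()}


for name, value in scaled_limits(QUOTED_LIMITS).items():
    print(f"{name}: {QUOTED_LIMITS[name]} -> {value}")
```

Whether values this large are safe elsewhere in the source is exactly the part I can't judge.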
I'm not sure how I can help. If someone is already integrating UTF-8 support, I don't want to step on any toes; the most I can do then is help with testing. Otherwise, I was going to suggest adding the patches above and increasing those limits, if that won't break anything.
Thoughts? Suggestions?