This is the mail archive of the
cygwin-developers
mailing list for the Cygwin project.
Re: "C" UTF-8 trouble
On Oct 7 07:07, Andy Koppe wrote:
> 2009/10/7 Eric Blake:
> > For the problematic apps, are they checking just the environment
> > variables, or are they using setlocale(,NULL) and/or setlocale(,"") to
> > determine the current/default settings?
>
> Looking into this question, I found that for vim there's actually a
> completely different culprit: nl_langinfo(CODESET) returns "US-ASCII"
> for the C locale. (It also returns incorrect values for other
> charset-less locales.)
>
> Hence I replaced the code in nl_langinfo's CODESET case with just 'ret
> = __locale_charset()', and vim's fine!
Urgh. So we have to change nl_langinfo in newlib as well. Do we have
to return "US-ASCII" if charset is "ASCII", or is it sufficient to
return __locale_charset() as you did, thus returning "ASCII" for "ASCII"?
And what about stuff like "eucJP" vs. "EUCJP"? The charset in newlib
is always uppercase right now.
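
As an illustration of the two questions above, here is a minimal sketch in C. Both helpers, codeset_for_locale() and codeset_equal(), are hypothetical names made up for this example and are not part of newlib or Cygwin. The first shows how an application such as vim could observe the value of nl_langinfo(CODESET) for a given locale; the exact string it returns is implementation-specific. The second shows one way a caller might sidestep the "eucJP" vs. "EUCJP" question by comparing codeset names without regard to case, assuming the POSIX strcasecmp() is available.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: switch LC_CTYPE to `locale` and return the codeset
   name reported by nl_langinfo(CODESET), or NULL if the C library rejects
   the requested locale. */
const char *codeset_for_locale(const char *locale)
{
    assert(locale != NULL);
    if (setlocale(LC_CTYPE, locale) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: compare two codeset names ignoring case, so that
   "eucJP" and "EUCJP" are treated as the same charset. */
int codeset_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcasecmp(a, b) == 0;
}
```

Calling codeset_for_locale("C") and printing the result shows what a given C library reports for the C locale. A case-insensitive comparison does not address aliases such as "US-ASCII" vs. "ASCII"; those would still need an explicit mapping.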
> Unfortunately that's not the case for emacs.
<insert obligatory editor dispute here>
> > Anyone using _just_ the
> > environment variables is doomed to failure. POSIX states:
> >
> > "If the LANG environment variable is not set or is set to the empty
> > string, the implementation-defined default locale shall be used."
> >
> > My preference would be that if the environment variables were not set when
> > cygwin1.dll started, then setlocale(,NULL) returns "C.UTF-8" rather than
> > "C".
>
> The way I understand it, setlocale(,NULL) only queries the current
> setting and has to return "C" (or "POSIX") in the initial state.
>
> But you're right regarding setlocale(,""); that could indeed return
> something else if none of the environment variables is set. From
> http://www.opengroup.org/onlinepubs/7990989775/xbd/locale.html:
>
> "All implementations define a locale as the default locale, to be
> invoked when no environment variables are set, or set to the empty
> string. This default locale can be the POSIX locale or any other,
> implementation-dependent locale."
>
> I think this is a good idea, so I replaced "C" with "C.UTF-8" at the end
> of __get_locale_env. Yet emacs still doesn't behave, and digging into
> its code I found that it does indeed read the env variables directly.
> :(
>
> ;; Use the first of these three environment variables
> ;; that has a nonempty value.
> (let ((vars '("LC_ALL" "LC_CTYPE" "LANG")))
> (while (and vars
> (= 0 (length locale))) ; nil or empty string
> (setq locale (getenv (pop vars) frame)))))
I, too, think this is a good idea. __get_locale_env() should be changed
to return "C.UTF-8".
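
For readers who want to see the lookup order in C, the following is a minimal sketch and not newlib's actual __get_locale_env(). It mirrors the order used by the quoted Emacs snippet (LC_ALL, then LC_CTYPE, then LANG) and takes the default as a parameter, so that either "C" or "C.UTF-8" can be plugged in. The function name ctype_locale_from_env() is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not newlib's __get_locale_env(): return the value of
   the first non-empty variable among LC_ALL, LC_CTYPE and LANG, or
   `fallback` if none of them is set to a non-empty string. */
const char *ctype_locale_from_env(const char *fallback)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    assert(fallback != NULL);
    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        /* A variable set to the empty string counts as unset. */
        if (value != NULL && *value != '\0')
            return value;
    }
    return fallback;
}
```

With all three variables unset, ctype_locale_from_env("C.UTF-8") returns the fallback. An application that reads the environment itself sees no value at all in that situation, which is the Emacs case described above.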
As for Emacs, I'm wondering if it shouldn't be changed to set its locale
according to setlocale(LC_CTYPE,NULL) instead, given what POSIX says.
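
A minimal sketch of that approach in C, assuming only the standard setlocale() interface: the application first adopts the environment's setting with setlocale(LC_CTYPE, ""), which falls back to the implementation-defined default when no variable is set, and then queries the result with a NULL argument. The helper name effective_ctype_locale() is hypothetical.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: adopt the environment's LC_CTYPE setting (or the
   implementation-defined default if no variable is set), then return the
   resulting locale name as reported by the query form of setlocale(). */
const char *effective_ctype_locale(void)
{
    /* An empty string asks the C library to consult the environment;
       NULL is returned if the requested locale is not available. */
    if (setlocale(LC_CTYPE, "") == NULL)
        return NULL;

    /* A NULL locale argument only queries; it changes nothing. */
    return setlocale(LC_CTYPE, NULL);
}
```

Because the C library performs the lookup, an application written this way picks up whatever default the library chooses without inspecting LC_ALL, LC_CTYPE or LANG itself.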
It would be nice to check /etc/defaults/locale in __get_locale_env() as
well, but I'm a bit reluctant to do that. It means that every invocation of
a Cygwin process has to open that file if the environment isn't set.
Talking about performance...
Alternatively, only the first invocation of Cygwin in a process tree
could try to read this file.
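
To make the cost concrete, here is a rough sketch of what reading such a file could look like. Everything in it is an assumption for illustration: the thread does not specify the format of /etc/defaults/locale, so the sketch assumes the file's first line holds nothing but a locale name, and both load_default_locale() and LOCALE_NAME_MAX are hypothetical names (the latter is a stand-in for the real buffer limit).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOCALE_NAME_MAX 31   /* assumed limit, for this sketch only */

/* Hypothetical helper: read the first line of `path` into `buf`, which must
   hold LOCALE_NAME_MAX + 1 bytes. Returns 0 on success. Returns -1 and
   leaves `buf` untouched if the file cannot be opened or its first line is
   empty or too long. Assumes the line contains only a locale name. */
int load_default_locale(const char *path, char buf[LOCALE_NAME_MAX + 1])
{
    char line[LOCALE_NAME_MAX + 2];   /* name + one extra char + NUL */
    FILE *fp;
    size_t len;

    assert(path != NULL && buf != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof line, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* Length of the name up to, but not including, the line terminator. */
    len = strcspn(line, "\r\n");
    if (len == 0 || len > LOCALE_NAME_MAX)
        return -1;

    memcpy(buf, line, len);
    buf[len] = '\0';
    return 0;
}
```

A caller would invoke this only when none of the locale environment variables is set, and keep its built-in default if the function returns -1. Each call costs an open, a read and a close, which is the per-process overhead discussed above.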
For a start, here's a first untested cut at newlib's locale.c, which
allows us to add any desired mechanism to switch the default locale.
The comment is already jumping ahead a bit...:
Index: libc/locale/locale.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/locale/locale.c,v
retrieving revision 1.28
diff -u -p -r1.28 locale.c
--- libc/locale/locale.c 29 Sep 2009 19:12:28 -0000 1.28
+++ libc/locale/locale.c 7 Oct 2009 08:57:12 -0000
@@ -205,6 +205,21 @@ static char *categories[_LC_LAST] = {
 };
 
 /*
+ * Default locale per POSIX.
+ */
+#ifdef __CYGWIN__
+#define DEFAULT_LOCALE "C.UTF-8"
+#else
+#define DEFAULT_LOCALE "C"
+#endif
+/*
+ * This variable can be changed by any outside mechanism. This allows,
+ * for instance, to load the default locale from a file. On Cygwin,
+ * we're using /etc/defaults/locale for that.
+ */
+char __default_locale[ENCODING_LEN + 1] = DEFAULT_LOCALE;
+
+/*
  * Current locales for each category
  */
 static char current_categories[_LC_LAST][ENCODING_LEN + 1] = {
@@ -733,7 +748,7 @@ __get_locale_env(struct _reent *p, int c
 
   /* 4. if none is set, fall to "C" */
   if (env == NULL || !*env)
-    env = "C";
+    env = __default_locale;
 
   return env;
 }
If you agree to this, I'll propose it on the newlib list.
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat