Internationalization

Overview

Internationalization support is controlled by the LANG and LC_xxx environment variables. You can set all of them but Cygwin itself only honors the variables LC_ALL, LC_CTYPE, and LANG, in this order, according to the POSIX standard. The content of these variables should follow the POSIX standard for a locale specifier. The correct form of a locale specifier is

  language[[_TERRITORY][.charset][@modifier]]

"language" is a lowercase two-character string per ISO 639-1, "TERRITORY" is an uppercase two-character string per ISO 3166, "charset" is one of a list of supported character sets, and the modifier doesn't matter here (though it might for some applications). If you're interested in the exact description, you can find it in the online publication of the POSIX manual pages on the homepage of the Open Group.

Typical locale specifiers are

  "de_CH"          language = German, territory = Switzerland, default charset
  "fr_FR.UTF-8"    language = French, territory = France, charset = UTF-8
  "ko_KR.eucKR"    language = Korean, territory = South Korea, charset = eucKR

At application startup, the application's locale is set to the default "C" or "POSIX" locale. Under Cygwin, this locale defaults to the UTF-8 character set. If you want to stick to the "C" locale and only change to another charset, you can define this by setting one of the locale environment variables to "C.charset". For instance

  "C.ISO-8859-1"

The default locale in the absence of the aforementioned locale environment variables is "C.UTF-8".
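
The specifier form described above, including the "C.charset" variant, can be checked mechanically. The following Python sketch is only an illustration and not Cygwin's own parser. The names LocaleSpec and parse_locale are made up for this example, and the regular expression assumes exactly the form given above: two lowercase letters for the language (or the special names "C" and "POSIX"), two uppercase letters for the territory.

```python
import re
from typing import NamedTuple, Optional


class LocaleSpec(NamedTuple):
    """The four parts of a locale specifier (hypothetical helper type)."""
    language: str
    territory: Optional[str]
    charset: Optional[str]
    modifier: Optional[str]


# language[[_TERRITORY][.charset][@modifier]]
# Assumption: "C" and "POSIX" are accepted in place of a language code.
_SPEC = re.compile(
    r"(?P<language>[a-z]{2}|C|POSIX)"
    r"(?:_(?P<territory>[A-Z]{2}))?"
    r"(?:\.(?P<charset>[A-Za-z0-9-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?"
)


def parse_locale(spec: str) -> Optional[LocaleSpec]:
    """Split a locale specifier into its parts, or return None if malformed."""
    match = _SPEC.fullmatch(spec)
    if match is None:
        return None
    # Optional parts that are absent come back as None.
    return LocaleSpec(**match.groupdict())


for example in ("de_CH", "fr_FR.UTF-8", "ko_KR.eucKR", "C.ISO-8859-1"):
    print(example, "->", parse_locale(example))
```

A specifier such as "fr_fr" is rejected by this sketch because the territory is not uppercase. Whether a given charset is actually supported is a separate question that this check does not answer.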

Windows uses the UTF-16 charset exclusively to store the names of any object used by the Operating System. This is especially important with filenames. Cygwin uses the setting of the locale environment variables LC_ALL, LC_CTYPE, and LANG, to determine how to convert Windows filenames from their UTF-16 representation to the singlebyte or multibyte character set used by Cygwin.

The setting of the locale environment variables at process startup is effective for Cygwin's internal conversions to and from the Windows UTF-16 object names for the entire lifetime of the current process. Changing the environment variables to another value changes the way filenames are converted in subsequently started child processes, but not within the same process.
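
The lookup order of the three variables, evaluated once against the environment a process starts with, can be sketched as follows. This is an illustration in Python rather than Cygwin code; the function name effective_locale is hypothetical, and the sketch assumes the POSIX convention that an empty variable counts as unset.

```python
import os
from typing import Mapping

# Default named in the text for the case that no variable is set.
DEFAULT_LOCALE = "C.UTF-8"


def effective_locale(env: Mapping[str, str]) -> str:
    """Return the first non-empty value of LC_ALL, LC_CTYPE, LANG."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:  # assumption: an empty string is treated like "not set"
            return value
    return DEFAULT_LOCALE


# Evaluate against the environment this process was started with.
print(effective_locale(os.environ))
```

Passing a plain dictionary instead of os.environ makes it easy to try out combinations, for example {"LANG": "de_CH", "LC_CTYPE": "fr_FR.UTF-8"}, where LC_CTYPE takes precedence over LANG.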

However, even if one of the locale environment variables is set to some value other than "C", this only affects how Cygwin itself converts filenames. As the POSIX standard requires, it's the application's responsibility to activate that locale for its own purposes, typically by using the call

  setlocale (LC_ALL, "");

early in the application code. Again, so that this doesn't get lost: If the application calls setlocale as above, and none of the relevant locale variables is set in the environment, the locale is set to the default locale, which is "C.UTF-8".

Right now the language and territory, as well as the modifier, are not important to Cygwin, except to fix a single problem. There's a class of characters in the Unicode character set, called the "CJK Ambiguous Width Character set". For these characters the width returned by the wcwidth/wcswidth function is usually 1. This is often a problem in East-Asian languages, which historically use character sets in which these characters have a width of 2. Kind of explains why they are called "ambiguous"...

The problem has been fixed like this. wcwidth/wcswidth usually return 1 as the width of these characters. However, if the language is specified as "ja" (Japanese), "ko" (Korean), or "zh" (Chinese), wcwidth/wcswidth return 2 for these characters. Unfortunately this isn't correct in all circumstances, so the user can specify the modifier "@cjknarrow", which modifies the behaviour of wcwidth/wcswidth to return 1 for the ambiguous width characters even in those languages.
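
The rule for ambiguous-width characters can be modelled in a few lines. The Python sketch below is not Cygwin's wcwidth; it only restates the rule from the text. The function names are hypothetical, and the sketch relies on Python's unicodedata module, which reports the East Asian Width class "A" for ambiguous characters.

```python
import unicodedata

# Languages for which ambiguous-width characters are treated as wide.
CJK_LANGUAGES = {"ja", "ko", "zh"}


def is_ambiguous(char: str) -> bool:
    """True if the character has East Asian Width class 'A' (ambiguous)."""
    return unicodedata.east_asian_width(char) == "A"


def ambiguous_width(locale_spec: str) -> int:
    """Width assigned to ambiguous characters under the given locale."""
    base, _, modifier = locale_spec.partition("@")
    # Language is whatever precedes the first "." or "_".
    language = base.split(".", 1)[0].split("_", 1)[0]
    if language in CJK_LANGUAGES and modifier != "cjknarrow":
        return 2
    return 1


plus_minus = "\u00b1"  # PLUS-MINUS SIGN, an ambiguous-width character
for spec in ("de_DE.UTF-8", "ja_JP.UTF-8", "ja_JP.UTF-8@cjknarrow"):
    print(spec, is_ambiguous(plus_minus), ambiguous_width(spec))
```

The sketch deliberately covers only the ambiguous class. Wide, combining and control characters need their own handling in a real wcwidth implementation.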

Other than that, the only important part so far is the character set. How does that work?

How to set the locale

  • The default locale is the "C" or "POSIX" locale. Under Cygwin this locale defaults to the UTF-8 character set.

  • Assume that you've set one of the aforementioned environment variables to some valid POSIX locale value, other than "C" and "POSIX". Assume further that you're living in Japan. You might want to use the language code "ja" and the territory "JP", thus setting, say, LANG to "ja_JP". You didn't set a character set, so what will Cygwin use now? Easy! It will use the default Windows ANSI codepage of your system, if it's supported by Cygwin. Hopefully Cygwin supports all relevant default ANSI codepages...

    Note

    For a list of supported character sets, see the section called “List of supported character sets”

  • You don't want to use the default Windows codepage as character set? In that case you have to specify the charset explicitly. For instance, assume you're from Italy and don't want to use the Italian default Windows ANSI codepage 1252, but the more portable ISO-8859-15 character set. What you can do, for instance, is to set the LANG variable in the C:\cygwin\Cygwin.bat file, which is the batch file that starts a Cygwin session from the "Cygwin" desktop shortcut.

      @echo off
    
      C:
      chdir C:\cygwin\bin
      set LANG=it_IT.ISO-8859-15
      bash --login -i
    
  • Last, but not least, most singlebyte or doublebyte charsets have a big disadvantage. Windows filesystems use the Unicode character set in the UTF-16 encoding to store filename information. Not all characters from the Unicode character set are available in a singlebyte or doublebyte charset. While Cygwin has a workaround to access files with unusual characters (see the section called “Filenames with unusual (foreign) characters”), a better workaround is to always use the UTF-8 character set.

    Of the character sets supported by Cygwin, UTF-8 is the only one which can represent every Unicode character.

      set LANG=es_MX.UTF-8
    

    For a description of the Unicode standard, see the homepage of the Unicode Consortium.
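
The disadvantage of singlebyte charsets mentioned in the last item is easy to demonstrate. The following Python sketch uses Python's own codecs, which are an assumption standing in for Cygwin's conversion tables, and the helper name representable is hypothetical. It asks whether a filename can be encoded in a given charset at all.

```python
def representable(filename: str, charset: str) -> bool:
    """True if every character of the filename exists in the charset."""
    try:
        filename.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


latin_name = "caf\u00e9.txt"                 # "café.txt"
japanese_name = "\u65e5\u672c\u8a9e.txt"     # Japanese characters plus ".txt"

for name in (latin_name, japanese_name):
    for charset in ("ISO-8859-15", "UTF-8"):
        print(repr(name), charset, representable(name, charset))
```

The Latin name fits into both charsets, the Japanese name only into UTF-8. A charset name unknown to Python raises LookupError in this sketch instead of returning False.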

The Windows Console character set

Most of the time the Windows console is used to run Cygwin applications. While terminal emulations like xterm or mintty have a distinct way to set the character set used for input and output, the Windows console has no such way, since it's not an application in its own right.

This problem is solved in Cygwin as follows. When a Cygwin process is started in a Windows console (either explicitly from cmd.exe, or implicitly by, for instance, clicking on the Cygwin desktop icon, or running the Cygwin.bat file), the Console character set is determined by the setting of the aforementioned internationalization environment variables, the same way as described in the section called “How to set the locale”.

What is that good for? Why not switch the console character set according to the application's requirements? After all, the application knows if it uses localization or not. However, what if a non-localized application calls a remote application which itself is localized? This can happen with ssh or rlogin. Neither command has or needs localization, and neither ever calls setlocale. Setting one of the internationalization environment variables to the same charset as the remote machine before starting ssh or rlogin fixes that problem.
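
As a recap, the charset selection rules from the section “How to set the locale”, which the console follows as well, can be written down as a small decision function. This Python sketch only restates the rules from the text and is not taken from Cygwin. The function name charset_for and the parameter ansi_charset (the charset name matching the system's default ANSI codepage, for example "CP1252") are hypothetical.

```python
def charset_for(locale_spec: str, ansi_charset: str) -> str:
    """Pick the charset implied by a locale specifier.

    ansi_charset is an assumed input: the supported charset name that
    corresponds to the default Windows ANSI codepage of the system.
    """
    base = locale_spec.partition("@")[0]      # drop any @modifier
    name, dot, charset = base.partition(".")
    if dot and charset:
        return charset                        # explicit charset wins
    if name in ("C", "POSIX"):
        return "UTF-8"                        # default locale uses UTF-8
    return ansi_charset                       # fall back to the ANSI codepage


print(charset_for("it_IT.ISO-8859-15", "CP1252"))
print(charset_for("it_IT", "CP1252"))
print(charset_for("C", "CP1252"))
```

The sketch leaves out the condition that the ANSI codepage has to be one that Cygwin supports.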

Potential Problems when using Locales

You can set the above internationalization variables not only in Cygwin.bat or in the Windows environment, but also in your Cygwin shell on the fly, even switch to yet another character set, and yet another. In bash for instance:

  bash$ export LC_CTYPE="nl_BE.UTF-8"

However, here's a problem. At the start of the first Cygwin process in a session, the Windows environment is converted from UTF-16 to UTF-8. The environment is another of the system objects stored in UTF-16 in Windows.

As long as the environment only contains ASCII characters, this is no problem at all. But if it contains native characters, and you're planning to use, say, GBK, the environment will contain byte sequences that are invalid in the GBK charset. This would be a particular problem in variables like PATH. To circumvent the worst problems, Cygwin converts the PATH environment variable to the charset set in the environment, if it's different from the UTF-8 charset.
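
The effect is easy to reproduce outside of Cygwin. The Python sketch below uses Python's gbk codec as a stand-in for Cygwin's GBK support, and the function names are hypothetical. It compares reading UTF-8 bytes as if they were GBK with a proper conversion of the text to GBK, which is what the text describes for PATH.

```python
def read_utf8_bytes_as(text: str, charset: str) -> str:
    """Store text as UTF-8, then misread those bytes in another charset."""
    raw = text.encode("utf-8")
    # errors="replace" substitutes U+FFFD for byte sequences that are
    # invalid in the target charset instead of raising an exception.
    return raw.decode(charset, errors="replace")


def convert_and_read(text: str, charset: str) -> str:
    """Convert text to the charset first, then read it in that charset."""
    return text.encode(charset).decode(charset)


ascii_path = "/usr/bin:/bin"
native_dir = "/home/\u4e2d\u6587"   # a directory name with two Chinese characters

print(read_utf8_bytes_as(ascii_path, "gbk") == ascii_path)   # ASCII survives
print(read_utf8_bytes_as(native_dir, "gbk") == native_dir)   # native text does not
print(convert_and_read(native_dir, "gbk") == native_dir)     # proper conversion works
```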

Note

Per POSIX, the name of an environment variable should only consist of valid ASCII characters, and only of uppercase letters, digits, and the underscore for maximum portability.

Symbolic links, too, may pose a problem when switching charsets on the fly. A symbolic link contains the filename of the target file the symlink points to. When a symlink was created with older versions of Cygwin, the then-current ANSI or OEM character set was used to store the target filename, depending on the old CYGWIN environment variable setting codepage (see the section called “Obsolete options”). If the target filename contains non-ASCII characters and you use a character set other than your default ANSI/OEM charset, the target filename of the symlink is now potentially an invalid character sequence in the new character set. This behaviour is not different from the behaviour in other Operating Systems. So, if you suddenly can't access a symlink anymore which worked all these years before, maybe it's because you switched to another character set. This doesn't occur with symlinks created with Cygwin 1.7 or later.

Another problem you might encounter is that older versions of Windows did not install all charsets by default. If you are running Windows XP or older, you can open the "Regional and Language Options" portion of the Control Panel, select the "Advanced" tab, and select entries from the "Code page conversion tables" list. The following entries are useful to Cygwin: 932/SJIS, 936/GBK, 949/EUC-KR, 950/Big5, 20932/EUC-JP.

What does not work?

Except for LC_ALL, LC_CTYPE, and LANG, all other LC_xxx environment variables, LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME, are ignored right now. This means, while Cygwin supports different character sets, it does not support real localization so far. There's no support for locale-specific monetary symbols, for a decimal point other than '.', for native time formats, or for native language sorting orders.

Cygwin's internationalization support is work in progress and we would be glad for coding help in this area.

List of supported character sets

Last but not least, here's the list of currently supported character sets. The left-hand expression is the name of the charset, as you would use it in the internationalization environment variables as outlined above. Note that charset specifiers are case-insensitive. EUCJP is equivalent to eucJP or eUcJp. Writing the charset in the exact case as given in the list below is a good convention, though.

The right-hand side is the number of the equivalent Windows codepage as well as the Windows name of the codepage. They are only noted here for reference. Don't try to use the bare codepage number or the Windows name of the codepage as charset in locale specifiers, unless they happen to be identical with the left-hand side. Especially in case of the "CPxxx" style charsets, always use them with the leading "CP".

This works:

  set LC_ALL=en_US.CP437

This does not work:

  set LC_ALL=en_US.437

You can find a full list of Windows codepages on the Microsoft MSDN page Code Page Identifiers.

    Charset               Codepage

    CP437                   437 (OEM United States)
    CP720                   720 (DOS Arabic)
    CP737                   737 (OEM Greek)
    CP775                   775 (OEM Baltic)
    CP850                   850 (OEM Latin 1, Western European)
    CP852                   852 (OEM Latin 2, Central European)
    CP855                   855 (OEM Cyrillic)
    CP857                   857 (OEM Turkish)
    CP858                   858 (OEM Latin 1 + Euro Symbol)
    CP862                   862 (OEM Hebrew)
    CP866                   866 (OEM Russian)
    CP874                   874 (ANSI/OEM Thai)
    CP1125                 1125 (OEM Ukraine)
    CP1250                 1250 (ANSI Central European)
    CP1251                 1251 (ANSI Cyrillic)
    CP1252                 1252 (ANSI Latin 1, Western European)
    CP1253                 1253 (ANSI Greek)
    CP1254                 1254 (ANSI Turkish)
    CP1255                 1255 (ANSI Hebrew)
    CP1256                 1256 (ANSI Arabic)
    CP1257                 1257 (ANSI Baltic)
    CP1258                 1258 (ANSI/OEM Vietnamese)

    ISO-8859-1            28591 (ISO-8859-1)
    ISO-8859-2            28592 (ISO-8859-2)
    ISO-8859-3            28593 (ISO-8859-3)
    ISO-8859-4            28594 (ISO-8859-4)
    ISO-8859-5            28595 (ISO-8859-5)
    ISO-8859-6            28596 (ISO-8859-6)
    ISO-8859-7            28597 (ISO-8859-7)
    ISO-8859-8            28598 (ISO-8859-8)
    ISO-8859-9            28599 (ISO-8859-9)
    ISO-8859-10             -   (not available)
    ISO-8859-11             -   (not available)
    ISO-8859-13           28603 (ISO-8859-13)
    ISO-8859-14             -   (not available)
    ISO-8859-15           28605 (ISO-8859-15)
    ISO-8859-16             -   (not available)

    KOI8-R                20866 (KOI8-R Russian Cyrillic)
    KOI8-U                21866 (KOI8-U Ukrainian Cyrillic)
    SJIS                    932 (ANSI/OEM Japanese)
    GBK                     936 (ANSI/OEM Simplified Chinese)
    Big5                    950 (ANSI/OEM Traditional Chinese)
    eucJP                 20932 (EUC Japanese)
    eucKR                   949 (EUC Korean)

    UTF-8 or UTF8         65001 (UTF-8)
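
The naming rules for this list (case-insensitive names, no bare codepage numbers) can be expressed as a lookup. The following Python sketch is built from the left-hand column of the table above and is not Cygwin code. The names SUPPORTED and normalize_charset are hypothetical.

```python
# Charset names from the left-hand column of the table above.
_CP_NUMBERS = [437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874,
               1125] + list(range(1250, 1259))
_ISO_PARTS = [n for n in range(1, 17) if n != 12]   # there is no ISO-8859-12

SUPPORTED = (
    [f"CP{n}" for n in _CP_NUMBERS]
    + [f"ISO-8859-{n}" for n in _ISO_PARTS]
    + ["KOI8-R", "KOI8-U", "SJIS", "GBK", "Big5", "eucJP", "eucKR", "UTF-8"]
)

# Case-insensitive index; "UTF8" is listed as an alternative spelling.
_BY_FOLDED = {name.upper(): name for name in SUPPORTED}
_BY_FOLDED["UTF8"] = "UTF-8"


def normalize_charset(name: str):
    """Return the spelling used in the table, or None if not listed."""
    return _BY_FOLDED.get(name.upper())


for candidate in ("CP437", "437", "eUcJp", "utf8"):
    print(candidate, "->", normalize_charset(candidate))
```

The bare number "437" is not found, while "CP437" is, and "eUcJp" maps back to the conventional spelling "eucJP".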