This is the mail archive of the cygwin mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Unicode width data inconsistent/outdated


On Aug  7 11:28, Corinna Vinschen wrote:
> On Aug  5 21:06, Thomas Wolff wrote:
> > Am 04.08.2017 um 19:01 schrieb Corinna Vinschen:
> > > This shouldn't matter to you, just keep it in place.  It's a historical,
> > > low footprint conversion for japanese characters without pulling in the
> > > unicode stuff.  Not used on Cygwin so just ignore.
> > I had noticed meanwhile that this is not active in Cygwin, but it's broken
> > anyway for multiple reasons:
> >    * platforms for which wchar_t is not Unicode should be explicitly listed
> >    * if used, the transformation needs to be applied to all non-Unicode
> > locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
> >    * for towupper and towlower, the result must be back-transformed into the
> > respective locale encoding
> >    * particulary the locale-specific _l functions inconsistently do not use
> > the transformation but have this note:
> 
> No, no, no.  The functionality is restricted to certain use-cases and
> always was.  It was a paid-for customer extension back in the day and it
> was *sufficient* for the use-cases.  It's not clear how many newlib
> users are still using it, but it's not a good idea to remove it without
> checking first.  That means, ask on the newlib mailing list how many are
> using the historical jp2uc code, and if we don't get a reply within,
> say, a month, we can probably nuke it.

To clarify where we're coming from:

If you look into newlib/libc/locale/locale.c, function __loadlocale,
you'll notice that outside of Cygwin, only six single/double/multi-bytes
codesets are supported at all:

  ASCII
  ISO-8859-1
  EUCJP
  JIS
  SJIS
  UTF-8

The multichar/widechar conversion functions for EUCJP, JIS and SJIS were
implemented to have a low footprint in the first place, see, for
instance, __sjis_wctomb in newlib/libc/stdlib/wctomb_r.c.

This is all about simplification for small targets.  There was never a
requirement that converting a UTF-8 char to wchar_t, and converting the
equivalent SJIS char to wchar_t would result in the same wide char.

Consequentially, Cygwin does not use these conversion functions.  Rather
it uses Windows conversion functions, see the conversion functions in
winsup/cygwin/strfuncs.cc, to get a consistent wide char representation
(UTF-16).  Another side-effect is that Cygwin does not support JIS at
all, only SJIS, see the comment in strfuncs.cc.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

Attachment: signature.asc
Description: PGP signature


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]