This is the mail archive of the cygwin mailing list for the Cygwin project.


Re: Unicode width data inconsistent/outdated

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin at cygwin dot com
Date: Mon, 7 Aug 2017 12:41:27 +0200
Subject: Re: Unicode width data inconsistent/outdated
Authentication-results: sourceware.org; auth=none
References: <f3c1b415-7a26-8bbe-a67f-5619d356f058@towo.net> <20170726080859.GA24312@calimero.vinschen.de> <5d3cb047-49f8-26a6-d816-387a71486e99@cygwin.com> <20170726095016.GA25666@calimero.vinschen.de> <289bd98b-e644-888d-07f8-8965b6538373@towo.net> <20170728195826.GI24013@calimero.vinschen.de> <1244bd24-bb27-d185-1f24-61beae02c2cd@towo.net> <20170804170156.GL25551@calimero.vinschen.de> <30486790-c59d-9a78-6000-b3c20fb86d9d@towo.net> <20170807092820.GQ25551@calimero.vinschen.de>
Reply-to: cygwin at cygwin dot com

On Aug  7 11:28, Corinna Vinschen wrote:
> On Aug  5 21:06, Thomas Wolff wrote:
> > On 04.08.2017 at 19:01, Corinna Vinschen wrote:
> > > This shouldn't matter to you, just keep it in place.  It's a historical,
> > > low-footprint conversion for Japanese characters without pulling in the
> > > unicode stuff.  Not used on Cygwin so just ignore.
> > I had noticed meanwhile that this is not active in Cygwin, but it's broken
> > anyway for multiple reasons:
> >    * platforms for which wchar_t is not Unicode should be explicitly listed
> >    * if used, the transformation needs to be applied to all non-Unicode
> > locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
> >    * for towupper and towlower, the result must be back-transformed into the
> > respective locale encoding
> >    * particularly the locale-specific _l functions inconsistently do not use
> > the transformation but have this note:
> 
> No, no, no.  The functionality is restricted to certain use-cases and
> always was.  It was a paid-for customer extension back in the day and it
> was *sufficient* for the use-cases.  It's not clear how many newlib
> users are still using it, but it's not a good idea to remove it without
> checking first.  That means, ask on the newlib mailing list how many are
> using the historical jp2uc code, and if we don't get a reply within,
> say, a month, we can probably nuke it.

To clarify where we're coming from:

If you look into newlib/libc/locale/locale.c, function __loadlocale,
you'll notice that outside of Cygwin, only six single-, double-, or
multibyte codesets are supported at all:

  ASCII
  ISO-8859-1
  EUCJP
  JIS
  SJIS
  UTF-8

The multibyte/wide-char conversion functions for EUCJP, JIS and SJIS
were implemented to have a low footprint in the first place, see, for
instance, __sjis_wctomb in newlib/libc/stdlib/wctomb_r.c.

This is all about simplification for small targets.  There was never a
requirement that converting a UTF-8 char to wchar_t, and converting the
equivalent SJIS char to wchar_t would result in the same wide char.
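
To illustrate the point, here is a minimal sketch, assuming a
low-footprint SJIS conversion that simply packs the two bytes of a
double-byte character into the wide char instead of mapping to Unicode.
All sketch_* names are hypothetical and this is not the newlib code; the
UTF-8 decoder is deliberately reduced to sequences of one to three bytes
without overlong or surrogate checks.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift_JIS lead byte ranges: 0x81-0x9F and 0xE0-0xEF. */
static int sketch_is_sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9f) || (c >= 0xe0 && c <= 0xef);
}

/* Shift_JIS trail byte ranges: 0x40-0x7E and 0x80-0xFC. */
static int sketch_is_sjis_trail(unsigned char c)
{
    return (c >= 0x40 && c <= 0x7e) || (c >= 0x80 && c <= 0xfc);
}

/* Hypothetical low-footprint conversion: no Unicode table, the two
   bytes are just packed as (lead << 8) | trail.
   Returns the number of bytes consumed, or -1 on error. */
static int sketch_sjis_to_wide(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n == 0)
        return -1;
    if (!sketch_is_sjis_lead(s[0])) {
        *out = s[0];            /* single byte passes through unchanged */
        return 1;
    }
    if (n < 2 || !sketch_is_sjis_trail(s[1]))
        return -1;
    *out = ((uint32_t)s[0] << 8) | s[1];
    return 2;
}

/* Inverse of the packing above. Returns bytes written, or -1. */
static int sketch_wide_to_sjis(uint32_t wc, unsigned char *out)
{
    unsigned char hi, lo;

    if (wc > 0xffff)
        return -1;
    hi = (unsigned char)(wc >> 8);
    lo = (unsigned char)(wc & 0xff);
    if (hi == 0) {
        out[0] = lo;
        return 1;
    }
    if (!sketch_is_sjis_lead(hi) || !sketch_is_sjis_trail(lo))
        return -1;
    out[0] = hi;
    out[1] = lo;
    return 2;
}

/* Reduced UTF-8 decoder (1 to 3 bytes); yields the Unicode code point. */
static int sketch_utf8_to_wide(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n >= 1 && s[0] < 0x80) {
        *out = s[0];
        return 1;
    }
    if (n >= 2 && (s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
        *out = ((uint32_t)(s[0] & 0x1f) << 6) | (uint32_t)(s[1] & 0x3f);
        return 2;
    }
    if (n >= 3 && (s[0] & 0xf0) == 0xe0
        && (s[1] & 0xc0) == 0x80 && (s[2] & 0xc0) == 0x80) {
        *out = ((uint32_t)(s[0] & 0x0f) << 12)
             | ((uint32_t)(s[1] & 0x3f) << 6)
             | (uint32_t)(s[2] & 0x3f);
        return 3;
    }
    return -1;
}

/* HIRAGANA LETTER A: 0x82 0xA0 in SJIS, 0xE3 0x81 0x82 in UTF-8.
   Returns 1 if the two wide values differ, 0 if equal, -1 on error. */
static int sketch_demo_mismatch(void)
{
    static const unsigned char sjis[] = { 0x82, 0xa0 };
    static const unsigned char utf8[] = { 0xe3, 0x81, 0x82 };
    uint32_t from_sjis = 0, from_utf8 = 0;

    if (sketch_sjis_to_wide(sjis, sizeof sjis, &from_sjis) != 2)
        return -1;
    if (sketch_utf8_to_wide(utf8, sizeof utf8, &from_utf8) != 3)
        return -1;
    return from_sjis != from_utf8;
}
```

Under this packing assumption the SJIS input yields 0x82A0 while the
UTF-8 input yields the code point 0x3042, so the same character ends up
as two different wide chars.  The round trip within one codeset still
works, which is all a small target needs.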

Consequently, Cygwin does not use these conversion functions.  Rather,
it uses Windows conversion functions (see the conversion functions in
winsup/cygwin/strfuncs.cc) to get a consistent wide char representation
(UTF-16).  Another side effect is that Cygwin does not support JIS at
all, only SJIS; see the comment in strfuncs.cc.

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


