This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: [BUG REPORT]sed -e 's/[B-D]/_/g' replaces unexpected characters
- From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
- To: cygwin at cygwin dot com
- Date: Wed, 26 Jun 2013 11:19:38 +0200
- Subject: Re: [BUG REPORT]sed -e 's/[B-D]/_/g' replaces unexpected characters
- References: <CA+nJC97He=j-O2FZ-Y2jJhYXEJn2o2EfC1wO39+2bZ=nj1f-zA at mail dot gmail dot com> <20130625152356 dot GD11958 at calimero dot vinschen dot de> <5F8AAC04F9616747BC4CC0E803D5907D0C37C240 at MLBXv04 dot nih dot gov> <20130625160359 dot GB14459 at calimero dot vinschen dot de> <20130625160911 dot GC14459 at calimero dot vinschen dot de>
- Reply-to: cygwin at cygwin dot com
On Jun 25 18:09, Corinna Vinschen wrote:
> On Jun 25 18:03, Corinna Vinschen wrote:
> > On Jun 25 15:38, Lavrentiev, Anton (NIH/NLM/NCBI) [C] wrote:
> > > > Your locale is zh_CN.UTF-8. What you're expecting is only guaranteed
> > > > in the C locale:
> > > [...]
> Which also means, AFAICS, Cygwin's sed is doing it right, Linux' sed
> is doing it wrong. Yes, that puzzles me a bit at the moment, too.
I had a discussion with my colleagues from the Linux side of Red Hat.
The bottom line is, we're both doing it right, just differently.
As for the difference itself, here's what happened:
The gawk maintainer was unhappy with how regex ranges worked when using
locales other than the C locale. So he implemented a change to regex
which he called "rational ranges". The idea is that something like
[b-d] always means lowercase only and [B-D] always means uppercase only,
independent of the locale we're in.
This change to the regex handling made it not only into gawk(*), but
also into glibc(**) and perl regex. It did not make it into sed or bash,
for instance.
That's why sed under Cygwin shows the default, collation-abiding
behaviour when using a non-C locale. Under Fedora 18 it shows the new
"rational ranges" behaviour, because glibc supports them and sed has
been built with the --without-included-regex option.
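
To make the difference concrete, here is a small sketch that imitates a
collation-based range by hand. The sequence in "order" is a hypothetical,
simplified collation order chosen for illustration; it is not the real
zh_CN.UTF-8 table, and the variable names are made up. The sketch lists
every character that sorts between B and D in that sequence and replaces
exactly those, running sed in the C locale so that nothing else
interferes.

```shell
# Hypothetical, simplified collation sequence (NOT the real zh_CN.UTF-8 table).
order='aAbBcCdDeE'

# Collect everything that sorts from B through D in that sequence.
rest=${order#*B}             # drop everything up to and including B
members="B${rest%%D*}D"      # keep what lies before D, then add both endpoints
printf '%s\n' "$members"     # prints: BcCdD

# Replace exactly those characters. The bracket expression contains no range,
# and LC_ALL=C keeps the real locale out of the picture.
collated=$(printf 'abcdeABCDE\n' | LC_ALL=C sed -e "s/[$members]/_/g")
printf '%s\n' "$collated"    # prints: ab__eA___E
```

With such an ordering, [B-D] also catches the lowercase c and d. Under
"rational ranges" the same input would come out as abcdeA___E.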
I just checked the new upstream sed 4.2.2 (will upload shortly) and it
still doesn't implement "rational ranges", even though its regex is
derived from gnulib's regex.
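
If a script needs [B-D] to mean uppercase B, C and D only, no matter
which sed it runs against, one option is to force the C locale for that
one command, since that is the locale in which this behaviour is
guaranteed. A minimal sketch (the variable name "result" is just for
illustration):

```shell
# Set LC_ALL=C for the sed invocation only; the rest of the session
# keeps its normal locale.
result=$(printf 'abcdeABCDE\n' | LC_ALL=C sed -e 's/[B-D]/_/g')
printf '%s\n' "$result"      # prints: abcdeA___E
```

Keep in mind that the C locale also treats the input as plain bytes, so
this is only a good fit when the text being matched is ASCII.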
Corinna
(*) Try echo abcdeABCDE | awk '{ gsub(/[B-D]/, "_"); print }'
(**) http://sourceware.org/ml/libc-alpha/2012-12/msg00456.html
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Maintainer cygwin AT cygwin DOT com
Red Hat