This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin-developers at cygwin dot com
Date: Sun, 27 Sep 2009 18:14:55 +0200
Subject: Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
References: <416096c60909262332j37d13eb4k400a7ca6c488872e@mail.gmail.com> <20090927091331.GB30851@calimero.vinschen.de> <416096c60909270322x32d94673h47ff7c28231cb09e@mail.gmail.com> <20090927110025.GC30851@calimero.vinschen.de> <416096c60909270414y52d93f6fncfe72852bb3331fe@mail.gmail.com> <20090927120414.GD30851@calimero.vinschen.de> <416096c60909270606h4cc2dd4ctbc1da5c5b1a310bb@mail.gmail.com>
Reply-to: cygwin-developers at cygwin dot com

On Sep 27 14:06, Andy Koppe wrote:
> 2009/9/27 Corinna Vinschen:
> > Last but not least, you cannot have both, graceful handling of invalid
> > sequences *and* a bijective relation between UTF-16 and multibyte
> > strings.  There's always a tradeoff.
> 
> Correct. However, you can have correct roundtripping from any Unix
> filename to a Windows filename and back to the same Unix filename
> (well, with UTF-8 and singlebyte charsets anyway.

What about "\xed\xb2\x80"?  That's UTF-16 0xDC80 which, if recognized
as "special invalid byte sequence" is translated back to "\x80".
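The collision can be reproduced with Python's codec error handlers.  This
is only an illustration under an assumption: Python's "surrogateescape"
handler (PEP 383) stands in for the "special invalid byte sequence"
scheme discussed here, since both map a stray byte 0x80 to U+DC80.  It is
not Cygwin's conversion code.

```python
# Two different multibyte filenames (the ones named in this thread).
raw_byte = b"foo-\x80"                   # stray single byte 0x80
encoded_surrogate = b"foo-\xed\xb2\x80"  # UTF-8-style encoding of U+DC80

# Assumption: "surrogateescape" models the escape scheme (0x80 -> U+DC80),
# "surrogatepass" models accepting an encoded lone surrogate as-is.
wide_a = raw_byte.decode("utf-8", "surrogateescape")
wide_b = encoded_surrogate.decode("utf-8", "surrogatepass")

# Both names end up as the same wide-character string.
same_wide_name = (wide_a == wide_b)

# Converting back with the escape scheme always yields the stray byte,
# so the three-byte form does not survive the round trip.
back = wide_b.encode("utf-8", "surrogateescape")
round_trip_ok = (back == encoded_surrogate)
```

Here same_wide_name is true and round_trip_ok is false, which is the
ambiguity described above.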

> inherently dodgy anyway).
> 
> And I contend that that's more important than supporting invalid
> UTF-16 in Windows filenames not created by Cygwin.

But there's no reason to disallow lone surrogate halves besides the
U+DCxx range with xx >= 0x80.  Only this tiny range can represent
a stray singlebyte char.

> > Either you disallow some multibyte
> > filenames, or you have to live with the fact that two different
> > multibyte sequences translate to the same UTF-16 filename or vice versa.
> > The latter is IMHO the lesser problem.  The probability that an
> > application tries to use two different files named foo-\x80 and
> > foo-\xed\xb2\x80 is almost nil.
> 
> Accepted, but I don't think that's the main issue here.
> 
> Here's an example: say you've got your locale set to UTF-8, and you
> unpack a tarball created on an ISO-8859-1 system that contains a file
> called "Ä". This turns into U+DCC4 on disk. So far so good.
> 
> Now you run 'convmv -f ISO-8859-1 -t UTF-8' on it to correct the
> filename, but instead of a single ISO-8859-1 byte representing "Ä",
> convmv will see the three bytes of the low surrogate, and hence the
> filename will end up with three UTF-8 characters instead of one.
> 
> Also, there'll probably be testsuites that trip over this, e.g. Lapo
> Lucchini's 'monotone' tests that triggered this whole discussion.
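The byte arithmetic behind the convmv example quoted above can be checked
with a short Python sketch.  It assumes the on-disk U+DCC4 is handed back
to the application as its three-byte UTF-8-style encoding, as the quoted
text describes.  The variable names are made up for illustration and
nothing here calls convmv itself.

```python
# The ISO-8859-1 byte for "Ä" is 0xC4; per the example it is stored as U+DCC4.
on_disk = "\udcc4"

# Assumption: the tool sees the lone surrogate as its 3-byte encoding.
seen_by_tool = on_disk.encode("utf-8", "surrogatepass")

# An ISO-8859-1 -> UTF-8 conversion treats every byte as one character.
as_latin1 = seen_by_tool.decode("iso-8859-1")
converted = as_latin1.encode("utf-8")

# What the conversion should have produced from the single original byte.
wanted = b"\xc4".decode("iso-8859-1").encode("utf-8")

print(seen_by_tool, len(as_latin1), converted, wanted)
```

The tool sees b"\xed\xb3\x84", so the converted name holds three
characters (six bytes) instead of the single "Ä" (b"\xc3\x84").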

I'm getting headaches.

What about this:  The private use area U+f0xx is already used for ASCII
chars invalid in Windows filenames.  The same range can be used for
invalid chars >= 0x80.  This could happen unconditionally.  We already
can't handle Windows filenames with characters in this range without
character conversion.  So, why not just use this area and be done with
it?  This allows lossless CESU-8 byte handling, and the handling for
"special" characters is reduced to a minimum of code, and a minimum of
impact on existing filenames.
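A minimal Python sketch of this idea, to make the round trip concrete.
The helper names (to_wide, from_wide, "pua_escape") are hypothetical, and
this is not the Cygwin implementation.  One further assumption: Python's
strict UTF-8 decoder rejects encoded surrogates, so in this sketch a
sequence like "\xed\xb2\x80" is escaped byte by byte rather than being
kept as a surrogate.

```python
import codecs

PUA_BASE = 0xF000  # private use offset named in the proposal


def _pua_escape(err):
    """Hypothetical error handler: map each undecodable byte to U+F0xx."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(PUA_BASE + b) for b in bad), err.end


codecs.register_error("pua_escape", _pua_escape)


def to_wide(name: bytes) -> str:
    """Multibyte filename -> wide string; invalid bytes land in U+F080..U+F0FF."""
    return name.decode("utf-8", "pua_escape")


def from_wide(name: str) -> bytes:
    """Wide string -> multibyte filename; U+F080..U+F0FF become raw bytes again."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


names = [b"foo-\x80", b"foo-\xed\xb2\x80", "\u00c4".encode("utf-8"), b"\xc4"]
wide = [to_wide(n) for n in names]
restored = [from_wide(w) for w in wide]
```

With this mapping foo-\x80 and foo-\xed\xb2\x80 get distinct wide names
and every entry in names converts back to the same bytes.  The price, as
noted above, is that a name which really contains U+F080..U+F0FF cannot
be represented without conversion.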

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

Follow-Ups:
- Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Andy Koppe

References:
- Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Andy Koppe
- Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Corinna Vinschen
- Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Andy Koppe
- Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Corinna Vinschen
- Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Andy Koppe
- Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Corinna Vinschen
- Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Andy Koppe
