This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)


On Sep 27 14:04, Corinna Vinschen wrote:
> On Sep 27 12:14, Andy Koppe wrote:
> > 2009/9/27 Corinna Vinschen:
> > > I don't understand this one.  That's not what I observe after I have
> > > changed the __utf8_wctomb and __utf8_mbtowc functions accordingly.
> > > A single byte 0x80 gets encoded to U+DC80.  The round trip results
> > > in \xed\xb2\x80.
> > 
> > Ah, I'd assumed that U+DCxx in filenames would continue to map to xx
> > (and vice versa). Either way, this would mean that filenames aren't
> > transparent: the name can change between open() and readdir().
> > 
> > ... pondering ...
> > 
> > Therefore I think that lone surrogates shouldn't be allowed after all,
> > because Unix filename transparency is more important than being able
> > to access Windows filenames with invalid UTF-16 (which can't have been
> > created within Cygwin).
> 
> After being through this and looking into what happens, I disagree.
> [...]

Btw., with my changes to __utf8_mbtowc and __utf8_wctomb, that's what
happens.  Below is a small test application.  It only omits the test
for a single invalid byte value, since that case isn't visible to
applications.

== SNIP ==
/* foo.c */
#include <stdio.h>
#include <stdlib.h>	/* mbstowcs, wcstombs */
#include <wchar.h>
#include <string.h>

void
print_c (const unsigned char *c)
{
  int i;
  for (i = 0; c[i]; ++i)
    printf ("%02x ", c[i]);
  puts ("");
}

void
doit (const char *in)
{
  wchar_t w[64];
  char c1[64], c2[64];
  int i;

  strcpy (c1, in);
  mbstowcs (w, c1, 64);
  wcstombs (c2, w, 64);
  print_c (c1);
  for (i = 0; w[i]; ++i)
    printf ("%04x ", (unsigned) w[i]);
  puts ("");
  print_c (c2);
  puts ("");
}

int
main ()
{
  doit (" \xf0\x90\x80\x81 ");
  doit (" \xed\xa0\x8d ");
  doit (" \xed\xa0\x8d");
  doit (" \xed\xb0\x8d ");
  doit (" \xed\xb0\x8d");
  doit (" \xed\xa0\x8d\xed\xb0\x8d ");
  doit (" \xed\xb0\x8d\xed\xa0\x8d ");
  doit (" \xed\xa0\x8d \xed\xb0\x8d ");
  doit (" \xed\xb0\x8d \xed\xa0\x8d ");
  doit (" \xed\xa0\x8d  \xed\xb0\x8d ");
  doit (" \xed\xb0\x8d  \xed\xa0\x8d ");
}
== SNAP ==

  $ gcc -g -o foo foo.c
  $ ./foo
  20 f0 90 80 81 20
  0020 d800 dc01 0020
  20 f0 90 80 81 20

  20 ed a0 8d 20
  0020 d80d 0020
  20 ed a0 8d 20

  20 ed a0 8d
  0020 d80d
  20 ed a0 8d

  20 ed b0 8d 20
  0020 dc0d 0020
  20 ed b0 8d 20

  20 ed b0 8d
  0020 dc0d
  20 ed b0 8d

  20 ed a0 8d ed b0 8d 20       <== Valid surrogate pair,
  0020 d80d dc0d 0020
  20 f0 93 90 8d 20             <== so the 4-byte form is to be expected

  20 ed b0 8d ed a0 8d 20
  0020 dc0d d80d 0020
  20 ed b0 8d ed a0 8d 20

  20 ed a0 8d 20 ed b0 8d 20
  0020 d80d 0020 dc0d 0020
  20 ed a0 8d 20 ed b0 8d 20

  20 ed b0 8d 20 ed a0 8d 20
  0020 dc0d 0020 d80d 0020
  20 ed b0 8d 20 ed a0 8d 20

  20 ed a0 8d 20 20 ed b0 8d 20
  0020 d80d 0020 0020 dc0d 0020
  20 ed a0 8d 20 20 ed b0 8d 20

  20 ed b0 8d 20 20 ed a0 8d 20
  0020 dc0d 0020 0020 d80d 0020
  20 ed b0 8d 20 20 ed a0 8d 20
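
For reference, the wide-char to multibyte direction seen in the output
above can be sketched roughly as follows.  This is only an illustration
of the behaviour, not the actual newlib code: the function name
utf16_to_utf8_lenient and its signature are made up for this example.
A high surrogate immediately followed by a low surrogate is merged into
one code point and written as a 4-byte sequence; any other surrogate is
written as a plain 3-byte sequence, which is why the lone surrogates
survive the round trip.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the newlib implementation.  Converts N UTF-16
   units from IN to UTF-8 in OUT and returns the number of bytes written.
   OUT must have room for at least 3 * N bytes. */
size_t
utf16_to_utf8_lenient (const uint16_t *in, size_t n, unsigned char *out)
{
  size_t i, o = 0;

  for (i = 0; i < n; ++i)
    {
      uint32_t cp = in[i];

      /* High surrogate directly followed by a low surrogate: combine
	 both units into one supplementary-plane code point. */
      if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < n
	  && in[i + 1] >= 0xdc00 && in[i + 1] <= 0xdfff)
	{
	  cp = 0x10000 + ((cp - 0xd800) << 10) + (in[i + 1] - 0xdc00);
	  ++i;
	}

      if (cp < 0x80)
	out[o++] = (unsigned char) cp;
      else if (cp < 0x800)
	{
	  out[o++] = (unsigned char) (0xc0 | (cp >> 6));
	  out[o++] = (unsigned char) (0x80 | (cp & 0x3f));
	}
      else if (cp < 0x10000)
	{
	  /* Lone surrogates end up here and are encoded like any
	     other BMP value. */
	  out[o++] = (unsigned char) (0xe0 | (cp >> 12));
	  out[o++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
	  out[o++] = (unsigned char) (0x80 | (cp & 0x3f));
	}
      else
	{
	  out[o++] = (unsigned char) (0xf0 | (cp >> 18));
	  out[o++] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
	  out[o++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
	  out[o++] = (unsigned char) (0x80 | (cp & 0x3f));
	}
    }
  return o;
}
```

Feeding it 0020 d80d dc0d 0020 gives 20 f0 93 90 8d 20, since
0x10000 + (0x00d << 10) + 0x00d = U+1340D.  With the two surrogates
swapped, or separated by a space, nothing is merged and each one comes
out as its own 3-byte sequence (ed a0 8d or ed b0 8d).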


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

