This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)


2009/9/27 Corinna Vinschen:
>> > Last but not least, you cannot have both, graceful handling of invalid
>> > sequences *and* a bijective relation between UTF-16 and multibyte
>> > strings. There's always a tradeoff.
>>
>> Correct. However, you can have correct roundtripping from any Unix
>> filename to a Windows filename and back to the same Unix filename
>> (well, with UTF-8 and single-byte charsets anyway).
>
> What about "\xed\xb2\x80"? That's UTF-16 0xDC80 which, if recognized
> as "special invalid byte sequence" is translated back to "\x80".

Yep, that's problematic too, which is why I was arguing against
accepting "\xed\xb2\x80" as UTF-8 in the first place, meaning it
should be treated as three invalid UTF-8 bytes, represented as:

U+DCED U+DCB2 U+DC80

But scratch that.
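
For reference, the collision discussed above can be reproduced with
Python's built-in error handlers. This is only an illustration, not
Cygwin code: "surrogateescape" happens to use the same byte-to-U+DCxx
mapping described here, and "surrogatepass" stands in for a decoder
that accepts lone surrogates as UTF-8. The variable names are made up
for the example.

```python
# Illustration only: Python's handlers stand in for the conversions discussed.
raw = b"\xed\xb2\x80"  # UTF-8-style encoding of the lone surrogate U+DC80

# Treat the sequence as three invalid bytes: each byte b becomes U+DC00 + b.
escaped = raw.decode("utf-8", "surrogateescape")
escaped_points = [f"U+{ord(c):04X}" for c in escaped]

# This form round-trips back to the original three bytes.
roundtrip_escaped = escaped.encode("utf-8", "surrogateescape")

# Accept the sequence as a lone surrogate instead: one code unit, U+DC80.
accepted = raw.decode("utf-8", "surrogatepass")

# A single invalid byte 0x80 is escaped to the very same code unit.
escaped_0x80 = b"\x80".decode("utf-8", "surrogateescape")

# So converting the accepted surrogate back yields "\x80", not the original.
roundtrip_accepted = accepted.encode("utf-8", "surrogateescape")
```

With both behaviours enabled at once, accepted and escaped_0x80 are the
same string, so the way back cannot tell which byte sequence it came
from.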


> I'm getting headaches.

Same here. Someone ought to be shot for UTF-16.


> What about this: The private use area U+f0xx is already used for ASCII
> chars invalid in Windows filenames. The same range can be used for
> invalid chars > 0x80. This could happen unconditionally.

That's a great idea, allowing both lone surrogate support and Unix
filename transparency.

[time passes]

Nope, can't think of anything wrong with it. :)
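
A rough sketch of how that could look, written in Python rather than
as actual Cygwin code. Everything named here (PUA_BASE,
"pua_escape_sketch", unix_to_windows, windows_to_unix) is hypothetical,
and the sketch assumes names never contain genuine U+F080..U+F0FF
characters. Lone surrogates are decoded as themselves, while invalid
bytes go to U+F0xx, so the two no longer collide.

```python
import codecs

PUA_BASE = 0xF000  # U+F0xx range from the proposal (name is hypothetical)


def _pua_escape(err):
    """Decode error handler: keep lone surrogates, map bad bytes to U+F0xx."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    data, pos = err.object, err.start
    chunk = data[pos:pos + 3]
    # ED A0..BF 80..BF is the UTF-8-style encoding of a lone surrogate.
    if (len(chunk) == 3 and chunk[0] == 0xED
            and 0xA0 <= chunk[1] <= 0xBF and 0x80 <= chunk[2] <= 0xBF):
        cp = ((chunk[0] & 0x0F) << 12) | ((chunk[1] & 0x3F) << 6) | (chunk[2] & 0x3F)
        return chr(cp), pos + 3
    # Anything else: one private-use character per invalid byte.
    bad = data[err.start:err.end]
    return "".join(chr(PUA_BASE + b) for b in bad), err.end


codecs.register_error("pua_escape_sketch", _pua_escape)


def unix_to_windows(name: bytes) -> str:
    """Unix filename bytes -> string of UTF-16-representable code points."""
    return name.decode("utf-8", "pua_escape_sketch")


def windows_to_unix(name: str) -> bytes:
    """Inverse mapping. Assumes no genuine U+F080..U+F0FF in the name."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp - PUA_BASE)                 # escaped invalid byte
        else:
            out += ch.encode("utf-8", "surrogatepass")  # incl. lone surrogates
    return bytes(out)
```

Under these assumptions "\x80" maps to U+F080 and "\xed\xb2\x80" maps to
U+DC80, and each converts back to the bytes it started from.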

Andy

