This is the mail archive of the cygwin mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?


2009/9/22 Corinna Vinschen:
>> >> Therefore, when converting a UTF-16 Windows filename to the current
>> >> charset, 0xDC?? words should be treated like any other UTF-16 word
>> >> that can't be represented in the current charset: it should be encoded
>> >> as a ^N sequence.

(I started writing this before seeing your patch to the singlebyte
codepage tables, which makes plenty of sense. Here goes anyway.)

Having actually looked at strfuncs.cc, my diagnosis was too
simplistic, because the U+DC?? codes are used not only for invalid
UTF-8 bytes, but for invalid bytes in any charset. This even includes
CP1252, which has a few holes in the 0x80..0x9F range.

Therefore, the complete solution would be something like this: when
sys_cp_wcstombs comes across a 0xDC?? code, it checks whether the byte
it encodes is indeed an invalid byte in the current charset. If it is,
it translates it into that invalid byte, because on the way back it
would once again be turned into the same 0xDC?? code. If the byte
would represent (part of) a valid character, however, it would need to
be encoded as a ^N sequence to ensure correct roundtripping.

Now that shouldn't be too difficult to implement for singlebyte
charsets, but it gets somewhat hairy for multibyte charsets, including
UTF-8 itself. Here's how I think it could be done though:

In sys_cp_wcstombs:

* On encountering a DC?? code, extract the encoded byte, and feed it
into f_mbtowc. A private mbstate for this is needed, starting in the
initial state for each filename. Switch on the result of f_mbtowc:
** case -2 (incomplete sequence): add the byte to a buffer for this purpose
** case -1 (invalid sequence): copy anything already in the buffer
plus the current byte into the target filename, as we can be sure that
they'll turn back into U-DCbb again on the way back.
** case >0 (valid sequence): encode buffer contents and current byte
as a ^N codes that don't represent valid UTF-8

* When encountering a non-DC?? code, copy any bytes left in the buffer
into the target filename.

Unfortunately the latter point still leaves a loophole, in case the
incomplete sequence from the buffer and the subsequent bytes combine
into something valid. Singlebyte charset aren 't affected though,
because they don't have continuation bytes. Nor is UTF-8, because it
was designed such that continuation bytes are distinct from initial
bytes. Which leaves the DBCS charsets.

However, it rather looks like DBCSs are an intractable problem here in
any case, because of issues like this:

http://support.microsoft.com/kb/170559: "There are some codes that are
not matched one-to-one between Shift-JIS (Japanese character set
supported by MS) and Unicode. When an application calls
MultiByteToWideChar() and WideCharToMultiByte() to perform code
conversion between Shift-JIS and Unicode, the function returns the
wrong code value in some cases."

Which leaves me scratching my head regarding the C locale. More later ...

Andy

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]