This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: bug in mbrtowc?


2009/7/28 Corinna Vinschen:
> On Jul 27 22:56, Andy Koppe wrote:
>> I've encountered what looks like a bug in mbrtowc's handling of UTF-8.
>> Here's an example:
>>
>> #include <stdio.h>
>> #include <locale.h>
>> #include <stdlib.h>
>> #include <wchar.h>
>>
>> int main(void) {
>>   wchar_t wc;
>>   size_t ret;
>>   mbstate_t s = { 0 };
>>   puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
>>   printf("%i\n", mbrtowc(&wc, "\xe2", 1, 0));
>>   printf("%i\n", mbrtowc(&wc, "\x94", 1, 0));
>>   printf("%i\n", mbrtowc(&wc, "\x84", 1, 0));
>>   printf("%x\n", wc);
>> Â return 0;
>> }
>>
>> The sequence E2 94 84 should translate to U+2504. Instead, the second
>> and third calls to mbrtowc report encoding errors. It does work
>> correctly if the three bytes are passed to mbrtowc() in one go:
>>
>>   printf("%i\n", mbrtowc(&wc, "\xe2\x94\x84", 3, 0));
>
> That's a bug in the newlib function __utf8_mbtowc.  I'm really surprised
> that this bug has never been reported before since it's in the code for
> years, probably since it has been introduced in 2002.

I guess programs normally just pass whole strings to mbsrtowcs()?
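
For illustration, here is a minimal sketch of the state a decoder has to
carry between calls when the bytes arrive one at a time. This is not the
newlib code: struct utf8_state and utf8_feed() are made-up names, and the
sketch assumes well-formed input apart from a basic lead/continuation
check (no rejection of overlong forms or surrogate code points).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state, kept by the caller between calls. */
struct utf8_state {
    int      pending;  /* continuation bytes still expected */
    uint32_t partial;  /* code point bits gathered so far */
};

/*
 * Feed a single byte.
 * Returns  1 and stores the code point in *out when a character completes,
 *          0 when more bytes are needed,
 *         -1 on a byte that cannot appear at this position.
 */
static int utf8_feed(struct utf8_state *st, unsigned char b, uint32_t *out)
{
    assert(st != NULL && out != NULL);

    if (st->pending == 0) {
        if (b < 0x80) {                 /* plain ASCII */
            *out = b;
            return 1;
        }
        if ((b & 0xE0) == 0xC0) {       /* lead byte of a 2-byte sequence */
            st->partial = b & 0x1F;
            st->pending = 1;
        } else if ((b & 0xF0) == 0xE0) { /* lead byte of a 3-byte sequence */
            st->partial = b & 0x0F;
            st->pending = 2;
        } else if ((b & 0xF8) == 0xF0) { /* lead byte of a 4-byte sequence */
            st->partial = b & 0x07;
            st->pending = 3;
        } else {
            return -1;                  /* stray continuation or invalid lead */
        }
        return 0;
    }

    if ((b & 0xC0) != 0x80) {           /* expected a continuation byte */
        st->pending = 0;
        return -1;
    }
    st->partial = (st->partial << 6) | (b & 0x3F);
    if (--st->pending > 0)
        return 0;
    *out = st->partial;
    return 1;
}
```

Starting from a zeroed state, each lead or intermediate byte gives 0 and
only the final byte of a sequence gives 1, which is roughly what the
-2 / byte-count return values of mbrtowc() are meant to convey.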

I've had a look at the code, but didn't grasp it enough to suggest a
fix. I'd also wondered how mbrtowc() deals with non-BMP characters
given that wchar_t is only 16 bits wide, and was quite pleased to see
that it does have a special hack for turning them into UTF-16
surrogates.
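
For reference, the surrogate arithmetic itself looks roughly like this.
It is only a sketch of the standard UTF-16 encoding rule; to_surrogates()
is a hypothetical helper name, not a function from newlib.

```c
#include <assert.h>
#include <stdint.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair. */
static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);

    cp -= 0x10000;                            /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits    -> high surrogate */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits -> low surrogate  */
}
```

Since one call to mbrtowc() can only write a single 16-bit wchar_t, the
second half of the pair has to be handed out by a later call, which is
where the trouble starts.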

Trouble is, the hack will also only work correctly if the whole UTF-8
sequence for the non-BMP character is passed at once. If you pass the
bytes one-by-one instead, and assuming the bug above wasn't there,
you'd get this:

With UTF-8 '0xF0 0x92 0x8D 0x85' == UTF-16 '0xD808 0xDF45' == UTF-32 '0x12345':

mbrtowc(&wc, "\xF0", 1, 0) returns -2
mbrtowc(&wc, "\x92", 1, 0) returns -2
mbrtowc(&wc, "\x8D", 1, 0) returns -2
mbrtowc(&wc, "\x85", 1, 0) returns 2, writes 0xD808 to wc
mbrtowc(&wc, "A", 1, 0) returns 2, writes 0xDF45 to wc
mbrtowc(&wc, "B", 1, 0) returns 1, writes 0x42 to wc

Two problems here:
- the "A" is quietly dropped
- mbrtowc should not return a number greater than the size argument.

I guess the latter point is a good thing insofar as it allows
programs to recognise that something special is going on, but of
course they need to be aware of it in the first place.

If the UTF-8 sequence gets split differently, up to three characters
can end up being dropped:

mbrtowc(&wc, "\xF0\x92\x8D", 3, 0) returns -2
mbrtowc(&wc, "\x85""A", 2, 0) returns 2, writes 0xD808 to wc
mbrtowc(&wc, "BC", 2, 0) returns 2, writes 0xDF45 to wc

Unfortunately I can't see a way to fix this that would comply with
mbrtowc's specification.

The best I can think of is to return the true number of consumed
bytes for the high surrogate, and zero for the low surrogate. The
low surrogate would then need to be differentiated from the null wide
character by checking wc. Hence you'd get:

Andy

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

