This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: GB18030 (was: Re: charset changes)


On Mar 27 17:02, Andy Koppe wrote:
> On 27 March 2010 13:33, Corinna Vinschen wrote:
> > On Mar 27 06:47, Andy Koppe wrote:
> >>  "Interestingly", if you give it only
> >> two bytes of a 4-byte GB18030 sequence, e.g. \x95 \x33, it interprets
> >> that as a one-byte invalid sequence followed by the digit '3'.
> >
> > Huh?  How did you test that?  AFAIK MultiByteToWideChar doesn't
> > tell you how many and which bytes it treated as a valid substring.
> 
> On Vista and 7, if you pass those two bytes to MultiByteToWideChar,
> you get back the codepage's UnicodeDefaultChar followed by the digit
> '3'. XP did something else, but I can't remember exactly what.

Heh, ok.  It never occurred to me to test the content of the target
buffer if MultiByteToWideChar failed anyway.
> 
> >> Therefore I think the best thing to do is to manually parse GB18030
> >> sequences, which is fairly straightforward, and only hand complete
> >> sequences over to MultiByteToWideChar for translation to UTF-16. Shall
> >> I have a go at that?
> >
> > I would really be glad.  You'd just create two functions __gb18030_mbtowc
> > and __gb18030_wctomb in strfuncs.cc, and I could easily add it to newlib's
> > setlocale_r.  Oh, and then there's check_codepage in nlsfuncs.cc which
> > needs to test if codepage 54936 is installed.
> >
> > However, here's a problem.  Adding these functions is non-trivial code
> > and requires a copyright assignment... sigh.
> 
> How about implementing __gb18030_mbtowc/wctomb in newlib, which would
> handle all the mbstate stuff, with the actual encoding and decoding
> factored out into functions like this:
> 
> size_t __gb18030_encode(char *dst, const wchar_t *src, size_t
> src_len): Pass in one codepoint, consisting of one or two wchars
> (always one in case of a 32-bit wchar_t). Return the length of the
> resulting multibyte sequence.
> 
> size_t __gb18030_decode(wchar_t *dst, const char *src, size_t
> src_len): Pass in a valid multibyte sequence. Return the number of
> wchars needed to represent it.
> 
> On Cygwin, these would be straightforward wrappers around
> WideCharToMultiByte and MultiByteToWideChar with codepage 54936,
> implemented in winsup. For other newlib targets, we could take a
> similar approach as with doublebyte charsets, where multibyte
> sequences are mapped to a non-Unicode wchar_t representation by simply
> packing the bytes into the wchar_t.

Yet another function call for every single character:
http://sourceware.org/ml/newlib/2009/msg01033.html
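
For reference, the manual parsing step discussed above could look roughly
like the sketch below.  It is not code from Cygwin or newlib.  The names
gb18030_seq_len and gb18030_pack are hypothetical, and the byte ranges are
the usual GB18030 layout: one byte 0x00-0x7F; two bytes with lead 0x81-0xFE
and trail 0x40-0x7E or 0x80-0xFE; four bytes with the pattern 0x81-0xFE,
0x30-0x39, 0x81-0xFE, 0x30-0x39.  The packing helper illustrates the
non-Unicode representation suggested for other newlib targets and assumes
a type of at least 32 bits.

```c
#include <stddef.h>

/* Hypothetical helper: classify the GB18030 sequence starting at s.
   Returns 1, 2 or 4 for a complete sequence of that length,
   0 if the n available bytes are a valid but incomplete prefix,
   and -1 if the bytes cannot start a valid sequence. */
static int gb18030_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;
    if (s[0] <= 0x7F)
        return 1;                       /* single-byte (ASCII) */
    if (s[0] < 0x81 || s[0] > 0xFE)
        return -1;                      /* 0x80 and 0xFF are not lead bytes */
    if (n < 2)
        return 0;
    if ((s[1] >= 0x40 && s[1] <= 0x7E) || (s[1] >= 0x80 && s[1] <= 0xFE))
        return 2;                       /* two-byte sequence */
    if (s[1] < 0x30 || s[1] > 0x39)
        return -1;                      /* neither a trail byte nor a digit */
    if (n < 3)
        return 0;                       /* e.g. \x95 \x33: wait for more input */
    if (s[2] < 0x81 || s[2] > 0xFE)
        return -1;
    if (n < 4)
        return 0;
    if (s[3] < 0x30 || s[3] > 0x39)
        return -1;
    return 4;                           /* four-byte sequence */
}

/* Hypothetical helper: pack a complete sequence of len bytes into one
   integer, most significant byte first.  Needs at least 32 bits for
   four-byte sequences, so it would not fit a 16-bit wchar_t. */
static unsigned long gb18030_pack(const unsigned char *s, int len)
{
    unsigned long v = 0;
    for (int i = 0; i < len; i++)
        v = (v << 8) | s[i];
    return v;
}
```

With this, the two bytes \x95 \x33 from the example above are reported as
an incomplete prefix (return value 0) rather than as an invalid byte
followed by the digit '3', so only complete sequences would ever be handed
to the conversion function.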


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

