This is the mail archive of the
cygwin-apps
mailing list for the Cygwin project.
Re: ITP: rxvt-unicode-X
- From: Thomas Wolff <mined at towo dot net>
- To: "Charles Wilson" <cygwin at cwilson dot fastmail dot fm>, cygwin-apps at cygwin dot com
- Date: Thu, 18 May 2006 20:43:45 +0200
- Subject: Re: ITP: rxvt-unicode-X
I have now nearly finished my Unicode support hook for rxvt on cygwin,
at least as far as Unicode operation is concerned.
There were a few more obstacles to overcome, which I describe below in
case anyone is interested :)
A few problems remain:
* If I start rxvt in NON-Unicode mode, 8 bit input doesn't work. This
also happens with the unpatched rxvt-unicode 6.0 (compiled from the
source archive), but it works in Charles' package, so I hope that
the patch can be applied to the package without introducing this
error.
* The wchar_t type on cygwin is only "unsigned short", which causes a
minor problem with handling Unicode characters beyond 16 bit; my patch
now maps such characters to the Unicode replacement character U+FFFD.
Substituting a sufficiently wide type might work but would require
more subtle modifications to the code.
* Charles pointed out that an application can call setlocale multiple
times, switching encoding dynamically, and that rxvt actually does
that (although I didn't understand for what purpose). Anyway,
a proper substitute for setlocale that mimics this behaviour is
still missing in my patch library.
* Suspected remaining handling bug in 'draw_string' as described below.
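
To illustrate the wchar_t limitation mentioned in the list above, here
is a minimal sketch of narrowing code points to a 16-bit type. It is not
the code in uwc.zip; the names wchar16, narrow_to_wchar16 and
narrow_string are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit wchar_t ("unsigned short"). */
typedef uint16_t wchar16;

#define REPLACEMENT_CHAR 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER */

/* Narrow one Unicode code point to 16 bits.
   Anything that does not fit becomes U+FFFD. */
wchar16 narrow_to_wchar16(uint32_t cp)
{
    if (cp > 0xFFFFu)
        return REPLACEMENT_CHAR;
    return (wchar16)cp;
}

/* Narrow an array of n code points into a caller-provided buffer
   of at least n elements. */
void narrow_string(const uint32_t *in, size_t n, wchar16 *out)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = narrow_to_wchar16(in[i]);
}
```

Using a wider type instead would avoid the substitution, but as noted
above that would need further changes elsewhere in the code.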
To apply the patch, please unzip the uwc.zip archive in the rxvt
src subdirectory. Then invoke the uwc script which applies the patch
generically, by substituting the respective function names in the
source files. The final "return NOCHAR" fix described below still has
to be applied manually, sorry.
The patch can be downloaded from <http://towo.net/mined/cygwin/uwc.zip>
Thomas
------------------------------------------------------------------------
Now about the problems I had:
* First, I had to fix one more bug in my wide character replacement
functions in order to avoid an occasional crash. Alright.
* Then, Unicode input still would not work. I found that I had indeed
overlooked one function that needs replacing: XwcLookupString.
The code in rxvt (command.C) has an alternative invocation of
Xutf8LookupString which is commented "// currently disabled, doesn't
seem to work, nor is useful".
It turns out that it is indeed very useful in making input work; the
reason the disabled rxvt code could not work is that the return
values are not handled properly.
* Finally, there was some occasional weird display garbage remaining
which I am describing below in some detail because there is some
really buggy rxvt code involved.
When displaying a long string on the screen, it may happen that
rxvt splits a single UTF-8 character across successive fills of an
internal buffer. (I could not observe this on Linux, however, where
the buffer always seems to be long enough to hold the complete
output, whereas on cygwin it seems to have a maximum length of 257 bytes.)
Then at the end of the buffer, rxvt invokes mbrtowc with an incomplete
UTF-8 sequence:
mbrtowc (& wc, C3 BC E2, 3, & ps) -> 2, wc = FC
mbrtowc (& wc, E2, 1, & ps) -> -1, wc unchanged
now the continuation of E2, combining to E2 80 A7, the dot symbol U+2027:
mbrtowc (& wc, 80 A7 C3 A4 C3 B6 C3 9F ..., 257, & ps) -> -1, wc unchanged
mbrtowc (& wc, A7 C3 A4 C3 B6 C3 9F E2 ..., 256, & ps) -> -1, wc unchanged
mbrtowc (& wc, C3 A4 C3 B6 C3 9F E2 87 ..., 255, & ps) -> 2 wc = E4
The display produced is "üâ§ä" instead of "ü‧ä".
A sample program xwrite.c demonstrating the bug is included in uwc.zip
(the bug shows only while the "return NOCHAR" fix below has not yet
been applied).
When I further analysed the mbrtowc function (on Linux where it works),
it turned out that it keeps the state of an incomplete UTF-8 sequence
and automatically combines it with the continuation bytes passed in a
later call. Also some comments in the rxvt source suggest that
rxvt might even depend on this undocumented behaviour. So I
reimplemented it in my cygwin mbrtowc replacement but the display
bug remained. It finally turned out that rxvt does not need this
"feature" (or rather bug, as it's not documented), at least not for
screen display.
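
The state-keeping behaviour described above can be sketched as a small
resumable UTF-8 decoder. This is only an illustration under my own
assumptions, not the replacement in uwc.zip and not a drop-in mbrtowc:
the names utf8_state and utf8_resume are hypothetical, and overlong
forms are not rejected. The return value (size_t)-2 for "incomplete so
far" follows the convention ISO C gives for mbrtowc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state: the partially assembled code point and
   the number of continuation bytes still expected. */
typedef struct {
    uint32_t partial;
    int      pending;
} utf8_state;

#define UTF8_INVALID    ((size_t)-1)  /* malformed input, state reset    */
#define UTF8_INCOMPLETE ((size_t)-2)  /* all n bytes used, need more     */

/* Consume bytes from s (at most n), resuming from *st.
   On a completed character, store it in *out and return the number of
   bytes consumed in THIS call. */
size_t utf8_resume(uint32_t *out, const unsigned char *s, size_t n,
                   utf8_state *st)
{
    size_t i = 0;
    assert(out != NULL && s != NULL && st != NULL);
    if (n == 0)
        return UTF8_INCOMPLETE;
    if (st->pending == 0) {            /* start of a new character */
        unsigned char b = s[i++];
        if (b < 0x80) { *out = b; return 1; }
        else if ((b & 0xE0) == 0xC0) { st->partial = b & 0x1F; st->pending = 1; }
        else if ((b & 0xF0) == 0xE0) { st->partial = b & 0x0F; st->pending = 2; }
        else if ((b & 0xF8) == 0xF0) { st->partial = b & 0x07; st->pending = 3; }
        else return UTF8_INVALID;      /* stray continuation or bad lead */
    }
    while (st->pending > 0) {
        unsigned char b;
        if (i == n)
            return UTF8_INCOMPLETE;    /* remember state for next call */
        b = s[i];
        if ((b & 0xC0) != 0x80) {      /* not a continuation byte */
            st->partial = 0;
            st->pending = 0;
            return UTF8_INVALID;
        }
        st->partial = (st->partial << 6) | (uint32_t)(b & 0x3F);
        st->pending--;
        i++;
    }
    *out = st->partial;
    st->partial = 0;
    return i;
}
```

With a zero-initialised state, feeding the split sequence from the trace
above (C3 BC E2, then 80 A7 C3 A4 ...) yields FC, then "incomplete" for
the lone E2, then 2027 once the continuation bytes 80 A7 arrive.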
So I checked the invocations of mbrtowc in rxvt, in command.C and
menubar.C. I thought it was the latter because it's inside a function
called 'draw_string', which quite clearly suggests that it would be
used for screen display, but that was not the case.
Instead, it turned out that the function 'next_char' in command.C
handles screen output, which is really weird (the function is
commented "// read the next octet").
The function contains the return path
if (len == (size_t)-1) {
return *cmdbuf_ptr++;
with the comment
"// the _occasional_ latin1 character is allowed to slip through";
now this sounds mega-weird - why should something that's not right
be allowed to slip through? Anyway, replacing this with just
if (len == (size_t)-1) {
return NOCHAR;
finally solves the display problem and there we are with a working
rxvt-unicode on cygwin.
A remaining issue might be 'draw_string' in menubar.C; I don't know
what its purpose is.
The re-implementation of the setlocale functionality in my replacement
function, which you correctly pointed out, is still pending.
------------------------------------------------------------------------