This is the mail archive of the
cygwin-developers
mailing list for the Cygwin project.
1.5.24 (and later): race condition in sigproc.cc
- From: "Scott Stanton" <stanton at electric-cloud dot com>
- To: <cygwin-developers at cygwin dot com>
- Cc: "Scott Stanton" <stanton at electric-cloud dot com>
- Date: Fri, 13 Jul 2007 17:09:24 -0700
- Subject: 1.5.24 (and later): race condition in sigproc.cc
I have found what I believe to be a race condition in sigproc.cc and
exceptions.cc. The problem is that any access to the in_dllentry
variable defined in init.cc is vulnerable to a race condition when a new
thread is being initialized.
The initial symptom is that under heavy load on a multiprocessor machine
cygwin processes intermittently fail with a "fork: Resource temporarily
unavailable" error. I tracked this to the sig_send() calls inside
fork(). These calls were failing, causing fork to return EAGAIN. The
sig_send() call was failing on the first no_signals_available() test.
After expanding the macro to see which arm of the test was failing, it
turns out that sig_send() was seeing a non-zero value for in_dllentry.
This boolean is set during the call to dll_entry() whenever a process or
thread attaches to the cygwin dll. Because the in_dllentry variable is
checked without holding a mutex, threads calling sig_send() can
temporarily see the value as true when a new thread is starting. If the
sig_send() code is modified to retry after yielding the processor, the
second attempt succeeds.
I have a test case that reproduces the problem within a few minutes of
run time by doing several compiles at the same time, but it's not really
small enough to submit with a patch.
I have written a workaround patch (see below) that introduces a retry
loop, but I don't think this the right solution. I don't fully
understand the conditions that the in_dllentry flag is trying to protect
against. It seems like this should be done with a mutex instead of an
unprotected access to a global variable. On a multi-cpu system this
sort of unprotected access is vulnerable to visibility and reordering
problems. In addition, there is no guarantee that a new thread won't
enter the dll_entry() function immediately after the test.
Do you have a suggestion for how to handle this properly? Should the
in_dllentry boolean be replaced with a mutex? Or is there some other
way to guard against these failures that would be preferable. Also, the
patch below only fixes one of several uses of in_dllentry.
--Scott
Index: sigproc.cc
===================================================================
RCS file: /cvs/src/src/winsup/cygwin/sigproc.cc,v
retrieving revision 1.303
diff -u -u -p -r1.303 sigproc.cc
--- sigproc.cc 20 Feb 2007 00:16:17 -0000 1.303
+++ sigproc.cc 13 Jul 2007 23:52:27 -0000
@@ -576,8 +576,15 @@ sig_send (_pinfo *p, siginfo_t& si, _cyg
}
else
{
- if (no_signals_available (si.si_signo != __SIGEXIT))
- {
+#define MAX_RETRIES 20
+ int retries = 0;
+ while (no_signals_available (si.si_signo != __SIGEXIT))
+ {
+ if (retries++ < MAX_RETRIES)
+ {
+ low_priority_sleep (0);
+ continue;
+ }
sigproc_printf ("my_sendsig %p, myself->sendsig %p, exit_state
%d",
my_sendsig, myself->sendsig, exit_state);
set_errno (EAGAIN);