This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

1.5.24 (and later): race condition in sigproc.cc

From: "Scott Stanton" <stanton at electric-cloud dot com>
To: <cygwin-developers at cygwin dot com>
Cc: "Scott Stanton" <stanton at electric-cloud dot com>
Date: Fri, 13 Jul 2007 17:09:24 -0700
Subject: 1.5.24 (and later): race condition in sigproc.cc

I have found what I believe to be a race condition in sigproc.cc and
exceptions.cc.  The problem is that any access to the in_dllentry
variable defined in init.cc is vulnerable to a race condition when a new
thread is being initialized.

The initial symptom is that under heavy load on a multiprocessor machine
cygwin processes intermittently fail with a "fork: Resource temporarily
unavailable" error.  I tracked this to the sig_send() calls inside
fork().  These calls were failing, causing fork to return EAGAIN.  The
sig_send() call was failing on the first no_signals_available() test.
After expanding the macro to see which arm of the test was failing, it
turns out that sig_send() was seeing a non-zero value for in_dllentry.
This boolean is set during the call to dll_entry() whenever a process or
thread attaches to the cygwin dll.  Because the in_dllentry variable is
checked without holding a mutex, threads calling sig_send() can
temporarily see the value as true when a new thread is starting.  If the
sig_send() code is modified to retry after yielding the processor, the
second attempt succeeds.

I have a test case that reproduces the problem within a few minutes of
run time by doing several compiles at the same time, but it's not really
small enough to submit with a patch.

I have written a workaround patch (see below) that introduces a retry
loop, but I don't think this the right solution.  I don't fully
understand the conditions that the in_dllentry flag is trying to protect
against.  It seems like this should be done with a mutex instead of an
unprotected access to a global variable.  On a multi-cpu system this
sort of unprotected access is vulnerable to visibility and reordering
problems.  In addition, there is no guarantee that a new thread won't
enter the dll_entry() function immediately after the test.

Do you have a suggestion for how to handle this properly?  Should the
in_dllentry boolean be replaced with a mutex?  Or is there some other
way to guard against these failures that would be preferable.  Also, the
patch below only fixes one of several uses of in_dllentry.

--Scott

Index: sigproc.cc
===================================================================
RCS file: /cvs/src/src/winsup/cygwin/sigproc.cc,v
retrieving revision 1.303
diff -u -u -p -r1.303 sigproc.cc
--- sigproc.cc  20 Feb 2007 00:16:17 -0000      1.303
+++ sigproc.cc  13 Jul 2007 23:52:27 -0000
@@ -576,8 +576,15 @@ sig_send (_pinfo *p, siginfo_t& si, _cyg
     }
   else
     {
-      if (no_signals_available (si.si_signo != __SIGEXIT))
-       {
+#define MAX_RETRIES 20
+      int retries = 0;
+      while (no_signals_available (si.si_signo != __SIGEXIT))
+       {
+          if (retries++ < MAX_RETRIES)
+            {
+              low_priority_sleep (0);
+              continue;
+            }
          sigproc_printf ("my_sendsig %p, myself->sendsig %p, exit_state
%d",
                          my_sendsig, myself->sendsig, exit_state);
          set_errno (EAGAIN);

Follow-Ups:
- Re: 1.5.24 (and later): race condition in sigproc.cc
  - From: Christopher Faylor

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]