This is the mail archive of the cygwin-developers@cygwin.com mailing list for the Cygwin project.



Re: fix cond_race... was RE: src/winsup/cygwin ChangeLog thread.cc thread.h ...


rethreaded to cygdev...

----- Original Message -----
From: "Jason Tishler" <jason@tishler.net>


> Rob,
>
> On Sun, Oct 07, 2001 at 10:24:30PM +1000, Robert Collins wrote:
> > From: "Jason Tishler" <jason@tishler.net>
> > > Unfortunately, Python's test_threadedtempfile regression test
> > > still hangs (IIRC) in the same place.  See attached for details.
> >
> > I'm going to have to think about this one - unless your system is
> > massively overloaded during the test - such that the spinloop around
> > line 482 is able to get 10 timeslices without the waiting thread
> > getting 1?!? - there should be no way to tickle this.
> >
> > I'd like you to add a system_printf at line 483, something like
> > system_printf ("repulsing event at count 5\n"); (oh, and put it at
> > the PulseEvent in {}). If that fires then we know that the detection
> > code is ok. If so, can you try bumping the spin count up, and make
> > the pulsevent fire if spins mod 5 == 0?
>
> With the attached patch applied to thread.cc version 1.52, Python's
> test_threadedtempfile regression test still hangs in the same place.
> Did I alter the code as you intended above or did I misunderstand?
>
> When I run test_threadedtempfile, I get the following output:
>
>       0 [main] python 2024 pthread_cond::Signal: repulsing event at count 995
> 382380484 [unknown (0x520)] python 2024 pthread_cond::Signal: repulsing event at count 0
> ..
>
> So the repulse event is occurring, but I don't think that it is having
> any effect.

Which means that there are
a) threads listed as waiting, AND
b) none of them have called WaitForSingleObject yet.

> Is there anything else that you would like me to try?

Yes. I'll write more in ~1 hr. The basic game plan is that we need to
figure out why the other thread is not getting to call WFSO. The fix I
put in place is meant to implement the following logic:

Any thread altering the state of a condition object grabs an access
mutex to ensure atomic alterations. The access mutex must never be kept
across blocking system calls.

Threads that want to wait on the cond variable atomically increment a
waiting thread counter, for both performance and race-fixing reasons.
They then release the access mutex and call WFSO on the cond event
object.

These three operations are not bound into an atomic unit.

So to prevent lost signals, upon wake-up, threads atomically decrement
the waiting thread counter _before_ grabbing the cond variable access
mutex. The woken thread will therefore block until the signaller
releases the access mutex.
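
In rough code, the waiter side of that protocol looks something like
the following. This is only a sketch - the names (cond_t, access_mutex,
event, waiting) are illustrative rather than the actual thread.cc
identifiers, and I've left out the handling of the caller's pthread
mutex:

#include <windows.h>

struct cond_t
{
  HANDLE access_mutex;    /* serialises all changes to the cond object */
  HANDLE event;           /* Win32 event object the waiters block on   */
  volatile LONG waiting;  /* count of threads currently waiting        */
};

void
cond_wait (cond_t *cond)
{
  WaitForSingleObject (cond->access_mutex, INFINITE);
  InterlockedIncrement (&cond->waiting);  /* register as a waiter */
  ReleaseMutex (cond->access_mutex);      /* never held across a blocking call */

  /* The gap: the increment above and this wait are not one atomic unit. */
  WaitForSingleObject (cond->event, INFINITE);

  /* Decrement BEFORE re-grabbing the access mutex, so the signaller can
     see that we woke up; then block until the signaller lets go of it. */
  InterlockedDecrement (&cond->waiting);
  WaitForSingleObject (cond->access_mutex, INFINITE);
  ReleaseMutex (cond->access_mutex);
}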

This allows the signaller to detect lost signals - if the waiting
thread count does not decrement, then either you have a crashed thread
(unlikely, as this is completely within our code) or the waiting thread
has not had enough timeslices to call WFSO yet. So the signaller gives
up the CPU and tries again... and again... The count of 5 spins between
repulses is an attempt to avoid releasing multiple threads simply
because we're not waiting long enough between tries!

There is a second potential race in this: multiple waiters entering and
altering the waiting thread count. That is solved by the cond access
mutex, which is kept locked by the signaller.
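
And the signaller side, again only as a sketch of the logic described
above (same illustrative names as before; the spin and repulse counts
are just the ones we've been discussing):

void
cond_signal (cond_t *cond)
{
  WaitForSingleObject (cond->access_mutex, INFINITE);

  LONG before = cond->waiting;
  if (before == 0)
    {
      ReleaseMutex (cond->access_mutex);  /* nobody is waiting - nothing to do */
      return;
    }

  PulseEvent (cond->event);

  /* Lost-signal detection: if no waiter has decremented the count, none
     of them has reached its WFSO yet.  Give up the CPU and try again,
     repulsing the event every 5 spins.  The access mutex stays held, so
     late-arriving waiters cannot alter the count underneath us.  */
  int spins = 0;
  while (cond->waiting == before)
    {
      if (++spins % 5 == 0)
        {
          /* system_printf is the Cygwin-internal debug printf */
          system_printf ("repulsing event after %d spins", spins);
          PulseEvent (cond->event);
        }
      Sleep (0);  /* yield the rest of this timeslice */
    }

  ReleaseMutex (cond->access_mutex);
}

One thing worth keeping in mind: if I remember the Win32 semantics
correctly, Sleep(0) only hands the rest of the timeslice to ready
threads of equal priority, which ties straight into reason 3) below.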

So the problem you have, Jason, is simply that none of the waiting
threads have called WFSO yet, and they are not being given enough CPU
time to do so.

There are several reasons this could happen:
1) The waiting thread count is wrong and there are actually no threads
waiting when Signal occurs. Then the Wait will always fail.
2) There is a synchronisation issue with entry to the cond access mutex
between the waiter and the signaller.
3) (and this is a nasty one) The signaller is at a higher priority level
than the waiter. This will result (if I recall my terminology correctly)
in a priority inversion, which NT does not handle. (This is why hard RT
folk still shun NT kernels.)

For 1), gdb is your friend.
For 2), system_printfs and I are your friends.
For 3), try temporarily dropping the priority of the signaller before it
signals and restoring it before exiting the access mutex
(pthread_setpriority should do) - see the sketch below.

Of course, if python doesn't set thread priority then 3 is unlikely.
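
In case it does, here is roughly what I mean for 3) - drop the
signaller's priority, signal, then restore it. This assumes a
POSIX-style pthread_getschedparam/pthread_setschedparam is available in
your build (the call I named above was from memory), and the helper
name here is just for illustration:

#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: drop this thread's priority, signal, restore. */
void
signal_at_low_priority (pthread_cond_t *cond)
{
  struct sched_param saved, low;
  int policy;

  pthread_getschedparam (pthread_self (), &policy, &saved);
  low = saved;
  low.sched_priority = sched_get_priority_min (policy);

  pthread_setschedparam (pthread_self (), policy, &low);    /* drop    */
  pthread_cond_signal (cond);                               /* signal  */
  pthread_setschedparam (pthread_self (), policy, &saved);  /* restore */
}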

Rob

