This is the mail archive of the
cygwin-developers
mailing list for the Cygwin project.
How to make child of failed fork exit cleanly?
- From: Ryan Johnson <ryan dot johnson at cs dot utoronto dot ca>
- To: cygwin-developers at cygwin dot com
- Date: Tue, 03 May 2011 11:46:01 -0400
- Subject: How to make child of failed fork exit cleanly?
Hi all,
I'm working on some changes to fork() which would detect early the case
where a parent-child pair have unresolvable differences in address space
layout (e.g. thread stacks, heaps, or statically-linked dlls which moved).
Detecting the problem turned out to be pretty easy, but making the child
exit cleanly is not. This leads to two questions, followed by what I
have figured out so far while attempting to answer them myself.
1. What's the best way to make a child process notify the parent that
the fork() cannot succeed, and exit cleanly?
2. When the child does exit, how to prevent finalizers from running for
dlls which did not load properly?
Context for the first question: Existing fork failure code calls
api_fatal(), but that sends messages to the terminal and generates a
stack trace, in addition to the desired result of making the parent's
fork() call return an error. Further, Windows 7 treats such an
exit as grounds for an automatic process restart, and respawns the
failed child up to five more times before giving up. The result is a
screen full of error messages and stack traces even if the fork
eventually succeeds. It's especially annoying under terminal apps like
emacs or screen, where the messages clutter up the display pretty badly.
Given that the cause of the fork failure is known (rather than some
surprise or bug), I propose that the messages go to some strace channel
(a new one for fork, perhaps?) and that the child exit without
attempting to generate a dump file (especially since dump generation
itself has a tendency to cause crashes). It would also be good, in cases
where the parent is the reason for fork failures, to prevent Windows
from respawning the process so many times (though it is admittedly handy
when the child was the problem and the fork succeeds on the nth try).
All of this still leaves the question of how to exit the child process
"properly," though. Is it necessary to wait for dll initialization to
finish first, for example?
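
As a rough POSIX-level analogy for the first question (not Cygwin's
internal fork code), the sketch below has the child report an errno-style
code to the parent over a pipe and then leave through _exit(), which
skips atexit handlers and stdio flushing. The names
fork_with_quiet_failure, layout_check_fn, always_mismatch and
never_mismatch are hypothetical, and EAGAIN is only an assumed choice of
error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical check: returns nonzero if the child cannot continue. */
typedef int (*layout_check_fn)(void);

int always_mismatch(void) { return 1; }
int never_mismatch(void)  { return 0; }

/* Returns 0 if the child reported success, the errno-style code it
   reported on failure, or -1 if the reporting machinery itself failed. */
int fork_with_quiet_failure(layout_check_fn check)
{
    int fds[2];
    int err = -1;
    int status;
    ssize_t n;
    pid_t pid;

    assert(check != NULL);
    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: report the verdict, then leave without running atexit
           handlers or flushing stdio (no message, no dump). */
        close(fds[0]);
        err = check() ? EAGAIN : 0;
        n = write(fds[1], &err, sizeof err);
        (void)n;
        _exit(err ? 1 : 0);
    }

    /* Parent: read the child's verdict and reap it. */
    close(fds[1]);
    n = read(fds[0], &err, sizeof err);
    close(fds[0]);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return n == (ssize_t)sizeof err ? err : -1;
}
```

With these stand-ins, fork_with_quiet_failure(always_mismatch) returns
EAGAIN in the parent while the child prints nothing. A real child that
passed the check would carry on running instead of exiting.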
Context for the second question: exiting the child tends to trigger
access violations, often in a pthread_mutex destructor call (la-la
land). Some of these can be avoided by disabling stack dumping from
api_fatal (see separate email about alloca and stack walking), but the
others continue to mystify.
Overall, AFAICT, the cygwin dll design assumes that all dlls have loaded
properly, and a failed fork breaks that invariant. I worry that some
"properly-loaded" dll accesses state of a "not-properly loaded"
dependency, but haven't been able to fully eliminate two simpler
explanations yet:
(a) A statically-linked dll maps to a different address in the child
than the parent, and because copied-over dll state references addresses
which are valid in the parent but not the child, dll initialization
crashes. For example, this was probably responsible for the access
violations I reported earlier [1]. I've verified that this can be
avoided by checking for handle mismatches in dll_list::alloc and forcing
an early exit, but this leads to...
(b) Finalizers run for a dynamically-linked dll which never loaded
(and/or a statically-linked dll which loaded to the wrong location -- I
can't tell). I've tried inserting checks in a few places to not run
finalizers unless the after-fork initialization completed, by extending
dll_list entries to say whether a given dll initialized properly, but
I've clearly not isolated the cause because the access violations
continue. Part of the challenge is that the dll_list copied over from
the parent process will always say that every dll initialized properly.
It also doesn't help that many dll initializers run before cygwin1.dll
(sometimes even other cygwin dlls, if they've been dynamically rebased),
so the value of in_forkee is not reliable.
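
The two checks described in (a) and (b) can be pictured with a small
model. This is only an illustrative sketch: struct dll_entry and the
functions below are made-up names rather than the real dll_list code, and
load addresses are plain integers instead of Windows module handles. One
assumed way around the "copied list always says initialized" problem is
to stamp each entry with the id of the process that initialized it, so
that state inherited from the parent does not count in the child. It
still assumes the stamp is written by code that runs reliably in the
child.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*finalizer_fn)(void);

/* Hypothetical per-dll record; not the real dll_list entry layout. */
struct dll_entry {
    const char  *name;
    uintptr_t    parent_base;  /* load address recorded by the parent */
    uintptr_t    child_base;   /* load address observed in the child */
    long         init_owner;   /* process where init completed, 0 = never */
    finalizer_fn fini;
};

/* (a) Index of the first dll whose load address moved, or -1 if none. */
int first_base_mismatch(const struct dll_entry *list, size_t n)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (list[i].child_base != list[i].parent_base)
            return (int)i;
    return -1;
}

/* Record that this dll finished initializing in process self_id. */
void mark_initialized(struct dll_entry *e, long self_id)
{
    e->init_owner = self_id;
}

/* (b) Run finalizers in reverse order, but only for dlls initialized in
   this process. Entries copied from the parent carry the parent's id
   and are skipped. Returns the number of finalizers run. */
size_t run_finalizers(const struct dll_entry *list, size_t n, long self_id)
{
    size_t ran = 0;
    assert(list != NULL || n == 0);
    for (size_t i = n; i-- > 0; ) {
        if (list[i].init_owner == self_id && list[i].fini != NULL) {
            list[i].fini();
            ran++;
        }
    }
    return ran;
}

/* Demo finalizer that only counts its calls. */
int demo_fini_calls = 0;
void demo_fini(void) { demo_fini_calls++; }
```

In this model a child would call first_base_mismatch() before touching
any copied dll state and bail out on a result other than -1. If it does
reach exit, run_finalizers() with the child's own id skips every entry
that was only ever initialized in the parent.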
[1] http://cygwin.com/ml/cygwin-developers/2011-04/msg00006.html
Thoughts?
Ryan