Re: Possible bug introduced in commit 9b84cca

From: Oleg Nesterov
Date: Thu Dec 29 2011 - 06:38:39 EST


On 12/28, Denys Vlasenko wrote:
>
> Looks like after commit 9b84cca, waitpid under strace
> sometimes returns bogus ECHILD while child does exist.
>
> I did not yet confirm that the bug appeared exactly
> at this commit - Łukasz says that.
>
> I confirmed that bug exists on kernels 3.1.6 (in Fedora)
> and 3.1.0-rc4 (vanilla).
>
> We have a testcase which spawns N threads, each of them
> performs an infinite loop "fork, exit in child, waitpid
> in parent for the child". When straced, sometimes waitpid
> returns ECHILD.

You mean, the natural parent gets ECHILD, not strace?
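
For reference, the loop from the report ("fork, exit in child, waitpid
in parent for the child") might look roughly like the sketch below. This
is not the actual testcase: the function name fork_wait_loop(), the
bounded iteration count, and the SIGCHLD reset are assumptions made for
illustration. The original loops forever and runs the loop in N threads,
which is omitted here to keep the sketch short.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run "fork, exit in child, waitpid in parent" `iterations` times.
 * Returns the number of times waitpid() failed with ECHILD, or -1 if
 * fork() itself failed. Hypothetical helper, not the real testcase:
 * the original loops forever and runs this in N threads.
 */
int fork_wait_loop(int iterations)
{
	int echild_count = 0;

	/*
	 * Assumption: force the default SIGCHLD disposition. If SIGCHLD
	 * were ignored, ECHILD from waitpid() would be legitimate and
	 * not a sign of the bug.
	 */
	signal(SIGCHLD, SIG_DFL);

	for (int i = 0; i < iterations; i++) {
		pid_t pid = fork();

		if (pid < 0)
			return -1;
		if (pid == 0)
			_exit(42);	/* same exit status as in the trace */

		int status;
		pid_t ret;

		/* Retry if a signal interrupts the wait. */
		do {
			ret = waitpid(pid, &status, 0);
		} while (ret < 0 && errno == EINTR);

		if (ret < 0 && errno == ECHILD) {
			fprintf(stderr, "waitpid(%d): bogus ECHILD\n", (int)pid);
			echild_count++;
		}
	}
	return echild_count;
}
```

On an unaffected kernel, or when not traced, this should return 0.
According to the report, the ECHILD only shows up when the process is
straced, presumably with fork following enabled, since the trace shows
both PIDs.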

> The key part is here:
>
> 931 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf763dbd8) = 1048
> 1048 exit_group(42) = ?
> 931 waitpid(1048, <unfinished ...>
> 1048 +++ exited with 42 +++
> 931 <... waitpid resumed> 0xf763d3a0, 0) = -1 ECHILD (No child processes)

Argh. I seem to understand.

I didn't check, but I think the offending commit is 823b018e5b1196d8
"job control: Small reorganization of wait_consider_task()".

The ptracer sees EXIT_ZOMBIE and temporarily sets EXIT_DEAD; this fools
the ->real_parent.

I need to think. The fix should be simple, but perhaps it is
time to kill EXIT_DEAD altogether. I'll try to make the patch
after vacation, next year ;)

Thanks a lot Denys!

Oleg.
