Re: [PATCH] pidns: Guarantee that the pidns init will be the lastpidns process reaped. v2

From: Oleg Nesterov
Date: Wed May 23 2012 - 10:54:11 EST


On 05/22, Andrew Morton wrote:
>
> On Mon, 21 May 2012 18:20:31 -0600
> ebiederm@xxxxxxxxxxxx (Eric W. Biederman) wrote:
>
> > + /* If we are the last child process in a pid namespace
>
> like this:
> /*
> * If ...
>
> > + * to be reaped notify the child_reaper.
>
> s/reaped/reaped,/
>
> More seriously, it isn't a very good comment. It tells us "what" the
> code is doing (which is pretty obvious from reading it), but it didn't
> tell us "why" it is doing this. Why do PID namespaces need special
> handling here? What's the backstory??

Well, this is documented in the changelog but I agree, this needs some
documentation.

Perhaps zap_pid_ns_processes() should document that it waits for the
stealth EXIT_DEAD tasks, and __unhash_process() can simply say
"see zap_pid_ns_processes()".

> > @@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> > rc = sys_wait4(-1, NULL, __WALL, NULL);
> > } while (rc != -ECHILD);
> >
> > + read_lock(&tasklist_lock);
> > + for (;;) {
> > + __set_current_state(TASK_UNINTERRUPTIBLE);
> > + if (list_empty(&current->children))
> > + break;
> > + read_unlock(&tasklist_lock);
> > + schedule();
> > + read_lock(&tasklist_lock);
> > + }
> > + read_unlock(&tasklist_lock);
>
> Well.
>
> a) This loop can leave the thread in state TASK_UNINTERRUPTIBLE,
> which looks wrong.

OOPS. You fixed this in *-fix.patch, thanks.

> b) Given that the waking side is also testing list_empty(), I think
> you might need set_current_state() here, with the barrier. So that
> this thread gets the correct view of the list_head wrt the waker's
> view.

We rely on tasklist_lock, note that the waking side takes it for writing.

> c) Did we really need to bang on tasklist_lock so many times? There
> doesn't seem a nicer way of structuring this :(

See above.

We do not really need tasklist_lock to wait for list_empty(children),
we could add a couple of barriers or use wait_event(wait_chldexit).
But the explicit usage of tasklist makes the code more understandable,
this this way we obviously can't race with the child doing release_task(),
say, we can't return before the last detach_pid(PIDTYPE_PID).

As for re-structuring, I'd suggest

for (;;) {
bool need_wait = false;

read_lock(&tasklist_lock);
if (!list_empty(&current->children)) {
__set_current_state(TASK_UNINTERRUPTIBLE);
need_wait = true;
}
read_unlock(&tasklist_lock);

if (!need_wait)
break;
schedule();
}

but this is subjective.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/