Re: [PATCH 1/2] pidns: guarantee that the pidns init will be the last pidns process reaped

From: Eric W. Biederman
Date: Wed Jun 13 2012 - 18:17:42 EST


Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> writes:

>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -64,7 +64,6 @@ static void exit_mm(struct task_struct * tsk);
>> static void __unhash_process(struct task_struct *p, bool group_dead)
>> {
>> nr_threads--;
>> - detach_pid(p, PIDTYPE_PID);
>> if (group_dead) {
>> detach_pid(p, PIDTYPE_PGID);
>> detach_pid(p, PIDTYPE_SID);
>> @@ -72,7 +71,20 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
>> list_del_rcu(&p->tasks);
>> list_del_init(&p->sibling);
>> __this_cpu_dec(process_counts);
>> + /*
>> + * If we are the last child process in a pid namespace to be
>> + * reaped, notify the reaper sleeping zap_pid_ns_processes().
>> + */
>> + if (IS_ENABLED(CONFIG_PID_NS)) {
>> + struct task_struct *parent = p->real_parent;
>> +
>> + if ((task_active_pid_ns(p)->child_reaper == parent) &&
>> + list_empty(&parent->children) &&
>> + (parent->flags & PF_EXITING))
>> + wake_up_process(parent);
>> + }
>> }
>> + detach_pid(p, PIDTYPE_PID);
>> list_del_rcu(&p->thread_group);
>> }
>>
>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>> index 57bc1fd..41ed867 100644
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -179,11 +179,31 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>> }
>> read_unlock(&tasklist_lock);
>>
>> + /* Firstly reap the EXIT_ZOMBIE children we may have. */
>> do {
>> clear_thread_flag(TIF_SIGPENDING);
>> rc = sys_wait4(-1, NULL, __WALL, NULL);
>> } while (rc != -ECHILD);
>>
>> + /*
>> + * sys_wait4() above can't reap the TASK_DEAD children.
>> + * Make sure they all go away, see __unhash_process().
>> + */
>> + for (;;) {
>> + bool need_wait = false;
>> +
>> + read_lock(&tasklist_lock);
>> + if (!list_empty(&current->children)) {
>> + __set_current_state(TASK_UNINTERRUPTIBLE);
>> + need_wait = true;
>> + }
>> + read_unlock(&tasklist_lock);
>> +
>> + if (!need_wait)
>> + break;
>> + schedule();
>
> This sleep is terminated by the above wake_up_process(), yes?

Yes.

> But that wake_up_process() only happens if CONFIG_PID_NS. It's
> unobvious (to me) that we can't get stuck in D state if
> CONFIG_PID_NS=n. If bug, please fix. If not bug, please make obvious
> to me ;)

The file pid_namespace.c that holds zap_pid_ns_processes is only
compiled with CONFIG_PID_NS set. So we can't possibly get stuck with
CONFIG_PID_NS=n.

> <looks at the locking a bit>
>
> <gets distracted>
>
> That tty_kref_put() in __exit_signal() is running with tasklist_lock
> held, yes? It does a ton of work and calls out to random drivers and
> none of this needs tasklist_lock. Seems risky.

Interesting. That tty_kref_put does sound like an area where the
locking can be simplified. At the same time tty_kref_put does make
sense from exit signal. As ttys and signals are intimately intertwined.

Thank you for taking the time looking at this and applying this change.

I have to agree that the tasklist_lock is pretty horrible, and that if
we can somehow figure out how to remove it we would be in a much better
situation with lock contention. Unfortunately that is an alligator and
I am working to drain the swamp.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/