Re: [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case

From: Eric W. Biederman
Date: Sat May 31 2008 - 20:01:18 EST


Oleg Nesterov <oleg@xxxxxxxxxx> writes:

> Sorry, sorry for the delay,
>
> On 05/20, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@xxxxxxxxxx> writes:
>>
>> > On 03/25, Atsushi Tsuji wrote:
>> >>
>> >> This patch avoid taking tasklist_lock in kill_something_info() when
>> >> the pid is -1. It can convert to rcu_read_lock() for this case because
>> >> group_send_sig_info() doesn't need tasklist_lock.
>> >>
>> >> This patch is for 2.6.25-rc5-mm1.
>> >>
>>
>> > Hmm. Yes, group_send_sig_info() doesn't need tasklist_lock. But we
>> > take tasklist_lock to "freeze" the tasks list, so that we can't miss
>> > a new forked process.
>> >
>> > Same for __kill_pgrp_info(), we take tasklist to kill the whole group
>> > "atomically".
>> >
>> >
>> > However. Is it really needed? copy_process() returns -ERESTARTNOINTR
>> > if signal_pending(), and the new task is always placed at the tail
>> > of the list. Looks like nobody can escape the signal, at least fatal
>> > or SIGSTOP.
>>
>>
>> Call me paranoid but I don't think there is any guarantee without a lock
>> that we will hit the -ERESTARTNOITR check for new processes. I think we
>> have a slight race where the fork process may not have received the signal
>> (because it is near the tail of the list) but the new process would be
>> added to the list immediately after we read it's pointer.
>
> Hmm. could you clarify? I tend to always trust you, just can't understand
> the text above...

We can read old next values when walking the task list under the rcu
lock. So I don't believe we are guaranteed to see additions that
happen while we hold the rcu lock.

If a new process spawns, passes the check for the parent having the
signal, the signal is delivered the signal, and then appends to the
task list. We might miss it. I'm not certain, but that feels right.

> However, I think this patch adds another subtle race which I missed before.
>
> Let's suppose that the task has two threads, A (== main thread) and B. A has
> already exited, B does exec.
>
> In that case it is possible that (without tasklist_lock) kill_something_info()
> sends the signal to the old leader (A), but before group_send_sig_info(A)
> takes ->siglock B switches the leader and does release_task(A). In that
> group_send_sig_info()->lock_task_sighand() fails and we miss the process.

Hmm. Does that problem affect normal signal deliver. I seem to recall
being careful and doing something to make that case work.

Does that fix only apply when we have a specific pid, not when we have
a task and are walking the task_list. Because kill_pid_info can retry
if we receive -ESRCH?

To fix the pid namespace case we need to start walking the list of
pids, not the task list for kill -1.

>> That is subtle. Switching to the per task siglock for protection.
>>
>> > Except: We don't send the signal to /sbin/init. This means that (say)
>> > kill(-1, SIGKILL) can miss the task forked by init. Note that this
>> > task could be forked even before we start kill_something_info(), but
>> > without tasklist there is no guarantee we will see it on the ->tasks
>> > list.
>>
>> Actually we do sent the signal to init but we shouldn't,
>
> Note the (broken) "p->pid > 1" check, kill_something_info() skips init.
> Not that it matters though.

Oh right. I had forgotten about that special case. Grr Special cases
suck!

We have a hole with init spawning new children, during kill -1.
Ugh. Or are those tasks indistinguishable from children spawned by
init just after the signal was sent?

A practical question. I need to rework the signal delivery for
the case of kill -1 to be based on find_ge_pid. So that it
works with pid namespaces. That kills our nice in order list
property. :( A rework does give the opportunity to investigate
these properties in detail, and see if there is a way to drop
the tasklist_lock. Especially since we are already discussing it.

The guarantee is that the signal delivery needs to appear atomic
so that we don't miss any processes in the group of processes
the signal is sent to. The rcu list gives us an effectively atomic
snapshot of the list. However new processes can be added while we
walk that list.

To prevent the current visible DOS we need to only hold processes
in fork for the duration of our traversal of the set of all processes
in the system and then let them go.

The -ERESTARTNOINTR trick ensures the signal is delivered to the
parent before a fork executes.

Init only receives signals that it wants (the rest are effectively
ignored and discared) so it can be said to have forked after signal
delivery when there is a race.

We grab sighand lock on each process that we are delivering the signal
to.

This almost feels like a recipe for atomic signal delivery. Without
an ordered list. I just don't see it yet. I will keep picking at it
since it sounds like we are close to the point where the tasklist is
a problem even on small systems.


Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/