[tip: sched/urgent] tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue

From: tip-bot2 for Eric W. Biederman
Date: Fri Sep 27 2019 - 04:11:34 EST


The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 0ff7b2cfbae36ebcd216c6a5ad7f8534eebeaee2
Gitweb: https://git.kernel.org/tip/0ff7b2cfbae36ebcd216c6a5ad7f8534eebeaee2
Author: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
AuthorDate: Sat, 14 Sep 2019 07:33:58 -05:00
Committer: Ingo Molnar <mingo@xxxxxxxxxx>
CommitterDate: Wed, 25 Sep 2019 17:42:29 +02:00

tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue

In the ordinary case today the RCU grace period for a task_struct is
triggered when another process waits for its zombie and causes the
kernel to call release_task(). As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task has been removed from the runqueue.

Unfortunately in some cases, such as self-reaping tasks, it can be shown
that release_task() will be called, starting the grace period for
task_struct, long before the task leaves the runqueue.

Therefore use put_task_struct_rcu_user() in finish_task_switch() to
guarantee that there is an RCU lifetime after the task
leaves the runqueue.
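
The effect of the two counters on a self-reaping task can be illustrated
with a small userspace model. This is only a sketch, not kernel code: the
struct and every model_* function are hypothetical stand-ins, plain ints
replace refcount_t, and the RCU grace period is reduced to a flag. It
assumes, as the comments in the diff suggest, that release_task() and
finish_task_switch() each drop one rcu_users reference and that the last
such drop is what starts the grace period.

```c
#include <assert.h>

/* Hypothetical userspace stand-in for task_struct; only the counters matter. */
struct task_model {
	int rcu_users;  /* models tsk->rcu_users */
	int usage;      /* models tsk->usage */
	int gp_started; /* set once the modelled RCU grace period has begun */
	int freed;      /* set once the last usage reference is gone */
};

/* Models the initial counts set in dup_task_struct() after this change. */
static void model_init(struct task_model *t)
{
	t->rcu_users = 2; /* user space visible state + scheduler */
	t->usage = 1;     /* held on behalf of the rcu users */
	t->gp_started = 0;
	t->freed = 0;
}

/* Models put_task_struct_rcu_user(): the last rcu user starts the grace period. */
static void model_put_rcu_user(struct task_model *t)
{
	if (--t->rcu_users == 0)
		t->gp_started = 1;
}

/* Models the end of the grace period: the deferred callback drops usage. */
static void model_grace_period_end(struct task_model *t)
{
	if (t->gp_started && !t->freed && --t->usage == 0)
		t->freed = 1;
}

/* Walks through the self-reaping order of events; returns 1 if the task ends up freed. */
static int model_self_reap_scenario(void)
{
	struct task_model t;

	model_init(&t);

	/* release_task() runs while the task is still on the runqueue. */
	model_put_rcu_user(&t);
	assert(!t.gp_started);      /* scheduler reference still pins it */
	model_grace_period_end(&t); /* nothing to do yet */
	assert(!t.freed);

	/* finish_task_switch() runs after the task's final context switch. */
	model_put_rcu_user(&t);
	assert(t.gp_started && !t.freed);

	/* Only after a full grace period is the task_struct released. */
	model_grace_period_end(&t);
	return t.freed;
}
```

Calling model_self_reap_scenario() from a test harness or a main() walks
through the ordering this patch is concerned with: in the model the grace
period cannot begin until the scheduler's reference has also been dropped.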

Besides the change in the start of the RCU grace period for the
task_struct this change may delay perf_event_delayed_put and
trace_sched_process_free. The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never happen. So
I don't see any problem with delaying it.

The function trace_sched_process_free is a trace point and thus
visible to user space. Occasionally userspace has the strangest
dependencies so this has a minuscule chance of causing a regression.
This change only changes the timing of when the tracepoint is called.
The change in timing arguably gives userspace a more accurate picture
of what is going on. So I don't expect there to be a regression.

In the case where a task self-reaps we are pretty much guaranteed that
the RCU grace period is delayed. So we should get quite a bit of
coverage of this worst case for the change in a normal threaded
workload. So I expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Inspired-by: Oleg Nesterov <oleg@xxxxxxxxxx>
Signed-off-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Cc: Chris Metcalf <cmetcalf@xxxxxxxxxx>
Cc: Christoph Lameter <cl@xxxxxxxxx>
Cc: Davidlohr Bueso <dave@xxxxxxxxxxxx>
Cc: Kirill Tkhai <tkhai@xxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Mike Galbraith <efault@xxxxxx>
Cc: Paul E. McKenney <paulmck@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Russell King - ARM Linux admin <linux@xxxxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@xxxxxxxxxxxxxxxxxxxxx
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
---
kernel/fork.c | 11 +++++++----
kernel/sched/core.c | 2 +-
2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 7eefe33..d6e5525 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -902,10 +902,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;

-	/* One for the user space visible state that goes away when reaped. */
-	refcount_set(&tsk->rcu_users, 1);
-	/* One for the rcu users, and one for the scheduler */
-	refcount_set(&tsk->usage, 2);
+	/*
+	 * One for the user space visible state that goes away when reaped.
+	 * One for the scheduler.
+	 */
+	refcount_set(&tsk->rcu_users, 2);
+	/* One for the rcu users */
+	refcount_set(&tsk->usage, 1);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 06961b9..5e5fefb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3254,7 +3254,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);

-		put_task_struct(prev);
+		put_task_struct_rcu_user(prev);
 	}

 	tick_nohz_task_switch();