[PATCH] sched, cgroup: Use exit hook to avoid use-after-free crash

From: Peter Zijlstra
Date: Fri Dec 24 2010 - 10:59:37 EST


On Fri, 2010-12-24 at 13:16 +0100, Mike Galbraith wrote:
> On Fri, 2010-12-24 at 11:54 +0100, Peter Zijlstra wrote:

> > Right, so the cgroup core is supposed to already emit -EBUSY when there
> > are associated tasks with the cgroup, that _should_ be sufficient, the
> > pre_destroy() method is to frob some extra constraints or somesuch.
> >
> > Our problem looks to be that a task (afaict usually current) changes
> > cgroups without us getting notified of it. On destruction the task is
> > still enqueued in the cfs_rq being destroyed but is not actually part of
> > that cgroup according to the task->css bits.
>
> Could it be an exiting task? We're still preemptible, and iirc, you run
> a CONFIG_PREEMPT kernel. (grasp at all straws;)
>
> cgroup_exit:
> 	/* Reassign the task to the init_css_set. */
> 	task_lock(tsk);
> 	cg = tsk->cgroups;
> 	tsk->cgroups = &init_css_set;
> 	task_unlock(tsk);
> 	if (cg)
> 		put_css_set_taskexit(cg);
>

This straw appears true:

$ grep -e cpu_cgroup\\\|f491447c log9

...

kworker/-1196 0d..2. 1601180us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service
kworker/-1196 0d..2. 1601186us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service
kworker/-1196 0d..2. 1601188us : __dequeue_entity: f491447c from f492a480, 1 left
kworker/-1196 0d..2. 1601188us : pick_next_task_fair: picked: f491447c, modprobe/1210
kworker/-1196 0d..2. 1601192us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service
modprobe-1210 0d..5. 1601802us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..5. 1601807us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601817us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601819us : __enqueue_entity: f491447c to f492a480, 1 tasks
modprobe-1210 0d..2. 1601826us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601832us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601839us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601848us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601854us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601860us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601865us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601871us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601872us : __dequeue_entity: f491447c from f492a480, 1 left
kworker/-1196 0d..2. 1601873us : pick_next_task_fair: picked: f491447c, modprobe/1210
kworker/-1196 0d..2. 1601876us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..7. 1601895us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..7. 1601900us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601909us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601911us : __enqueue_entity: f491447c to f492a480, 1 tasks
modprobe-1210 0d..2. 1601918us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601924us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601931us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602071us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602080us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602089us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602097us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602105us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602107us : __dequeue_entity: f491447c from f492a480, 1 left
kworker/-1196 0d..2. 1602108us : pick_next_task_fair: picked: f491447c, modprobe/1210
kworker/-1196 0d..2. 1602114us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..3. 1602128us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 80, load: 1024, cgroup: /


So cgroup moves a task without calling cgroup_subsys::attach(), which is
odd. It does have an ::exit method, but sadly it calls that _before_
re-assigning the task, which means we have to jump through some hoops.
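
To make the ordering problem concrete, here is a minimal user-space sketch.
It is not kernel code: every model_* name and MODEL_PF_EXITING are
hypothetical stand-ins, and it assumes the exiting flag is already set by
the time the hook runs. It only shows why a group lookup done from inside
an ::exit hook that runs before re-assignment still sees the old group,
unless the lookup special-cases exiting tasks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define MODEL_PF_EXITING 0x4u

struct model_group { const char *name; };

struct model_task {
	unsigned int flags;
	struct model_group *css_group;	/* group the task's css still points at */
	struct model_group *queued_on;	/* group whose runqueue holds the task */
};

static struct model_group model_root = { "/" };

typedef struct model_group *(*model_lookup_fn)(const struct model_task *);

/* Group lookup that short-circuits exiting tasks to the root group. */
static struct model_group *model_task_group(const struct model_task *p)
{
	if (p->flags & MODEL_PF_EXITING)
		return &model_root;
	return p->css_group;
}

/* Same lookup without the exiting check. */
static struct model_group *model_task_group_nocheck(const struct model_task *p)
{
	return p->css_group;
}

/* Stand-in for a "move task" step: requeue on whatever the lookup reports. */
static void model_move_task(struct model_task *p, model_lookup_fn lookup)
{
	assert(p != NULL && lookup != NULL);
	p->queued_on = lookup(p);
}

/* Exit path in the order described above: hook first, re-assign second. */
static void model_exit(struct model_task *p, model_lookup_fn lookup)
{
	p->flags |= MODEL_PF_EXITING;
	model_move_task(p, lookup);	/* the ::exit hook runs here */
	p->css_group = &model_root;	/* re-assignment happens only now */
}
```

Calling model_exit() with model_task_group_nocheck leaves queued_on
pointing at the old group, while model_task_group moves it to the root.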

The below seems to fix the problem for me..

---
Subject: sched, cgroup: Use exit hook to avoid use-after-free crash

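By not notifying the controller of the on-exit move back to
init_css_set, we fail to move the task out of the previous cgroup's
cfs_rq. This leaves a window for a cgroup-destroy to come in and free
the cgroup (there are no active tasks left in it, after all) while the
not-quite-dead task is still enqueued on its cfs_rq.

Fix this by adding a cpu_cgroup_exit() hook that calls
sched_move_task(). Since the hook runs before the task is re-assigned,
make task_group() return root_task_group for PF_EXITING tasks.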

Cc: stable@xxxxxxxxxx
Reported-by: Miklos Vajna <vmiklos@xxxxxxxxxxxxxx>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
---
kernel/sched.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7e401f8..572625c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -611,6 +611,9 @@ static inline struct task_group *task_group(struct task_struct *p)
 	struct task_group *tg;
 	struct cgroup_subsys_state *css;
 
+	if (p->flags & PF_EXITING)
+		return &root_task_group;
+
 	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
 			lockdep_is_held(&task_rq(p)->lock));
 	tg = container_of(css, struct task_group, css);
@@ -8887,6 +8890,12 @@ cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	}
 }
 
+static void
+cpu_cgroup_exit(struct cgroup_subsys *ss, struct task_struct *task)
+{
+	sched_move_task(task);
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 				u64 shareval)
@@ -8959,6 +8968,7 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.destroy	= cpu_cgroup_destroy,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
+	.exit		= cpu_cgroup_exit,
 	.populate	= cpu_cgroup_populate,
 	.subsys_id	= cpu_cgroup_subsys_id,
 	.early_init	= 1,
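
For readers who want the use-after-free window spelled out, the following
is a toy accounting model, again user-space only. The toy_* names are
hypothetical and nothing here is the kernel's data layout. It assumes
destroy is gated solely on the membership count, as described above, and
shows that an exit which does not notify the scheduler leaves the task
queued on a group that destroy is then allowed to free.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; field names are illustrative only. */
struct toy_group {
	int nr_member_tasks;	/* what the destroy check looks at */
	int nr_queued;		/* entities still on this group's runqueue */
	bool freed;
};

struct toy_task {
	struct toy_group *member_of;
	struct toy_group *queued_on;
};

/* Move the task's runqueue entry from its current group to another. */
static void toy_requeue(struct toy_task *t, struct toy_group *to)
{
	assert(t->queued_on->nr_queued > 0);
	t->queued_on->nr_queued--;
	to->nr_queued++;
	t->queued_on = to;
}

/* Exit-time membership move back to root, with or without notification. */
static void toy_exit(struct toy_task *t, struct toy_group *root,
		     bool notify_sched)
{
	if (notify_sched)
		toy_requeue(t, root);	/* what an exit hook would arrange */
	t->member_of->nr_member_tasks--;
	root->nr_member_tasks++;
	t->member_of = root;
}

/* Destroy is refused while members remain; otherwise the group is freed. */
static bool toy_destroy(struct toy_group *g)
{
	if (g->nr_member_tasks > 0)
		return false;
	g->freed = true;
	return true;
}

/* True when the task is still queued on a freed group. */
static bool toy_dangling(const struct toy_task *t)
{
	return t->queued_on->freed;
}
```

With notify_sched false, toy_destroy() succeeds and toy_dangling() reports
the stale queue entry; with notify_sched true the entry has already moved
to the root group before the old group can be freed.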

