Re: hit a KASan bug related to Perf during stress test

From: Oleg Nesterov
Date: Mon Oct 24 2016 - 07:17:03 EST


On 10/24, Peter Zijlstra wrote:
>
> > [32738.867020] [<ffffffff810d9975>] task_tgid_nr_ns+0x35/0xb0
>
> So here we did: perf_event_[pt]id(event, current);
>
> How can _current_ not be valid anymore?

...

> > [32739.040207] [<ffffffff81135a4c>] __call_rcu+0x12c/0x450
>
> And while we just called release_task(), that call_rcu() should still be
> pending at this point,

Yes, current is still valid.

But nothing protects current->group_leader or parent/real_parent; they
can point to an exited/freed task. We really need to nullify them in
__unhash_process() to catch problems like this. I have wanted to do this
many times...

So you simply can't know your tgid or even tid after release_task() calls
__unhash_process(). Actually, this is already true after exit_notify(),
unless the exiting task autoreaps itself.
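
To make the hazard concrete, here is a rough user-space sketch. It is only
a toy model: struct toy_task and the toy_*() helpers are hypothetical
stand-ins, not the kernel's struct task_struct, __unhash_process() or
pid_alive(). The toy clears group_leader on unhash so that misuse is
obvious; in the kernel the pointer is left in place and may dangle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct toy_task {
    int tgid;                       /* id stored on the group leader */
    int hashed;                     /* nonzero until the task is unhashed */
    struct toy_task *group_leader;  /* not safe to follow once unhashed */
};

/* Models unhashing: the task loses its identity links. */
static void toy_unhash(struct toy_task *t)
{
    t->hashed = 0;
    t->group_leader = NULL;  /* toy only: make stale use visible */
}

/* Models an "is this task still hashed?" check. */
static int toy_alive(const struct toy_task *t)
{
    return t->hashed;
}

/* Guarded lookup: follow group_leader only while the task is alive. */
static int toy_tgid(const struct toy_task *t)
{
    return toy_alive(t) ? t->group_leader->tgid : 0;
}

/* Runs the scenario once; returns 0 if the guard behaves as described. */
static int toy_selfcheck(void)
{
    struct toy_task leader = { 42, 1, NULL };
    struct toy_task thread = { 0, 1, &leader };

    leader.group_leader = &leader;

    if (toy_tgid(&thread) != 42)
        return -1;              /* alive: the leader's id is reported */

    toy_unhash(&thread);
    if (toy_tgid(&thread) != 0)
        return -1;              /* unhashed: 0, pointer never followed */

    return 0;
}
```

Calling toy_selfcheck() exercises both cases: the lookup reports the
leader's id while the task is hashed, and 0 afterwards without touching
group_leader.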

How about the trivial fix below?

Oleg.

--- x/kernel/events/core.c
+++ x/kernel/events/core.c
@@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
 	if (event->parent)
 		event = event->parent;
 
-	return task_tgid_nr_ns(p, event->ns);
+	return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
 }
 
 static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)