Re: Possible race between CPU hotplug and perf_pmu_migrate_context

From: Peter Zijlstra
Date: Mon Sep 08 2014 - 04:39:17 EST


On Fri, Sep 05, 2014 at 10:31:49AM -0700, Linus Torvalds wrote:
> So quite frankly, the whole perf_pmu_migrate_context() thing looks
> completely and fundamentally broken.
>
> Your patch doesn't really solve anything in general, and would need to
> be extended to do that get_online_cpus() around every single case
> where we read it. Which isn't really acceptable.
>
> perf_pmu_migrate_context() is just f*cked up. It needs to be reverted.
> It seems to think that "the event->ctx" pointer is a RCU-protected
> pointer, but it really isn't. Its lifetime is protected by
> refcounting, and the pointer access itself isn't protected by
> anything, because "event->ctx" isn't supposed to change over the
> lifetime of "event". Nobody does all the necessary rcu_read_[un]lock()
> around it, nor access it with rcu_dereference(), because all users
> know that it's just fixed.

Its worse; there's a second site that does exactly the same broken thing
:/

The 'move_group' logic in sys_perf_event_open() migrates a software
event into a hardware context when creating event groups.

> Quite frankly, I think commit 0cda4c023132 ("perf: Introduce
> perf_pmu_migrate_context()") is unfixably broken. The whole thing
> needs to be re-thought. It violates everything that everybody else
> does with the context. No amount of (sane) locking can fix the
> breakage, I suspect.

So yes and no; yes for sanity and consistency, no because after we've
done perf_remove_from_context() the event itself is inactive and we only
need to 'worry' about external (mostly userspace and exit paths)
interaction.

The hard exclusion provided by a rwsem/hotplug lock can do this, but I
agree its quite horrible.

> A possible (but unfortunate) real fix is to try to keep the broken
> code around, and make its broken assumptions be actually true. Make
> "event->ctx" truly be an RCU pointer. Add the RCU annotations, add the
> required rcu read-locking, and make everything really do what that
> currently completely broken perf_pmu_migrate_context() thinks it does.

I'm not entirely sure RCU works here.

> That implies, for example, that any time you use the thing in a
> sleeping context, you'd need to do something like this (possibly with
> a helper function to do that: "get_event_ctx()"
>
> rcu_read_lock();
> ctx = rcu_dereference(event->rcu);
> get_ctx(ctx);
> rcu_read_unlock();
>
> .. you can now sleep here and use 'ctx' ...
>
> put_ctx(ctx);
>
> and if you don't need to sleep, you can just do
>
> rcu_read_lock();
> ctx = rcu_dereference(event->ctx);
> .. use ctx in an atomic context ...
> rcu_read_unlock();
>
> but that is a much bigger patch than either of the "introduce a
> completely broken ad-hoc lock around just one random use case"
> patches.

So the problem with an RCU like approach is that it would allow multiple
users to see different ctxs. Even calling synchronize_rcu() in between
doesn't really solve that because we don't clear the event->ctx pointer
after remove_from_context, install_in_context simply sets a new one.

If we were to clear the context pointer, all usage sites would have to
read/wait until there's a valid pointer again, at which point we've just
build a rwsem with rcu.

> Alternatively (and this sounds like the better fix), use some way to
> avoid that broken perf_pmu_migrate_context() model entirely. Never
> "migrate" events. Create completely new ones on the new CPU, and leave
> the old stale ones alone even on dead CPU's (they will presumably not
> actually be activated, and who the hell cares?).

That appears to simply move the problem; instead of event->ctx being the
problem we then get filp->private_data the exact same problem. It might
be a better delimited problem perhaps, but adds a little complexity
because we need to copy the event state around.

Similarly we can't use RCU here because allowing concurrent operations
on different 'versions' of the event makes it very 'interesting' to sync
state between the old and new object.

> Or even just say: "if somebody takes down a CPU with existing uncore
> events, those events are now effectively dead". Don't try to migrate
> them, don't try to create new events on another CPU, just let it go.
> The CPU is down, the events are inactive.

Which, is indeed very attractive; whoever with the additional
'move_group' problem I'm not entirely seeing how we can make that
happen. It would end up disallowing certain event groups that are
currently possible. Most notable a software group leader with hardware
events -- a most useful construct for those people who don't have
interrupts on their hardware PMUs.

Lemme ponder this a bit more.

Attachment: pgpv3RtVTRXsh.pgp
Description: PGP signature