Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
From: Stephane Eranian
Date:  Tue Sep 05 2017 - 03:51:46 EST
On Thu, Aug 31, 2017 at 12:51 PM, Stephane Eranian <eranian@xxxxxxxxxx> wrote:
> Hi,
>
> On Thu, Aug 31, 2017 at 10:18 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
>>> On 22.08.2017 23:47, Peter Zijlstra wrote:
>>> > On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>>> >> The key thing in the patch is explicit updating of tstamp fields for
>>> >> INACTIVE events in update_event_times().
>>> >
>>> >> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>>> >>        event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>>> >>            return;
>>> >>
>>> >> +  if (event->state == PERF_EVENT_STATE_INACTIVE)
>>> >> +          perf_event_tstamp_update(event);
>>> >> +
>>> >>    /*
>>> >>     * in cgroup mode, time_enabled represents
>>> >>     * the time the event was enabled AND active
>>> >
>>> > But why!? I thought the whole point was to not need to do this.
>>>
>>> update_event_times() is not called from timer interrupt handler
>>> thus it is not on the critical path which is optimized in this patch set.
>>>
>>> But update_event_times() is called in the context of read() syscall so
>>> this is the place where we may update event times for INACTIVE events
>>> instead of timer interrupt.
>>>
>>> Also update_event_times() is called on thread context switch out so
>>> we get event times also updated when the thread migrates to other CPU.
>>>
>>> >
>>> > The thing I outlined earlier would only need to update timestamps when
>>> > events change state and at no other point in time.
>>>
>>> But we still may request times while event is in INACTIVE state
>>> thru read() syscall and event timings need to be up-to-date.
>>
>> Sure, read() also updates.
>>
>> So the below completely rewrites timekeeping (and probably breaks
>> world) but does away with the need to touch events that don't get
>> scheduled.
>>
>> Esp the cgroup stuff is entirely untested since I simply don't know how
>> to operate that. I did run Vince's tests on it, and I think it doesn't
>> regress, but I'm near a migraine so I can't really see straight atm.
>>
>> Vince, Stephane, could you guys have a peek?
>>
> okay, I will run some tests with cgroups on my systems.
>
I ran some cgroups tests, including multiplexing and so far it appears to work
normally.
It is easy to create a cgroup and move a shell into it:
$ mount -t cgroup none /sys/fs/cgroups
$ cd /sys/fs/cgroups/perf_events
$ mkdir memtoy
$ cd memtoy
$ echo $$ >tasks
At this point your shell is part of the cgroup.
Then you can use perf to monitor globally or inside the cgroup:
$ perf stat -a -e cycles,cycles -G memtoy -I 1000 sleep 1000
That monitors cycles on all CPUs twice, once only when a member
of the cgroup memtoy runs, and the other globally.
>> (There's a few other bits in, I'll break up into patches and write
>> comments and Changelogs later, I think its can be split in some 5
>> patches).
>>
>> The basic idea is really simple, we have a single timestamp and
>> depending on the state we update enabled/running. This obviously only
>> requires updates when we change state and when we need up-to-date
>> timestamps (read).
>>
>> No more weird and wonderful mind bending interaction between 3 different
>> timestamps with arcane update rules.
>>
>> ---
>>  include/linux/perf_event.h |  25 +-
>>  kernel/events/core.c       | 551 ++++++++++++++++-----------------------------
>>  2 files changed, 192 insertions(+), 384 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 8e22f24ded6a..2a6ae48a1a96 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -485,9 +485,9 @@ struct perf_addr_filters_head {
>>  };
>>
>>  /**
>> - * enum perf_event_active_state - the states of a event
>> + * enum perf_event_state - the states of a event
>>   */
>> -enum perf_event_active_state {
>> +enum perf_event_state {
>>         PERF_EVENT_STATE_DEAD           = -4,
>>         PERF_EVENT_STATE_EXIT           = -3,
>>         PERF_EVENT_STATE_ERROR          = -2,
>> @@ -578,7 +578,7 @@ struct perf_event {
>>         struct pmu                      *pmu;
>>         void                            *pmu_private;
>>
>> -       enum perf_event_active_state    state;
>> +       enum perf_event_state           state;
>>         unsigned int                    attach_state;
>>         local64_t                       count;
>>         atomic64_t                      child_count;
>> @@ -588,26 +588,10 @@ struct perf_event {
>>          * has been enabled (i.e. eligible to run, and the task has
>>          * been scheduled in, if this is a per-task event)
>>          * and running (scheduled onto the CPU), respectively.
>> -        *
>> -        * They are computed from tstamp_enabled, tstamp_running and
>> -        * tstamp_stopped when the event is in INACTIVE or ACTIVE state.
>>          */
>>         u64                             total_time_enabled;
>>         u64                             total_time_running;
>> -
>> -       /*
>> -        * These are timestamps used for computing total_time_enabled
>> -        * and total_time_running when the event is in INACTIVE or
>> -        * ACTIVE state, measured in nanoseconds from an arbitrary point
>> -        * in time.
>> -        * tstamp_enabled: the notional time when the event was enabled
>> -        * tstamp_running: the notional time when the event was scheduled on
>> -        * tstamp_stopped: in INACTIVE state, the notional time when the
>> -        *      event was scheduled off.
>> -        */
>> -       u64                             tstamp_enabled;
>> -       u64                             tstamp_running;
>> -       u64                             tstamp_stopped;
>> +       u64                             tstamp;
>>
>>         /*
>>          * timestamp shadows the actual context timing but it can
>> @@ -699,7 +683,6 @@ struct perf_event {
>>
>>  #ifdef CONFIG_CGROUP_PERF
>>         struct perf_cgroup              *cgrp; /* cgroup event is attach to */
>> -       int                             cgrp_defer_enabled;
>>  #endif
>>
>>         struct list_head                sb_list;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 294f1927f944..e968b3eab9c7 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event)
>>         return event->clock();
>>  }
>>
>> +/*
>> + * XXX comment about timekeeping goes here
>> + */
>> +
>> +static __always_inline enum perf_event_state
>> +__perf_effective_state(struct perf_event *event)
>> +{
>> +       struct perf_event *leader = event->group_leader;
>> +
>> +       if (leader->state <= PERF_EVENT_STATE_OFF)
>> +               return leader->state;
>> +
>> +       return event->state;
>> +}
>> +
>> +static __always_inline void
>> +__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
>> +{
>> +       enum perf_event_state state = __perf_effective_state(event);
>> +       u64 delta = now - event->tstamp;
>> +
>> +       *enabled = event->total_time_enabled;
>> +       if (state >= PERF_EVENT_STATE_INACTIVE)
>> +               *enabled += delta;
>> +
>> +       *running = event->total_time_running;
>> +       if (state >= PERF_EVENT_STATE_ACTIVE)
>> +               *running += delta;
>> +}
>> +
>> +static void perf_event_update_time(struct perf_event *event)
>> +{
>> +       u64 now = perf_event_time(event);
>> +
>> +       __perf_update_times(event, now, &event->total_time_enabled,
>> +                                       &event->total_time_running);
>> +       event->tstamp = now;
>> +}
>> +
>> +static void perf_event_update_sibling_time(struct perf_event *leader)
>> +{
>> +       struct perf_event *sibling;
>> +
>> +       list_for_each_entry(sibling, &leader->sibling_list, group_entry)
>> +               perf_event_update_time(sibling);
>> +}
>> +
>> +static void
>> +perf_event_set_state(struct perf_event *event, enum perf_event_state state)
>> +{
>> +       if (event->state == state)
>> +               return;
>> +
>> +       perf_event_update_time(event);
>> +       /*
>> +        * If a group leader gets enabled/disabled all its siblings
>> +        * are affected too.
>> +        */
>> +       if ((event->state < 0) ^ (state < 0))
>> +               perf_event_update_sibling_time(event);
>> +
>> +       WRITE_ONCE(event->state, state);
>> +}
>> +
>>  #ifdef CONFIG_CGROUP_PERF
>>
>>  static inline bool
>> @@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
>>         event->shadow_ctx_time = now - t->timestamp;
>>  }
>>
>> -static inline void
>> -perf_cgroup_defer_enabled(struct perf_event *event)
>> -{
>> -       /*
>> -        * when the current task's perf cgroup does not match
>> -        * the event's, we need to remember to call the
>> -        * perf_mark_enable() function the first time a task with
>> -        * a matching perf cgroup is scheduled in.
>> -        */
>> -       if (is_cgroup_event(event) && !perf_cgroup_match(event))
>> -               event->cgrp_defer_enabled = 1;
>> -}
>> -
>> -static inline void
>> -perf_cgroup_mark_enabled(struct perf_event *event,
>> -                        struct perf_event_context *ctx)
>> -{
>> -       struct perf_event *sub;
>> -       u64 tstamp = perf_event_time(event);
>> -
>> -       if (!event->cgrp_defer_enabled)
>> -               return;
>> -
>> -       event->cgrp_defer_enabled = 0;
>> -
>> -       event->tstamp_enabled = tstamp - event->total_time_enabled;
>> -       list_for_each_entry(sub, &event->sibling_list, group_entry) {
>> -               if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
>> -                       sub->tstamp_enabled = tstamp - sub->total_time_enabled;
>> -                       sub->cgrp_defer_enabled = 0;
>> -               }
>> -       }
>> -}
>> -
>>  /*
>>   * Update cpuctx->cgrp so that it is set when first cgroup event is added and
>>   * cleared when last cgroup event is removed.
>> @@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
>>  }
>>
>>  static inline void
>> -perf_cgroup_defer_enabled(struct perf_event *event)
>> -{
>> -}
>> -
>> -static inline void
>> -perf_cgroup_mark_enabled(struct perf_event *event,
>> -                        struct perf_event_context *ctx)
>> -{
>> -}
>> -
>> -static inline void
>>  list_update_cgroup_event(struct perf_event *event,
>>                          struct perf_event_context *ctx, bool add)
>>  {
>> @@ -1396,60 +1415,6 @@ static u64 perf_event_time(struct perf_event *event)
>>         return ctx ? ctx->time : 0;
>>  }
>>
>> -/*
>> - * Update the total_time_enabled and total_time_running fields for a event.
>> - */
>> -static void update_event_times(struct perf_event *event)
>> -{
>> -       struct perf_event_context *ctx = event->ctx;
>> -       u64 run_end;
>> -
>> -       lockdep_assert_held(&ctx->lock);
>> -
>> -       if (event->state < PERF_EVENT_STATE_INACTIVE ||
>> -           event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>> -               return;
>> -
>> -       /*
>> -        * in cgroup mode, time_enabled represents
>> -        * the time the event was enabled AND active
>> -        * tasks were in the monitored cgroup. This is
>> -        * independent of the activity of the context as
>> -        * there may be a mix of cgroup and non-cgroup events.
>> -        *
>> -        * That is why we treat cgroup events differently
>> -        * here.
>> -        */
>> -       if (is_cgroup_event(event))
>> -               run_end = perf_cgroup_event_time(event);
>> -       else if (ctx->is_active)
>> -               run_end = ctx->time;
>> -       else
>> -               run_end = event->tstamp_stopped;
>> -
>> -       event->total_time_enabled = run_end - event->tstamp_enabled;
>> -
>> -       if (event->state == PERF_EVENT_STATE_INACTIVE)
>> -               run_end = event->tstamp_stopped;
>> -       else
>> -               run_end = perf_event_time(event);
>> -
>> -       event->total_time_running = run_end - event->tstamp_running;
>> -
>> -}
>> -
>> -/*
>> - * Update total_time_enabled and total_time_running for all events in a group.
>> - */
>> -static void update_group_times(struct perf_event *leader)
>> -{
>> -       struct perf_event *event;
>> -
>> -       update_event_times(leader);
>> -       list_for_each_entry(event, &leader->sibling_list, group_entry)
>> -               update_event_times(event);
>> -}
>> -
>>  static enum event_type_t get_event_type(struct perf_event *event)
>>  {
>>         struct perf_event_context *ctx = event->ctx;
>> @@ -1492,6 +1457,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>>         WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
>>         event->attach_state |= PERF_ATTACH_CONTEXT;
>>
>> +       event->tstamp = perf_event_time(event);
>> +
>>         /*
>>          * If we're a stand alone event or group leader, we go to the context
>>          * list, group events are kept attached to the group so that
>> @@ -1699,8 +1666,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>>         if (event->group_leader == event)
>>                 list_del_init(&event->group_entry);
>>
>> -       update_group_times(event);
>> -
>>         /*
>>          * If event was in error state, then keep it
>>          * that way, otherwise bogus counts will be
>> @@ -1709,7 +1674,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>>          * of the event
>>          */
>>         if (event->state > PERF_EVENT_STATE_OFF)
>> -               event->state = PERF_EVENT_STATE_OFF;
>> +               perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>>
>>         ctx->generation++;
>>  }
>> @@ -1808,38 +1773,24 @@ event_sched_out(struct perf_event *event,
>>                   struct perf_cpu_context *cpuctx,
>>                   struct perf_event_context *ctx)
>>  {
>> -       u64 tstamp = perf_event_time(event);
>> -       u64 delta;
>> +       enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
>>
>>         WARN_ON_ONCE(event->ctx != ctx);
>>         lockdep_assert_held(&ctx->lock);
>>
>> -       /*
>> -        * An event which could not be activated because of
>> -        * filter mismatch still needs to have its timings
>> -        * maintained, otherwise bogus information is return
>> -        * via read() for time_enabled, time_running:
>> -        */
>> -       if (event->state == PERF_EVENT_STATE_INACTIVE &&
>> -           !event_filter_match(event)) {
>> -               delta = tstamp - event->tstamp_stopped;
>> -               event->tstamp_running += delta;
>> -               event->tstamp_stopped = tstamp;
>> -       }
>> -
>>         if (event->state != PERF_EVENT_STATE_ACTIVE)
>>                 return;
>>
>>         perf_pmu_disable(event->pmu);
>>
>> -       event->tstamp_stopped = tstamp;
>>         event->pmu->del(event, 0);
>>         event->oncpu = -1;
>> -       event->state = PERF_EVENT_STATE_INACTIVE;
>> +
>>         if (event->pending_disable) {
>>                 event->pending_disable = 0;
>> -               event->state = PERF_EVENT_STATE_OFF;
>> +               state = PERF_EVENT_STATE_OFF;
>>         }
>> +       perf_event_set_state(event, state);
>>
>>         if (!is_software_event(event))
>>                 cpuctx->active_oncpu--;
>> @@ -1859,7 +1810,9 @@ group_sched_out(struct perf_event *group_event,
>>                 struct perf_event_context *ctx)
>>  {
>>         struct perf_event *event;
>> -       int state = group_event->state;
>> +
>> +       if (group_event->state != PERF_EVENT_STATE_ACTIVE)
>> +               return;
>>
>>         perf_pmu_disable(ctx->pmu);
>>
>> @@ -1873,7 +1826,7 @@ group_sched_out(struct perf_event *group_event,
>>
>>         perf_pmu_enable(ctx->pmu);
>>
>> -       if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
>> +       if (group_event->attr.exclusive)
>>                 cpuctx->exclusive = 0;
>>  }
>>
>> @@ -1893,6 +1846,11 @@ __perf_remove_from_context(struct perf_event *event,
>>  {
>>         unsigned long flags = (unsigned long)info;
>>
>> +       if (ctx->is_active & EVENT_TIME) {
>> +               update_context_time(ctx);
>> +               update_cgrp_time_from_cpuctx(cpuctx);
>> +       }
>> +
>>         event_sched_out(event, cpuctx, ctx);
>>         if (flags & DETACH_GROUP)
>>                 perf_group_detach(event);
>> @@ -1955,14 +1913,17 @@ static void __perf_event_disable(struct perf_event *event,
>>         if (event->state < PERF_EVENT_STATE_INACTIVE)
>>                 return;
>>
>> -       update_context_time(ctx);
>> -       update_cgrp_time_from_event(event);
>> -       update_group_times(event);
>> +       if (ctx->is_active & EVENT_TIME) {
>> +               update_context_time(ctx);
>> +               update_cgrp_time_from_cpuctx(cpuctx);
>> +       }
>> +
>>         if (event == event->group_leader)
>>                 group_sched_out(event, cpuctx, ctx);
>>         else
>>                 event_sched_out(event, cpuctx, ctx);
>> -       event->state = PERF_EVENT_STATE_OFF;
>> +
>> +       perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>>  }
>>
>>  /*
>> @@ -2019,8 +1980,7 @@ void perf_event_disable_inatomic(struct perf_event *event)
>>  }
>>
>>  static void perf_set_shadow_time(struct perf_event *event,
>> -                                struct perf_event_context *ctx,
>> -                                u64 tstamp)
>> +                                struct perf_event_context *ctx)
>>  {
>>         /*
>>          * use the correct time source for the time snapshot
>> @@ -2048,9 +2008,9 @@ static void perf_set_shadow_time(struct perf_event *event,
>>          * is cleaner and simpler to understand.
>>          */
>>         if (is_cgroup_event(event))
>> -               perf_cgroup_set_shadow_time(event, tstamp);
>> +               perf_cgroup_set_shadow_time(event, event->tstamp);
>>         else
>> -               event->shadow_ctx_time = tstamp - ctx->timestamp;
>> +               event->shadow_ctx_time = event->tstamp - ctx->timestamp;
>>  }
>>
>>  #define MAX_INTERRUPTS (~0ULL)
>> @@ -2063,7 +2023,6 @@ event_sched_in(struct perf_event *event,
>>                  struct perf_cpu_context *cpuctx,
>>                  struct perf_event_context *ctx)
>>  {
>> -       u64 tstamp = perf_event_time(event);
>>         int ret = 0;
>>
>>         lockdep_assert_held(&ctx->lock);
>> @@ -2077,7 +2036,7 @@ event_sched_in(struct perf_event *event,
>>          * is visible.
>>          */
>>         smp_wmb();
>> -       WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE);
>> +       perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE);
>>
>>         /*
>>          * Unthrottle events, since we scheduled we might have missed several
>> @@ -2089,26 +2048,19 @@ event_sched_in(struct perf_event *event,
>>                 event->hw.interrupts = 0;
>>         }
>>
>> -       /*
>> -        * The new state must be visible before we turn it on in the hardware:
>> -        */
>> -       smp_wmb();
>> -
>>         perf_pmu_disable(event->pmu);
>>
>> -       perf_set_shadow_time(event, ctx, tstamp);
>> +       perf_set_shadow_time(event, ctx);
>>
>>         perf_log_itrace_start(event);
>>
>>         if (event->pmu->add(event, PERF_EF_START)) {
>> -               event->state = PERF_EVENT_STATE_INACTIVE;
>> +               perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>>                 event->oncpu = -1;
>>                 ret = -EAGAIN;
>>                 goto out;
>>         }
>>
>> -       event->tstamp_running += tstamp - event->tstamp_stopped;
>> -
>>         if (!is_software_event(event))
>>                 cpuctx->active_oncpu++;
>>         if (!ctx->nr_active++)
>> @@ -2132,8 +2084,6 @@ group_sched_in(struct perf_event *group_event,
>>  {
>>         struct perf_event *event, *partial_group = NULL;
>>         struct pmu *pmu = ctx->pmu;
>> -       u64 now = ctx->time;
>> -       bool simulate = false;
>>
>>         if (group_event->state == PERF_EVENT_STATE_OFF)
>>                 return 0;
>> @@ -2163,27 +2113,13 @@ group_sched_in(struct perf_event *group_event,
>>         /*
>>          * Groups can be scheduled in as one unit only, so undo any
>>          * partial group before returning:
>> -        * The events up to the failed event are scheduled out normally,
>> -        * tstamp_stopped will be updated.
>> -        *
>> -        * The failed events and the remaining siblings need to have
>> -        * their timings updated as if they had gone thru event_sched_in()
>> -        * and event_sched_out(). This is required to get consistent timings
>> -        * across the group. This also takes care of the case where the group
>> -        * could never be scheduled by ensuring tstamp_stopped is set to mark
>> -        * the time the event was actually stopped, such that time delta
>> -        * calculation in update_event_times() is correct.
>> +        * The events up to the failed event are scheduled out normally.
>>          */
>>         list_for_each_entry(event, &group_event->sibling_list, group_entry) {
>>                 if (event == partial_group)
>> -                       simulate = true;
>> +                       break;
>>
>> -               if (simulate) {
>> -                       event->tstamp_running += now - event->tstamp_stopped;
>> -                       event->tstamp_stopped = now;
>> -               } else {
>> -                       event_sched_out(event, cpuctx, ctx);
>> -               }
>> +               event_sched_out(event, cpuctx, ctx);
>>         }
>>         event_sched_out(group_event, cpuctx, ctx);
>>
>> @@ -2225,46 +2161,11 @@ static int group_can_go_on(struct perf_event *event,
>>         return can_add_hw;
>>  }
>>
>> -/*
>> - * Complement to update_event_times(). This computes the tstamp_* values to
>> - * continue 'enabled' state from @now, and effectively discards the time
>> - * between the prior tstamp_stopped and now (as we were in the OFF state, or
>> - * just switched (context) time base).
>> - *
>> - * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
>> - * cannot have been scheduled in yet. And going into INACTIVE state means
>> - * '@event->tstamp_stopped = @now'.
>> - *
>> - * Thus given the rules of update_event_times():
>> - *
>> - *   total_time_enabled = tstamp_stopped - tstamp_enabled
>> - *   total_time_running = tstamp_stopped - tstamp_running
>> - *
>> - * We can insert 'tstamp_stopped == now' and reverse them to compute new
>> - * tstamp_* values.
>> - */
>> -static void __perf_event_enable_time(struct perf_event *event, u64 now)
>> -{
>> -       WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
>> -
>> -       event->tstamp_stopped = now;
>> -       event->tstamp_enabled = now - event->total_time_enabled;
>> -       event->tstamp_running = now - event->total_time_running;
>> -}
>> -
>>  static void add_event_to_ctx(struct perf_event *event,
>>                                struct perf_event_context *ctx)
>>  {
>> -       u64 tstamp = perf_event_time(event);
>> -
>>         list_add_event(event, ctx);
>>         perf_group_attach(event);
>> -       /*
>> -        * We can be called with event->state == STATE_OFF when we create with
>> -        * .disabled = 1. In that case the IOC_ENABLE will call this function.
>> -        */
>> -       if (event->state == PERF_EVENT_STATE_INACTIVE)
>> -               __perf_event_enable_time(event, tstamp);
>>  }
>>
>>  static void ctx_sched_out(struct perf_event_context *ctx,
>> @@ -2496,28 +2397,6 @@ perf_install_in_context(struct perf_event_context *ctx,
>>  }
>>
>>  /*
>> - * Put a event into inactive state and update time fields.
>> - * Enabling the leader of a group effectively enables all
>> - * the group members that aren't explicitly disabled, so we
>> - * have to update their ->tstamp_enabled also.
>> - * Note: this works for group members as well as group leaders
>> - * since the non-leader members' sibling_lists will be empty.
>> - */
>> -static void __perf_event_mark_enabled(struct perf_event *event)
>> -{
>> -       struct perf_event *sub;
>> -       u64 tstamp = perf_event_time(event);
>> -
>> -       event->state = PERF_EVENT_STATE_INACTIVE;
>> -       __perf_event_enable_time(event, tstamp);
>> -       list_for_each_entry(sub, &event->sibling_list, group_entry) {
>> -               /* XXX should not be > INACTIVE if event isn't */
>> -               if (sub->state >= PERF_EVENT_STATE_INACTIVE)
>> -                       __perf_event_enable_time(sub, tstamp);
>> -       }
>> -}
>> -
>> -/*
>>   * Cross CPU call to enable a performance event
>>   */
>>  static void __perf_event_enable(struct perf_event *event,
>> @@ -2535,14 +2414,12 @@ static void __perf_event_enable(struct perf_event *event,
>>         if (ctx->is_active)
>>                 ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>>
>> -       __perf_event_mark_enabled(event);
>> +       perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>>
>>         if (!ctx->is_active)
>>                 return;
>>
>>         if (!event_filter_match(event)) {
>> -               if (is_cgroup_event(event))
>> -                       perf_cgroup_defer_enabled(event);
>>                 ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
>>                 return;
>>         }
>> @@ -2862,18 +2739,10 @@ static void __perf_event_sync_stat(struct perf_event *event,
>>          * we know the event must be on the current CPU, therefore we
>>          * don't need to use it.
>>          */
>> -       switch (event->state) {
>> -       case PERF_EVENT_STATE_ACTIVE:
>> +       if (event->state == PERF_EVENT_STATE_ACTIVE)
>>                 event->pmu->read(event);
>> -               /* fall-through */
>>
>> -       case PERF_EVENT_STATE_INACTIVE:
>> -               update_event_times(event);
>> -               break;
>> -
>> -       default:
>> -               break;
>> -       }
>> +       perf_event_update_time(event);
>>
>>         /*
>>          * In order to keep per-task stats reliable we need to flip the event
>> @@ -3110,10 +2979,6 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>>                 if (!event_filter_match(event))
>>                         continue;
>>
>> -               /* may need to reset tstamp_enabled */
>> -               if (is_cgroup_event(event))
>> -                       perf_cgroup_mark_enabled(event, ctx);
>> -
>>                 if (group_can_go_on(event, cpuctx, 1))
>>                         group_sched_in(event, cpuctx, ctx);
>>
>> @@ -3121,10 +2986,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>>                  * If this pinned group hasn't been scheduled,
>>                  * put it in error state.
>>                  */
>> -               if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> -                       update_group_times(event);
>> -                       event->state = PERF_EVENT_STATE_ERROR;
>> -               }
>> +               if (event->state == PERF_EVENT_STATE_INACTIVE)
>> +                       perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>>         }
>>  }
>>
>> @@ -3146,10 +3009,6 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>>                 if (!event_filter_match(event))
>>                         continue;
>>
>> -               /* may need to reset tstamp_enabled */
>> -               if (is_cgroup_event(event))
>> -                       perf_cgroup_mark_enabled(event, ctx);
>> -
>>                 if (group_can_go_on(event, cpuctx, can_add_hw)) {
>>                         if (group_sched_in(event, cpuctx, ctx))
>>                                 can_add_hw = 0;
>> @@ -3541,7 +3400,7 @@ static int event_enable_on_exec(struct perf_event *event,
>>         if (event->state >= PERF_EVENT_STATE_INACTIVE)
>>                 return 0;
>>
>> -       __perf_event_mark_enabled(event);
>> +       perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>>
>>         return 1;
>>  }
>> @@ -3590,12 +3449,6 @@ static void perf_event_enable_on_exec(int ctxn)
>>                 put_ctx(clone_ctx);
>>  }
>>
>> -struct perf_read_data {
>> -       struct perf_event *event;
>> -       bool group;
>> -       int ret;
>> -};
>> -
>>  static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>>  {
>>         u16 local_pkg, event_pkg;
>> @@ -3613,64 +3466,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>>         return event_cpu;
>>  }
>>
>> -/*
>> - * Cross CPU call to read the hardware event
>> - */
>> -static void __perf_event_read(void *info)
>> -{
>> -       struct perf_read_data *data = info;
>> -       struct perf_event *sub, *event = data->event;
>> -       struct perf_event_context *ctx = event->ctx;
>> -       struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>> -       struct pmu *pmu = event->pmu;
>> -
>> -       /*
>> -        * If this is a task context, we need to check whether it is
>> -        * the current task context of this cpu.  If not it has been
>> -        * scheduled out before the smp call arrived.  In that case
>> -        * event->count would have been updated to a recent sample
>> -        * when the event was scheduled out.
>> -        */
>> -       if (ctx->task && cpuctx->task_ctx != ctx)
>> -               return;
>> -
>> -       raw_spin_lock(&ctx->lock);
>> -       if (ctx->is_active) {
>> -               update_context_time(ctx);
>> -               update_cgrp_time_from_event(event);
>> -       }
>> -
>> -       update_event_times(event);
>> -       if (event->state != PERF_EVENT_STATE_ACTIVE)
>> -               goto unlock;
>> -
>> -       if (!data->group) {
>> -               pmu->read(event);
>> -               data->ret = 0;
>> -               goto unlock;
>> -       }
>> -
>> -       pmu->start_txn(pmu, PERF_PMU_TXN_READ);
>> -
>> -       pmu->read(event);
>> -
>> -       list_for_each_entry(sub, &event->sibling_list, group_entry) {
>> -               update_event_times(sub);
>> -               if (sub->state == PERF_EVENT_STATE_ACTIVE) {
>> -                       /*
>> -                        * Use sibling's PMU rather than @event's since
>> -                        * sibling could be on different (eg: software) PMU.
>> -                        */
>> -                       sub->pmu->read(sub);
>> -               }
>> -       }
>> -
>> -       data->ret = pmu->commit_txn(pmu);
>> -
>> -unlock:
>> -       raw_spin_unlock(&ctx->lock);
>> -}
>> -
>>  static inline u64 perf_event_count(struct perf_event *event)
>>  {
>>         return local64_read(&event->count) + atomic64_read(&event->child_count);
>> @@ -3733,63 +3528,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
>>         return ret;
>>  }
>>
>> -static int perf_event_read(struct perf_event *event, bool group)
>> +struct perf_read_data {
>> +       struct perf_event *event;
>> +       bool group;
>> +       int ret;
>> +};
>> +
>> +static void __perf_event_read(struct perf_event *event,
>> +                             struct perf_cpu_context *cpuctx,
>> +                             struct perf_event_context *ctx,
>> +                             void *data)
>>  {
>> -       int event_cpu, ret = 0;
>> +       struct perf_read_data *prd = data;
>> +       struct pmu *pmu = event->pmu;
>> +       struct perf_event *sibling;
>>
>> -       /*
>> -        * If event is enabled and currently active on a CPU, update the
>> -        * value in the event structure:
>> -        */
>> -       if (event->state == PERF_EVENT_STATE_ACTIVE) {
>> -               struct perf_read_data data = {
>> -                       .event = event,
>> -                       .group = group,
>> -                       .ret = 0,
>> -               };
>> +       if (ctx->is_active & EVENT_TIME) {
>> +               update_context_time(ctx);
>> +               update_cgrp_time_from_cpuctx(cpuctx);
>> +       }
>>
>> -               event_cpu = READ_ONCE(event->oncpu);
>> -               if ((unsigned)event_cpu >= nr_cpu_ids)
>> -                       return 0;
>> +       perf_event_update_time(event);
>> +       if (prd->group)
>> +               perf_event_update_sibling_time(event);
>>
>> -               preempt_disable();
>> -               event_cpu = __perf_event_read_cpu(event, event_cpu);
>> +       if (event->state != PERF_EVENT_STATE_ACTIVE)
>> +               return;
>>
>> +       if (!prd->group) {
>> +               pmu->read(event);
>> +               prd->ret = 0;
>> +               return;
>> +       }
>> +
>> +       pmu->start_txn(pmu, PERF_PMU_TXN_READ);
>> +
>> +       pmu->read(event);
>> +       list_for_each_entry(sibling, &event->sibling_list, group_entry) {
>> +               if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
>> +                       /*
>> +                        * Use sibling's PMU rather than @event's since
>> +                        * sibling could be on different (eg: software) PMU.
>> +                        */
>> +                       sibling->pmu->read(sibling);
>> +               }
>> +       }
>> +
>> +       prd->ret = pmu->commit_txn(pmu);
>> +}
>> +
>> +static int perf_event_read(struct perf_event *event, bool group)
>> +{
>> +       struct perf_read_data prd = {
>> +               .event = event,
>> +               .group = group,
>> +               .ret = 0,
>> +       };
>> +
>> +       if (event->ctx->task) {
>> +               event_function_call(event, __perf_event_read, &prd);
>> +       } else {
>>                 /*
>> -                * Purposely ignore the smp_call_function_single() return
>> -                * value.
>> -                *
>> -                * If event_cpu isn't a valid CPU it means the event got
>> -                * scheduled out and that will have updated the event count.
>> -                *
>> -                * Therefore, either way, we'll have an up-to-date event count
>> -                * after this.
>> -                */
>> -               (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
>> -               preempt_enable();
>> -               ret = data.ret;
>> -       } else if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> -               struct perf_event_context *ctx = event->ctx;
>> -               unsigned long flags;
>> -
>> -               raw_spin_lock_irqsave(&ctx->lock, flags);
>> -               /*
>> -                * may read while context is not active
>> -                * (e.g., thread is blocked), in that case
>> -                * we cannot update context time
>> +                * For uncore events (which are per definition per-cpu)
>> +                * allow a different read CPU from event->cpu.
>>                  */
>> -               if (ctx->is_active) {
>> -                       update_context_time(ctx);
>> -                       update_cgrp_time_from_event(event);
>> -               }
>> -               if (group)
>> -                       update_group_times(event);
>> -               else
>> -                       update_event_times(event);
>> -               raw_spin_unlock_irqrestore(&ctx->lock, flags);
>> +               struct event_function_struct efs = {
>> +                       .event = event,
>> +                       .func = __perf_event_read,
>> +                       .data = &prd,
>> +               };
>> +               int cpu = __perf_event_read_cpu(event, event->cpu);
>> +
>> +               cpu_function_call(cpu, event_function, &efs);
>>         }
>>
>> -       return ret;
>> +       return prd.ret;
>>  }
>>
>>  /*
>> @@ -4388,7 +4201,7 @@ static int perf_release(struct inode *inode, struct file *file)
>>         return 0;
>>  }
>>
>> -u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>> +static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>>  {
>>         struct perf_event *child;
>>         u64 total = 0;
>> @@ -4416,6 +4229,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>>
>>         return total;
>>  }
>> +
>> +u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>> +{
>> +       struct perf_event_context *ctx;
>> +       u64 count;
>> +
>> +       ctx = perf_event_ctx_lock(event);
>> +       count = __perf_event_read_value(event, enabled, running);
>> +       perf_event_ctx_unlock(event, ctx);
>> +
>> +       return count;
>> +}
>>  EXPORT_SYMBOL_GPL(perf_event_read_value);
>>
>>  static int __perf_read_group_add(struct perf_event *leader,
>> @@ -4431,6 +4256,8 @@ static int __perf_read_group_add(struct perf_event *leader,
>>         if (ret)
>>                 return ret;
>>
>> +       raw_spin_lock_irqsave(&ctx->lock, flags);
>> +
>>         /*
>>          * Since we co-schedule groups, {enabled,running} times of siblings
>>          * will be identical to those of the leader, so we only publish one
>> @@ -4453,8 +4280,6 @@ static int __perf_read_group_add(struct perf_event *leader,
>>         if (read_format & PERF_FORMAT_ID)
>>                 values[n++] = primary_event_id(leader);
>>
>> -       raw_spin_lock_irqsave(&ctx->lock, flags);
>> -
>>         list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>>                 values[n++] += perf_event_count(sub);
>>                 if (read_format & PERF_FORMAT_ID)
>> @@ -4518,7 +4343,7 @@ static int perf_read_one(struct perf_event *event,
>>         u64 values[4];
>>         int n = 0;
>>
>> -       values[n++] = perf_event_read_value(event, &enabled, &running);
>> +       values[n++] = __perf_event_read_value(event, &enabled, &running);
>>         if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
>>                 values[n++] = enabled;
>>         if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
>> @@ -4897,8 +4722,7 @@ static void calc_timer_values(struct perf_event *event,
>>
>>         *now = perf_clock();
>>         ctx_time = event->shadow_ctx_time + *now;
>> -       *enabled = ctx_time - event->tstamp_enabled;
>> -       *running = ctx_time - event->tstamp_running;
>> +       __perf_update_times(event, ctx_time, enabled, running);
>>  }
>>
>>  static void perf_event_init_userpage(struct perf_event *event)
>> @@ -10516,7 +10340,7 @@ perf_event_exit_event(struct perf_event *child_event,
>>         if (parent_event)
>>                 perf_group_detach(child_event);
>>         list_del_event(child_event, child_ctx);
>> -       child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */
>> +       perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */
>>         raw_spin_unlock_irq(&child_ctx->lock);
>>
>>         /*
>> @@ -10754,7 +10578,7 @@ inherit_event(struct perf_event *parent_event,
>>               struct perf_event *group_leader,
>>               struct perf_event_context *child_ctx)
>>  {
>> -       enum perf_event_active_state parent_state = parent_event->state;
>> +       enum perf_event_state parent_state = parent_event->state;
>>         struct perf_event *child_event;
>>         unsigned long flags;
>>
>> @@ -11090,6 +10914,7 @@ static void __perf_event_exit_context(void *__info)
>>         struct perf_event *event;
>>
>>         raw_spin_lock(&ctx->lock);
>> +       ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>>         list_for_each_entry(event, &ctx->event_list, event_entry)
>>                 __perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
>>         raw_spin_unlock(&ctx->lock);