Re: [PATCH] perf/core: Fix endless multiplex timer

From: Liang, Kan
Date: Tue Mar 03 2020 - 20:40:14 EST




On 3/3/2020 4:08 PM, Peter Zijlstra wrote:
On Tue, Mar 03, 2020 at 12:28:19PM -0800, kan.liang@xxxxxxxxxxxxxxx wrote:
From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>

A lot of time are spent in writing uncore MSRs even though no perf is
running.

4.66% swapper [kernel.kallsyms] [k] native_write_msr
|
--4.56%--native_write_msr
|
|--1.68%--snbep_uncore_msr_enable_box
| perf_mux_hrtimer_handler
| __hrtimer_run_queues
| hrtimer_interrupt
| smp_apic_timer_interrupt
| apic_timer_interrupt
| cpuidle_enter_state
| cpuidle_enter
| do_idle
| cpu_startup_entry
| start_kernel
| secondary_startup_64

The root cause is that multiplex timer was not stopped when perf stat
finished.
Current perf relies on rotate_necessary to determine whether the
multiplex timer should be stopped. The variable only be reset in
ctx_sched_out(), which is not enough for system-wide event.
Perf stat invokes PERF_EVENT_IOC_DISABLE to stop system-wide event
before closing it.
perf_ioctl()
perf_event_disable()
event_sched_out()
The rotate_necessary will never be reset.

The issue is a generic issue, not just impact the uncore.

Check whether we had been multiplexing. If yes, reset rotate_necessary
for the last active event in __perf_event_disable().

Fixes: fd7d55172d1e ("perf/cgroups: Don't rotate events for cgroups unnecessarily")
Reported-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
Reviewed-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
Signed-off-by: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
---
kernel/events/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3f1f77de7247..50688de56181 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2242,6 +2242,16 @@ static void __perf_event_disable(struct perf_event *event,
update_cgrp_time_from_event(event);
}
+ /*
+ * If we had been multiplexing,
+ * stop the rotations for the last active event.
+ * Only need to check system wide events.
+ * For task events, it will be checked in ctx_sched_out().
+ */
+ if ((cpuctx->ctx.nr_events != cpuctx->ctx.nr_active) &&
+ (cpuctx->ctx.nr_active == 1))
+ cpuctx->ctx.rotate_necessary = 0;
+
if (event == event->group_leader)
group_sched_out(event, cpuctx, ctx);
else


I'm thinking this is wrong.

That is, yes, this fixes the observed problem, but it also misses at
least one other site. Which seems to suggest we ought to take a
different approach.

But even with that; I wonder if the actual condition isn't wrong.
Suppose the event was exclusive, and other events weren't scheduled
because of that. Then you disable the one exclusive event _and_ kill
rotation, so then nothing else will ever get on.

So what I think was supposed to happen is rotation killing itself;
rotation will schedule out the context -- which will clear the flag, and
then schedule the thing back in -- which will set the flag again when
needed.

Now, that isn't happening, and I think I see why, because when we drop
to !nr_active, we terminate ctx_sched_out() before we get to clearing
the flag, oops!

So how about something like this?

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e453589da97c..7947bd3271a9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2182,6 +2182,7 @@ __perf_remove_from_context(struct perf_event *event,
if (!ctx->nr_events && ctx->is_active) {
ctx->is_active = 0;
+ ctx->rotate_necessary = 0;
if (ctx->task) {
WARN_ON_ONCE(cpuctx->task_ctx != ctx);
cpuctx->task_ctx = NULL;


The patch can fix the observed problem with uncore PMU.
But it cannot fix all the cases with core PMU, especially when NMI watchdog is enabled.
Because the ctx->nr_events never be 0 with NMI watchdog enabled.

For example,
$echo 1 > /proc/sys/kernel/nmi_watchdog
$sudo perf stat -e ref-cycles,ref-cycles -a sleep 1
Performance counter stats for 'system wide':
549,432,900 ref-cycles (50.01%)
545,358,967 ref-cycles (49.99%)
1.002367894 seconds time elapsed
$perf probe --add perf_rotate_context
$perf record -e probe:perf_rotate_context -aR -g sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.896 MB perf.data (1750 samples) ]
$perf report --stdio
100.00% 100.00% (ffffffffa5604e84)
|
--99.83%--0xffffffffa54000d4
start_secondary
cpu_startup_entry
do_idle
call_cpuidle
cpuidle_enter
cpuidle_enter_state
apic_timer_interrupt
smp_apic_timer_interrupt
hrtimer_interrupt
__hrtimer_run_queues
perf_mux_hrtimer_handler

Thanks,
Kan

@@ -3074,15 +3075,15 @@ static void ctx_sched_out(struct perf_event_context *ctx,
is_active ^= ctx->is_active; /* changed bits */
- if (!ctx->nr_active || !(is_active & EVENT_ALL))
- return;
-
/*
* If we had been multiplexing, no rotations are necessary, now no events
* are active.
*/
ctx->rotate_necessary = 0;
+ if (!ctx->nr_active || !(is_active & EVENT_ALL))
+ return;
+
perf_pmu_disable(ctx->pmu);
if (is_active & EVENT_PINNED) {
list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)