Re: [PATCH v5 0/2] perf: use hrtimer for event multiplexing

From: Frederic Weisbecker
Date: Fri Mar 22 2013 - 09:54:22 EST


2013/3/22 Stephane Eranian <eranian@xxxxxxxxxx>:
> The current scheme of using the timer tick was fine
> for per-thread events. However, it was causing
> bias issues in system-wide mode (including for
> uncore PMUs). Event groups would not get their
> fair share of runtime on the PMU. With tickless
> kernels, if a core is idle there is no timer tick,
> and thus no event rotation (multiplexing). However,
> there are events (especially uncore events) which do
> count even though cores are asleep.
>
> This patch changes the timer source for multiplexing.
> It introduces a per-cpu hrtimer. The advantage is that
> even when the core goes idle, it will come back to
> service the hrtimer, thus multiplexing on system-wide
> events works much better.
>
> In order to minimize the impact of the hrtimer, it
> is turned on and off on demand. When the PMU on
> a CPU is overcommitted, the hrtimer is activated.
> It is stopped when the PMU is not overcommitted.
>
> In order for this to work properly with HOTPLUG_CPU,
> we had to change the order of initialization in
> start_kernel() such that hrtimer_init() is run
> before perf_event_init().
>
> The second patch provides a sysctl control to
> adjust the multiplexing interval. Unit is
> milliseconds.
>
> Here is a simple before/after example with
> two event groups which do require multiplexing.
> This is done in system-wide mode on an idle
> system. What matters here is the scaling factor
> in [], not the total counts.
>
> Before:
>
> # perf stat -a -e ref-cycles,ref-cycles sleep 10
> Performance counter stats for 'sleep 10':
> 34,319,545 ref-cycles [56.51%]
> 31,917,229 ref-cycles [43.50%]
>
> 10.000827569 seconds time elapsed
>
> After:
> # perf stat -a -e ref-cycles,ref-cycles sleep 10
> Performance counter stats for 'sleep 10':
> 11,144,822,193 ref-cycles [50.00%]
> 11,103,760,513 ref-cycles [50.00%]
>
> 10.000672946 seconds time elapsed
>
> What matters here is the 50% not the actual
> count. Ref-cycles runs only on one fixed counter.
> With two instances, each should get 50% of the PMU
> which is now true. This helps mitigate the error
> introduced by the scaling.
>
> In this second version of the patchset, we now
> have the hrtimer_interval per PMU instance. The
> tunable is in /sys/devices/XXX/mux_interval_ms,
> where XXX is the name of the PMU instance. Due
> to initialization changes of each hrtimer, we
> had to introduce hrtimer_init_cpu() to initialize
> a hrtimer from another CPU.
>
> In the 3rd version, we simplify the code a bit
> by using hrtimer_active(). We stopped using
> the rotation_list for perf_cpu_hrtimer_cancel().
> We also fix an initialization problem.
>
> In the 4th version, we rebase to 3.8.0-rc7 and
> we kept SW events on the rotation list which is
> now used only for unthrottling. We also renamed
> the sysfs tunable to perf_event_mux_interval_ms
> to be more consistent with the existing sysctl
> entries.
>
> In the 5th version, we modified the code such
> that a new hrtimer interval is applied immediately
> to any active hrtimer as suggested by Jiri Olsa.
> We also got rid of the CPU notifier for the hrtimer, as it
> was useless and unreliable. The code is rebased to
> 3.9.0-rc3.
>
> Signed-off-by: Stephane Eranian <eranian@xxxxxxxxxx>

And I have to say this patch is going to be very useful for the full
dynticks tree. We are happy to get rid of that tick hook.
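
For anyone reading the numbers in [] above: as far as I understand it,
that percentage is the share of the measurement window during which the
event was actually scheduled on the PMU, and perf extrapolates the raw
count to the full window from it. Below is a minimal sketch of that
arithmetic. The helper names (mux_running_pct, mux_scaled_count) are
made up for illustration and are not functions from the kernel or the
perf tool; time_enabled and time_running stand for the two times that
perf reports alongside a count.

```c
#include <stdint.h>

/*
 * Illustrative helpers only: these names are hypothetical and do not
 * exist in the kernel or in the perf tool.
 *
 * time_enabled: how long the event was enabled (ns)
 * time_running: how long it was actually counting on the PMU (ns)
 */

/* Share of the enabled window the event spent on the PMU, in percent. */
static double mux_running_pct(uint64_t time_enabled, uint64_t time_running)
{
	if (time_enabled == 0)
		return 0.0;
	return 100.0 * (double)time_running / (double)time_enabled;
}

/* Extrapolate a raw count to the whole enabled window. */
static uint64_t mux_scaled_count(uint64_t raw, uint64_t time_enabled,
				 uint64_t time_running)
{
	if (time_running == 0)
		return 0;	/* never scheduled: nothing to extrapolate from */
	return (uint64_t)((double)raw * (double)time_enabled /
			  (double)time_running);
}
```

With two groups sharing one fixed counter over a 10 second window, fair
rotation gives each about 5 seconds of running time, so
mux_running_pct() returns 50 and the raw count is doubled. The further
the running time drifts from the intended share, the more the scaled
value depends on extrapolation.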