Re: [PATCH] perf_events: fix and improve x86 event scheduling

From: Stephane Eranian
Date: Mon Nov 14 2011 - 07:55:14 EST


Robert,



On Thu, Nov 10, 2011 at 7:03 PM, Robert Richter <robert.richter@xxxxxxx> wrote:
> On 07.11.11 06:01:49, Stephane Eranian wrote:
>>
>> Major rewrite of the x86 event scheduling routine.
>> The routine is shared between AMD and Intel.
>>
>> The existing code had an issue with certain combinations
>> of constraints. As Robert Richter @ AMD demonstrated on
>> AMD Bulldozer:
>>
>> e0 [3,0]
>> e1 [2:0]
>> e2 [2:0]
>> e3 [2:0]
>>
>> With this list of events, the existing scheduling algorithm
>> would never schedule e3. That's because it would always choose
>> counter 0 for e0:
>>   e0 -> 0
>>   e1 -> 1
>>   e2 -> 2
>>   e3 -> X
>>
>> Robert Richter proposed a fix to the algorithm which was based
>> on tagging constraints that overlap. He was using rollbacks to
>> simulate recursion.
>>
>> We propose an alternate solution which is simpler and faster. It
>> is a small modification to the existing algorithm. It adds some
>> smarts to how a counter is chosen for a given event. The first
>
> Posting a link to my patch set for reference:
>
> https://lkml.org/lkml/2011/9/10/51
>
> What are the reasons for your alternate solution? Is it recursion,
> code complexity, or a use case it does not handle? I see recursion
> as the main concern with my patch set, but its impact is known and
> limited. Anyway, an algorithm without recursion would generally be
> better.
>
>> available counter is not systematically selected. Instead, the
>> algorithm checks how the constraints between the events overlap.
>> Each counter is assigned a weight: the number of events in the
>> set which it can count. Each event is then assigned the eligible
>> counter with the smallest weight first.
>
> But this algorithm does not work for all cases and does not solve the
> problem in general. Your idea to have weights for counters might be
> the right approach.
>
> E.g. the algorithm fails with (all weights are the same):
>
>      c0  c1  c2
>  e0   x       x
>  e1       x   x
>  e2   x   x
>
> ... leading to:
>
>  e0 -> c0
>  e1 -> c1
>  e2 -> X
>
> You basically have to recalculate the weights after you have
> assigned a counter.
>
Yes, it does not yield an assignment which maximizes the PMU usage.

I have been talking with co-workers who are experts in operations
research about our problem. They all pointed me to the Ford-Fulkerson
max-flow algorithm (search for it on Wikipedia). I think it solves
both the recursion and the complexity problems: the worst-case cost
is bounded and well understood.

I started experimenting with this algorithm and will report back in a
few days. One thing is for sure: it does provide the 'maximized'
answer for your example above and also for your initial Bulldozer
example.


> But even if we do this, I am still not sure if that would cover all
> cases.
>
>>
>> In the example above, the new algorithm assigns:
>>   e0 -> 3
>>   e1 -> 0
>>   e2 -> 1
>>   e3 -> 2
>>
>> Because:
>>   counter 0, weight = 4
>>   counter 1, weight = 3
>>   counter 2, weight = 3
>>   counter 3, weight = 1
>>
>> We maximize PMU usage and ensure all events are measured.
>>
>> The patch also restructures the code to separate scheduling of
>> constrained vs. unconstrained events. An unconstrained event is
>> one that can be programmed into any of the generic counters. For
>> those, we now use the simplest algorithm possible: use the next
>> free counter. This provides a fast path which is beneficial on
>> processors such as AMD64 Fam10h.
>
> I don't see the need for a differentiation between constrained and
> unconstrained events. If the loops in the constrained path are
> optimized, there is not much overhead anymore. This could be done by
> specifying min and max limits for counter ranges. Special cases
> (unconstrained events, fixed counters, etc.) make the code more
> complex. I don't think a good algorithm will need them.
>
>> @@ -553,42 +530,179 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>>       if (x86_pmu.num_counters_fixed)
>>               wmax++;
>>
>> -     for (w = 1, num = n; num && w <= wmax; w++) {
>> -             /* for each event */
>> -             for (i = 0; num && i < n; i++) {
>> -                     c = constraints[i];
>> -                     hwc = &cpuc->event_list[i]->hw;
>> +     /*
>> +      * assign from most constrained to least constrained
>> +      */
>> +     for (w = 1, num = num_c; num && w <= wmax; w++) {
>> +             /* for each constrained event */
>> +             for (i = 0, e = c_events; i < num_c; i++, e++) {
>> +
>> +                     map_idx = (*e)->hw.map_idx;
>> +                     c = constraints[map_idx];
>>
>>                       if (c->weight != w)
>>                               continue;
>>
>> +                     min_wcnt = INT_MAX;
>
> Should be X86_PMC_IDX_MAX.
>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 1e9ebe5..2605244 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -564,6 +564,7 @@ struct hw_perf_event {
>>                     int             idx;
>>                     int             last_cpu;
>>                     struct hw_perf_event_extra extra_reg;
>> +                   int             map_idx;
>
> This is not the right place. The field would be visible to all
> architectures but is only used locally. A local array in
> x86_schedule_events() would work too.
>
> -Robert
>
>>               };
>>               struct { /* software */
>>                       struct hrtimer  hrtimer;
>>
>
> --
> Advanced Micro Devices, Inc.
> Operating System Research Center
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/