Re: [RFC v2] perf: Rewrite core context handling
From: Ravi Bangoria
Date: Wed Aug 24 2022 - 01:08:00 EST
On 23-Aug-22 2:27 PM, Peter Zijlstra wrote:
> On Tue, Aug 02, 2022 at 11:46:32AM +0530, Ravi Bangoria wrote:
>> On 13-Jun-22 8:13 PM, Peter Zijlstra wrote:
>>> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>
>>>> +static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
>>>> {
>>>> + struct perf_event_pmu_context *pmu_ctx;
>>>> int can_add_hw = 1;
>>>>
>>>> - if (ctx != &cpuctx->ctx)
>>>> - cpuctx = NULL;
>>>> -
>>>> - visit_groups_merge(cpuctx, &ctx->pinned_groups,
>>>> - smp_processor_id(),
>>>> - merge_sched_in, &can_add_hw);
>>>> + if (pmu) {
>>>> + visit_groups_merge(ctx, &ctx->pinned_groups,
>>>> + smp_processor_id(), pmu,
>>>> + merge_sched_in, &can_add_hw);
>>>> + } else {
>>>> + /*
>>>> + * XXX: This can be optimized for per-task context by calling
>>>> + * visit_groups_merge() only once with:
>>>> + * 1) pmu=NULL
>>>> + * 2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
>>>> + * 3) Making can_add_hw a per-pmu variable
>>>> + *
>>>> + * Though, it cannot be optimized for per-cpu context because the
>>>> + * per-cpu rb-tree consists of pmu-subtrees and pmu-subtrees
>>>> + * consist of cgroup-subtrees, i.e. cgroup events of the same
>>>> + * cgroup but different pmus are separated out into respective
>>>> + * pmu-subtrees.
>>>> + */
>>>> + list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>>>> + can_add_hw = 1;
>>>> + visit_groups_merge(ctx, &ctx->pinned_groups,
>>>> + smp_processor_id(), pmu_ctx->pmu,
>>>> + merge_sched_in, &can_add_hw);
>>>> + }
>>>> + }
>>>> }
>>>
>>> I'm not sure I follow.. task context can have multiple PMUs just the
>>> same as CPU context can, that's more or less the entire point of the
>>> patch.
>>
>> The current rbtree key is {cpu, cgroup_id, group_idx}. However, the effective
>> key for a task-specific context is {cpu, group_idx} because cgroup_id is always
>> 0, and the effective key for a cpu-specific context is {cgroup_id, group_idx}
>> because the cpu is the same for the entire rbtree.
>>
>> With the new design, the rbtree key will be {cpu, pmu, cgroup_id, group_idx}.
>> But as explained above, the effective key for a task-specific context will be
>> {cpu, pmu, group_idx}. Thus, we can handle pmu=NULL in visit_groups_merge(),
>> same as you did in the very first RFC[1]. (This may make things more
>> complicated though, because we might also need to increase the heap size to
>> accommodate all pmu events in a single heap. The current heap size is 2 for a
>> task-specific context, which is sufficient if we iterate over all pmus.)
>>
>> The same optimization won't work for a cpu-specific context because its
>> effective key would be {pmu, cgroup_id, group_idx}, i.e. each pmu subtree is
>> made up of cgroup subtrees.
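
To make the "effective key" point above concrete, here is a stand-alone toy
sketch of the new comparison order. The struct and field names are illustrative
only, not the real struct perf_event / perf_event_pmu_context layout:

#include <stdint.h>

/* Toy model only; key order is {cpu, pmu, cgroup_id, group_index}. */
struct toy_event {
	int		cpu;		/* -1 for "any cpu" per-task events */
	uintptr_t	pmu;		/* address of struct pmu, used as key */
	uint64_t	cgroup_id;	/* always 0 for per-task events */
	uint64_t	group_index;	/* insertion order, breaks ties */
};

static int toy_group_cmp(const struct toy_event *a, const struct toy_event *b)
{
	if (a->cpu != b->cpu)
		return a->cpu < b->cpu ? -1 : 1;

	if (a->pmu != b->pmu)
		return a->pmu < b->pmu ? -1 : 1;

	/*
	 * Task context: cgroup_id is always 0, so the effective key
	 * degenerates to {cpu, pmu, group_index}.
	 * CPU context: cpu is constant, so the effective key is
	 * {pmu, cgroup_id, group_index}, i.e. each pmu subtree contains
	 * cgroup subtrees.
	 */
	if (a->cgroup_id != b->cgroup_id)
		return a->cgroup_id < b->cgroup_id ? -1 : 1;

	if (a->group_index != b->group_index)
		return a->group_index < b->group_index ? -1 : 1;

	return 0;
}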
>
> Agreed, new order is: {cpu, pmu, cgroup_id, group_idx}
>
> Event scheduling looks at the {cpu, pmu, cgroup_id} subtree to find the
> leftmost group_idx event to schedule next.
>
> However, since cgroup events are per-cpu events, per-task events will
> always have cgroup=NULL. Resulting in the subtrees:
>
> {-1, pmu, NULL} and {cpu, pmu, NULL}
>
> Which is what the code does, it iterates ctx->pmu_ctx_list to find all
> @pmu values and then for each does the schedule dance.
>
> Now, I suppose making that:
>
> {-1, NULL, NULL}, {cpu, NULL, NULL}
>
> could work, but wouldn't iterating the tree be more expensive than
> just finding the sub-trees as we do now?

pmu=NULL can be used while scheduling the entire context. We can just traverse
all pmu events of both cpu subtrees.
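
Roughly what I have in mind, as a toy sketch again (a sorted array stands in
for the rbtree, and the helpers are hypothetical): pmu=NULL ("all pmus") walks
every event of both cpu subtrees in key order, and only can_add_hw needs to
become per-pmu state:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_event { int cpu; uintptr_t pmu; uint64_t cgroup_id; uint64_t group_index; };

/* Hypothetical per-pmu "can_add_hw" lookup and sched-in helpers. */
static bool toy_can_add_hw(uintptr_t pmu) { (void)pmu; return true; }
static void toy_sched_in(struct toy_event *event) { (void)event; }

/*
 * events[] is assumed sorted by {cpu, pmu, cgroup_id, group_index}.
 * pmu == 0 means "all pmus": visit every event in the {cpu == -1} and
 * {cpu == this_cpu} subtrees in a single pass.
 */
static void toy_pinned_sched_in(struct toy_event *events, size_t nr,
				int this_cpu, uintptr_t pmu)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		struct toy_event *e = &events[i];

		if (e->cpu != -1 && e->cpu != this_cpu)
			continue;
		if (pmu && e->pmu != pmu)
			continue;
		if (!toy_can_add_hw(e->pmu))
			continue;

		toy_sched_in(e);
	}
}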
>
> You also talk about extending the heap, which I read like
> doing the heap-merge over:
>
> {-1, pmu0, NULL}, {-1, pmu1, NULL}, ...
> {cpu, pmu0, NULL}, ...
>
> But that doesn't make sense, the schedule dance is per-pmu.
>
> Or am I just still not getting it?

Ok. Let's not complicate the design. We can go with the current approach of
iterating over all pmus in the first phase and think about optimizing it
later.

Thanks,
Ravi