Re: [PATCH 1/2] perf_events: add support for per-cpu per-cgroupmonitoring (v5)

From: Stephane Eranian
Date: Thu Nov 25 2010 - 16:32:29 EST


On Thu, Nov 25, 2010 at 4:02 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Thu, 2010-11-25 at 15:51 +0100, Stephane Eranian wrote:
>>
>>
>> On Thu, Nov 25, 2010 at 12:20 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> Â Â Â Â On Thu, 2010-11-18 at 12:40 +0200, Stephane Eranian wrote:
>> Â Â Â Â > @@ -919,6 +945,10 @@ static inline void perf_event_task_sched_in(struct task_struct *task)
>> Â Â Â Â > Âstatic inline
>> Â Â Â Â > Âvoid perf_event_task_sched_out(struct task_struct *task, struct task_struct *next)
>> Â Â Â Â > Â{
>> Â Â Â Â > +#ifdef CONFIG_CGROUPS
>> Â Â Â Â > + Â Â Â atomic_t *cgroup_events = &__get_cpu_var(perf_cgroup_events);
>> Â Â Â Â > + Â Â Â COND_STMT(cgroup_events, perf_cgroup_switch(task, next));
>> Â Â Â Â > +#endif
>> Â Â Â Â > Â Â Â Â COND_STMT(&perf_task_events, __perf_event_task_sched_out(task, next));
>> Â Â Â Â > Â}
>>
>>
>> Â Â Â Â I don't think that'll actually work, the jump label stuff needs a static
>> Â Â Â Â address.
>>
>> I did not know that.
>
> Yeah, its unfortunate the fallback code doesn't mandate this :/
>
>
>> Â Â Â Â Why not simply: s/perf_task_events/perf_sched_events/ and
>> Â Â Â Â increment it
>> Â Â Â Â for cgroup events as well?
>>
>> But you would need to demultiplex. that's not because perf_sched_events is
>> set that you want BOTH perf_cgroup_switch() AND perf_event_task_sched_out().
>
> The main purpose of the jump-label stuff is to optimize the function
> call and conditional into the perf code away, the moment we a function
> call we might as well do everything, at that point its only a single
> conditional.
>
> Jump labels are supposed to work like (they don't actually work like
> this yet):
>
> my_func:
> Â Âasm-foo
> addr_of_nop:
> Â Ânop5
> after_nop:
> Â Âmore-asm-foo
> Â Âiret
>
> out_of_line:
> Â Âdo-special-foo
> Â Âjmp after_nop
>
>
> We then keep a section of tuples:
>
> __jump_labels:
>
> Â&perf_task_events,addr_of_nop
>
> Then when we flip perf_task_events from 0 -> !0 we rewrite the nop5 at
> addr_of_nop to "jmp out_of_line" (5 bytes on x86, hence nop5), or the
> reverse on !0 -> 0.
>
>
> So 1) we need the 'key' (&perf_task_events) to be a static address
> because the compiler needs to place the address in the special section
> -- otherwise we can never find the nop location again, this also means
> per-cpu variables don't make sense, there's only 1 copy of the text.
>
> and 2) the moment we take the out-of-line branch we incur the icache hit
> and already set up a call, so optimizing away another conditional at the
> cost of an extra function call doesn't really make sense.
>
>
Ok, I understand. Thanks for the explanation.

So perf_sched_events would indicate that there may be some perf
ctxsw work to do. It would set as soon as there is at least one
event defined and not just a per-thread event.

BTW, I suspect, isn't there a bug with
PERF_COUNT_SW_CONTEXT_SWITCHES not being updated
if you don't have at least one per-thread event. You need to
hoist that perf_sw_event() into the header file before the COND().

Then, __perf_event_task_sched_out() would be modified
to do some more checks.

__perf_event_task_sched_out(task, next)
{
int ctxn;

perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0);

for_each_task_context_nr(ctxn)
perf_event_context_sched_out(task, ctxn, next);
}


void __perf_event_task_sched_out(struct task_struct *task,
struct task_struct *next)
{
int ctxn;
int cgroup_events;

perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0);

croup_events = &__get_cpu_var(perf_cgroup_events);
if (cgroup_events)
perf_cgroup_switch(task, next);

if (perf_task_events)
for_each_task_context_nr(ctxn)
perf_event_context_sched_out(task, ctxn, next);
}

That implies you need to maintain:
- perf_sched_events: number of active events system-wide
- perf_task_events: number of per-thread events system-wide
- cgroup_events: per-cpu variable representing the number of cgroup
events on a CPU

Is that what you are thinking?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/