Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

From: Alexei Starovoitov
Date: Mon Apr 25 2016 - 17:59:58 EST


On Mon, Apr 25, 2016 at 05:17:50PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Apr 25, 2016 at 01:06:48PM -0700, Alexei Starovoitov escreveu:
> > On Mon, Apr 25, 2016 at 04:22:29PM -0300, Arnaldo Carvalho de Melo wrote:
> > > Em Mon, Apr 25, 2016 at 01:27:06PM -0300, Arnaldo Carvalho de Melo escreveu:
> > > > Em Mon, Apr 25, 2016 at 01:14:25PM -0300, Arnaldo Carvalho de Melo escreveu:
> > > > > Em Fri, Apr 22, 2016 at 03:18:08PM -0700, Alexei Starovoitov escreveu:
> > > > > > right... and looking into it further, realized that the patch is broken,
> > > > > > since get_callchain_entry() is doing:
> > > > > > return &entries->cpu_entries[cpu][*rctx];
> > > > > > whereas it should be dynamic offset based on sysctl_perf_event_max_stack*8
> > > > > > So definitely needs another respin.
> > >
> > > > struct perf_callchain_entry {
> > > > __u64 nr;
> > > > __u64 ip[0]; /* /proc/sys/kernel/perf_event_max_stack */
> > > > };
> > >
> > > > But perf_callchain_entry->ip is not a pointer... Got it ;-\
> > >
> > > This is what I am building now, a patch on top of the previous, that
> > > will be combined and sent as v3, if I don't find any more funnies:
> > >
> > > diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> > > index 6fe77349fa9d..40657892a7e0 100644
> > > --- a/kernel/events/callchain.c
> > > +++ b/kernel/events/callchain.c
> > > @@ -20,11 +20,10 @@ struct callchain_cpus_entries {
> > >
> > > int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
> > >
> > > -static size_t perf_callchain_entry__sizeof(void)
> > > -{
> > > - return sizeof(struct perf_callchain_entry) +
> > > - sizeof(__u64) * sysctl_perf_event_max_stack;
> > > -}
> > > +#define __perf_callchain_entry_size(entries) \
> > > + (sizeof(struct perf_callchain_entry) + sizeof(__u64) * entries)
> > > +
> > > +static size_t perf_callchain_entry_size __read_mostly = __perf_callchain_entry_size(PERF_MAX_STACK_DEPTH);
> > >
> > > static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
> > > static atomic_t nr_callchain_events;
> > > @@ -81,7 +80,7 @@ static int alloc_callchain_buffers(void)
> > > if (!entries)
> > > return -ENOMEM;
> > >
> > > - size = perf_callchain_entry__sizeof() * PERF_NR_CONTEXTS;
> > > + size = perf_callchain_entry_size * PERF_NR_CONTEXTS;
> > >
> > > for_each_possible_cpu(cpu) {
> > > entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,
> > > @@ -155,7 +154,8 @@ static struct perf_callchain_entry *get_callchain_entry(int *rctx)
> > >
> > > cpu = smp_processor_id();
> > >
> > > - return &entries->cpu_entries[cpu][*rctx];
> > > + return (((void *)&entries->cpu_entries[cpu][0]) +
> > > + (*rctx * perf_callchain_entry_size));
> >
> > I think the following would be easier to read:
> >
> > + return (void *)entries->cpu_entries[cpu] +
> > + *rctx * perf_callchain_entry_size;
>
> Well, I thought that multiline expressions required parentheses, to make
> them easier to read for someone, maybe Ingo? ;-)
>
> Whatever, both generate the same result, really want me to change this?

I was mainly complaining about unnecessary &..[0]

> > ?
> > if didn't mixed up the ordering...
>
> If you are not sure, then its not clearer, huh? ;-P

matter of taste. No strong opinion

> > and probably we could do the math on the spot instead of introducing
> > perf_callchain_entry_size global variable?
>
> I was trying to avoid the calc for each alloc, just doing it when we
> change that number via the sysctl, probably not that big a deal, do you
> really think we should do the math per-alloc instead of
> per-sysctl-changing?

The math is trivial:
#define __perf_callchain_entry_size(entries) \
(sizeof(struct perf_callchain_entry) + sizeof(__u64) * entries)
shift and add after load is almost the same speed as load, since
integer multiply right after is dominating the cost.
whereas updating two global variables can potentially cause
race conditions that need to be analyzed.