Re: [PATCH 34/37] perf hists browser: Support flat callchains

From: Namhyung Kim
Date: Wed Nov 25 2015 - 16:04:34 EST

Next message: Stephen Boyd: "[PATCH v2 1/3] ARM: qcom: Make an option for qcom clocksource platforms"
Previous message: Tony Lindgren: "Re: [PATCH 5/5] arm: boot: store ATAGs structure into DT "/chosen/linux,atags" entry"
In reply to: Arnaldo Carvalho de Melo: "Re: [PATCH 34/37] perf hists browser: Support flat callchains"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Arnaldo,

On Tue, Nov 24, 2015 at 11:10:25PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Nov 24, 2015 at 10:34:18PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Wed, Nov 25, 2015 at 10:26:08AM +0900, Namhyung Kim escreveu:
> > > On Tue, Nov 24, 2015 at 12:45:51PM -0200, Arnaldo Carvalho de Melo wrote:
> > > > Em Tue, Nov 24, 2015 at 02:27:08PM +0900, Namhyung Kim escreveu:
> > > > > On Mon, Nov 23, 2015 at 04:16:48PM +0100, Frederic Weisbecker wrote:
> > > > > > On Thu, Nov 19, 2015 at 02:53:20PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > > > > From: Namhyung Kim <namhyung@xxxxxxxxxx>
> > > > > > [...]
> > > > > Thus I simply copied callchain lists in parents to leaf nodes. Yes,
> > > > > it will consume some memory but can simplify the code.
> > > >
> > > > I haven't done any measuring, but I'm noticing that 'perf top -g' is
> > > > showing more warnings about not being able to process events fast enough
> > > > and so ends up losing events, I tried with --max-stack 16 and it helped,
> > > > this is just a heads up.
> > >
> > > OK, but it seems that it's not related to this patch since this patch
> > > only affects flat or folded callchain mode.
> >
> > Well, doesn't this patch makes some of the involved data structures
> > larger, thus putting more pressure on the L1 cache, etc? It may well be
> > related, but we need to measure.
> >
> > > > Perhaps my workstation workloads are gettning deeper callchains over
> > > > time, but perhaps this is the cost of processing callchains that is
> > > > increasing, I need to stop and try to quantify this.
> > > >
> > > > We really need to look at reducing the overhead of processing
> > > > callchains.
> > >
> > > Right, but with my multi-thread work, I realized that perf is getting
> > > heavier recently. I guess it's mostly due to the atomic refcount
> > > work. I need to get back to the multi-thread work..
> >
> > We really need to measure this ;-)
>
> So, something strange, if I use:
>
> [acme@zoo linux]$ cat ~/bin/allmod
> rm -rf ../build/allmodconfig/ ; mkdir ../build/allmodconfig/ ; make O=../build/allmodconfig/ allmodconfig ; make -j32 O=../build/allmodconfig
> [acme@zoo linux]$
>
> To generate background load, I don't see that much this:
>
> + 8.55% 8.49% libc-2ââWarning!âââââââââââââââââââââââââââââââââââââââââââââââ â
> + 7.08% 6.98% perf âEvents are being lost, check IO/CPU overload! â â
> + 6.84% 0.04% perf â â â
> + 6.01% 0.09% perf âYou may want to run 'perf' using a RT scheduler policy:â â
> + 5.26% 0.13% [kerneâ âtâ
> + 4.96% 1.50% perf â perf top -r 80 â â
> + 4.76% 3.58% perf â â â
> + 4.69% 0.05% perf âOr reduce the sampling frequency. â â
>
> Its with a low loadavg:
>
> [acme@zoo linux]$ cat /proc/loadavg
> 0.75 0.79 0.64 3/549 21259
>
> That it pops up :-\
>
> If I take a snapshot with 'P'
>
> # head -40 perf.hist.0
> + 21.43% 21.09% libc-2.20.so [.] _int_malloc
> + 19.49% 0.00% perf [.] cmd_top
> + 19.46% 0.02% perf [.] perf_top__mmap_read_idx
> + 19.03% 0.06% perf [.] hist_entry_iter__add
> + 16.46% 1.85% perf [.] iter_add_next_cumulative_entry
> + 12.09% 11.98% libc-2.20.so [.] free
> + 10.68% 10.61% libc-2.20.so [.] __libc_calloc
> + 9.61% 0.09% perf [.] hists__decay_entries
> + 8.92% 8.85% libc-2.20.so [.] malloc_consolidate
> + 8.17% 6.33% perf [.] free_callchain_node
> + 7.94% 0.09% perf [.] hist_entry__delete
> + 6.22% 0.03% perf [.] callchain_append
> + 6.20% 6.11% perf [.] append_chain_children
> + 5.44% 1.50% perf [.] __hists__add_entry
> + 4.34% 0.14% [kernel] [k] entry_SYSCALL_64_fastpath
> + 3.69% 3.67% perf [.] sort__dso_cmp
> + 3.12% 0.20% perf [.] hists__output_resort
> + 2.88% 0.00% [unknown] [.] 0x6d86258d4c544155
> + 2.88% 0.00% libc-2.20.so [.] __libc_start_main
> + 2.88% 0.00% perf [.] main
> + 2.88% 0.00% perf [.] run_builtin
> + 2.66% 0.00% libpthread-2.20.so [.] start_thread
> + 2.66% 0.00% perf [.] display_thread_tui
> + 2.66% 0.00% perf [.] perf_evlist__tui_browse_hists
> + 2.66% 0.00% perf [.] perf_evsel__hists_browse
> + 2.49% 0.07% [kernel] [k] sys_futex
> + 2.42% 0.06% [kernel] [k] do_futex
> 2.24% 0.00% perf [.] perf_top__sort_new_samples
> + 1.92% 0.51% perf [.] hists__collapse_resort
> + 1.87% 1.86% perf [.] hpp__sort_overhead_acc
> + 1.69% 0.09% libc-2.20.so [.] __lll_unlock_wake_private
> + 1.45% 1.44% perf [.] hpp__nop_cmp
> 1.45% 1.43% perf [.] rb_erase
> + 1.44% 1.42% perf [.] __sort__hpp_cmp
> 1.31% 0.16% libc-2.20.so [.] __lll_lock_wait_private
> 1.18% 0.19% [kernel] [k] futex_wake
> + 1.13% 1.13% perf [.] sort__sym_cmp
> 1.11% 0.02% [kernel] [k] schedule
> 1.09% 0.22% [kernel] [k] __schedule
> 0.99% 0.08% [kernel] [k] futex_wait
>
> So its quite a lot of mallocs, i.e. just do a 'perf top -g' and wait a
> bit, malloc goes on bubbling up to the top, about the same time it
> starts showing that popup screen telling that we're losing events.
>
> If I use --no-children, to see if there is a difference, using either
> --call-graph caller or callee, it doesn't get more than about 1%.
>
> Ok, now I tried with "perf top --call-graph caller" i.e. with
> --children, and looked at the _int_malloc callchains I get really long,
> bogus callchains, see below. That explains why I don't lose events when
> I use --max-stack.
>
> I'll have to stop now, and I put the full perf.hist.1 at
> http://vger.kernel.org/~acme/perf/perf.hist.1.xz
>
> - Arnaldo
>
> [root@zoo ~]# head -60 perf.hist.1
> - 17.92% 17.10% libc-2.20.so [.] _int_malloc
> + 112.80% _int_malloc
> 11.14% 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068

Hmm.. we should cut off the loop in the broken callchains. And it'd
be good to apply --hide-unresolved to callchains as well. I'll take a
look at it.

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Stephen Boyd: "[PATCH v2 1/3] ARM: qcom: Make an option for qcom clocksource platforms"
Previous message: Tony Lindgren: "Re: [PATCH 5/5] arm: boot: store ATAGs structure into DT "/chosen/linux,atags" entry"
In reply to: Arnaldo Carvalho de Melo: "Re: [PATCH 34/37] perf hists browser: Support flat callchains"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]