Re: [GIT PULL] perf changes for v3.12

From: Namhyung Kim
Date: Tue Sep 10 2013 - 04:06:18 EST


Hi,

On Thu, 5 Sep 2013 14:42:44 +0200, Frederic Weisbecker wrote:
> On Thu, Sep 05, 2013 at 12:56:39PM +0200, Ingo Molnar wrote:
>>
>> (Cc:-ed Frederic and Namhyung as well, it's about bad overhead in
>> tools/perf/util/hist.c.)
>>
>> * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> > On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>> > >
>> > > Please pull the latest perf-core-for-linus git tree from:
>> >
>> > I don't think this is new at all, but I just tried to do a perf
>> > record/report of "make -j64 test" on git:
>> >
>> > It's a big perf.data file (1.6G), but after it has done the
>> > "processing time ordered events" thing it results in:
>> >
>> > ┌─Warning:───────────────────────────────────┐
>> > │Processed 8672030 events and lost 71 chunks!│
>> > │Check IO/CPU overload!                      │
>> > │                                            │
>> > │                                            │
>> > │Press any key...                            │
>> > └────────────────────────────────────────────┘
>> >
>> > and then it just hangs using 100% CPU time. Pressing any key doesn't
>> > do anything.
>> >
>> > It may well still be *doing* something, and maybe it will come back
>> > some day with results. But it sure doesn't show any indication that it
>> > will.
>> >
>> > Try this (in a current git source tree: note, by "git" I actually mean
>> > git itself, not some random git repository):
>> >
>> > perf record -g -e cycles:pp make -j64 test >& out
>> > perf report
>> >
>> > maybe you can reproduce it.
>>
>> I managed to reproduce it on a 32-way box via:
>>
>> perf record -g make -j64 bzImage >/dev/null 2>&1
>>
>> It's easier to debug it without the TUI:
>>
>> perf --no-pager report --stdio
>>
>> It turns out that even with a 400 MB perf.data the 'perf report' call will
>> eventually finish - here it ran for almost half an hour(!) on a fast box.
>>
>> Arnaldo, the large overhead is in hists__collapse_resort(), in particular
>> it's doing append_chain_children() 99% of the time:
>>
>>     -  99.74%  perf  perf               [.] append_chain_children
>>        - append_chain_children
>>           - 99.76% merge_chain_branch
>>              - merge_chain_branch
>>                + 98.04% hists__collapse_resort
>>                +  1.96% merge_chain_branch
>>     +   0.05%  perf  perf               [.] merge_chain_branch
>>     +   0.03%  perf  libc-2.17.so       [.] _int_free
>>     +   0.03%  perf  libc-2.17.so       [.] __libc_calloc
>>     +   0.02%  perf  [kernel.kallsyms]  [k] account_user_time
>>     +   0.02%  perf  libc-2.17.so       [.] _int_malloc
>>
>> It seems to be stuck in hists__collapse_resort().
>>
>> In particular the overhead arises because the following loop in
>> append_chain_children():
>>
>>         /* lookup in childrens */
>>         chain_for_each_child(rnode, root) {
>>                 unsigned int ret = append_chain(rnode, cursor, period);
>>
>> reaches very high iteration counts, so the algorithm gets quadratic (at
>> least). The child count grows to over 100,000 entries in the end (!).
>>
>> I don't think the high child count in itself is anomalous: a kernel build
>> generates thousands of processes, tons of symbol ranges and tens of
>> millions of call chain entries.
>>
>> So I think what we need here is to speed up the lookup: put children into
>> a secondary, ->pos,len indexed range-rbtree and do a binary search instead
>> of a linear search over 100,000 child entries ... or something like that.
>
> You're right it's worth trying.
>
> At least it might give better results for such high scale callchain trees.
> I'll see what I can come up with.

I just converted it to a normal rbtree and the total processing time went
down from 380s to 20s! I couldn't see how to use the range-rbtree in
this case, so there's probably room for further enhancement. I'll send
the patches soon.
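
Roughly, the shape of the change is like below - just a sketch, the
structure and helper names are illustrative and not the actual patch:
each parent keeps its children in an rbtree keyed by the address of the
child's first chain entry, so the lookup in append_chain_children()
becomes O(log n) instead of a linear walk over the child list:

#include <linux/rbtree.h>
#include <linux/types.h>

/* Illustrative structures - not the actual perf callchain types. */
struct chain_node {
        struct rb_node  rb_node;                /* linked into the parent's tree */
        u64             start;                  /* key: address of the first entry */
        struct rb_root  rb_root_children;       /* children, keyed by ->start */
};

/* O(log n) child lookup replacing the linear chain_for_each_child() walk. */
static struct chain_node *find_child(struct rb_root *root, u64 start)
{
        struct rb_node *n = root->rb_node;

        while (n) {
                struct chain_node *child = rb_entry(n, struct chain_node, rb_node);

                if (start < child->start)
                        n = n->rb_left;
                else if (start > child->start)
                        n = n->rb_right;
                else
                        return child;
        }
        return NULL;
}

/* Standard rbtree insertion keyed by ->start. */
static void insert_child(struct rb_root *root, struct chain_node *new)
{
        struct rb_node **p = &root->rb_node, *parent = NULL;

        while (*p) {
                struct chain_node *child = rb_entry(*p, struct chain_node, rb_node);

                parent = *p;
                p = new->start < child->start ? &parent->rb_left : &parent->rb_right;
        }
        rb_link_node(&new->rb_node, parent, p);
        rb_insert_color(&new->rb_node, root);
}

With over 100,000 children per node, that alone turns the quadratic
walk into something close to O(n log n).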

>
>>
>> Btw., a side note, append_chain() is a rather confusing function in
>> itself, with logic-inversion gems like:
>>
>>         if (!found)
>>                 found = true;
>
> The check is pointless yeah, I'll remove that.
>
>>
>> All that should be cleaned up as well I guess.
>>
>> The 'IO overload' message appears to be a separate, unrelated bug, it just
>> annoyingly does not get refreshed away in the TUI before
>> hists__collapse_resort() is called, and there's also no progress bar for
>> the hists__collapse_resort() pass, so to the user it all looks like a
>> deadlock.
>>
>> So there's at least two bugs here:
>>
>> - the bad overhead in hists__collapse_resort()
>>
>> - bad usability if hists__collapse_resort() takes more than 1 second to finish
>
> Also IIUC, collapsing/merging hists is only used for comm hists merging, due to
> set_task_comm after exec.
>
> Perhaps we could anticipate the comm of a process based on tid/pid, fork
> and comm events? This way we could avoid the late collapses/merges. In the best
> case we could get rid of collapses entirely, and that would be some nice fresh
> air for util/hist.c.
>
> And ideally, the comm should be associated to a lifetime as a thread can change
> its comm anytime.

Right. I also thought about why the separate collapsing stage is
needed. Maybe we can collect hist entries that have the same comm at
insert time. One problem I can imagine is a target thread changing its
comm after some hist entries have been collected. In that case we'd
have to look up another thread that has the same old comm and pass the
entries over to it. But we currently have no record of which entries
belong to a given thread, so that would require a full traversal of the
hist entries. If we add that info to threads, I think we can get rid of
collapses entirely.
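
Something like the following is what I have in mind - again only a
sketch with illustrative names; the old_comm_owner argument stands for
the thread we'd look up by the old comm, not a real util/hist.c API:

#include <linux/list.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Illustrative structures - not the actual util/hist.c types. */
struct thread {
        pid_t                   pid;
        char                    *comm;
        struct list_head        entries;        /* hist entries owned by this thread */
};

struct hist_entry {
        struct list_head        thread_node;    /* linked into thread->entries */
        /* ... period, callchain, etc. */
};

/*
 * On a COMM event: hand the entries collected so far over to the
 * thread that keeps the old comm, so they stay grouped without a
 * late collapse pass, then take the new comm.
 */
static void thread__set_comm(struct thread *thread, const char *comm,
                             struct thread *old_comm_owner)
{
        list_splice_tail_init(&thread->entries, &old_comm_owner->entries);

        free(thread->comm);
        thread->comm = strdup(comm);
}

That way a comm change only touches the entries of the one thread
involved instead of traversing the whole hists tree.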

Thanks,
Namhyung