Re: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi

From: Alexey Budankov
Date: Mon May 29 2017 - 12:41:21 EST


On 29.05.2017 18:29, Peter Zijlstra wrote:
On Mon, May 29, 2017 at 05:22:54PM +0200, Peter Zijlstra wrote:
On Mon, May 29, 2017 at 04:43:09PM +0300, Alexey Budankov wrote:
On 29.05.2017 15:03, Alexander Shishkin wrote:
Alexey Budankov <alexey.budankov@xxxxxxxxxxxxxxx> writes:

+	} else if (event->cpu > node_event->cpu) {
+		node = &((*node)->rb_right);
+	} else {
+		list_add_tail(&event->group_list_entry,
+			      &node_event->group_list);

So why is this better than simply having per-cpu event lists plus one
for per-thread events?

Good question. The choice of data structure and layout depends on the
operations applied to the data, so keeping the groups in a tree simplifies
the implementation and improves its scalability and performance. Please ask
if you have more questions.
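
To make the layout in the quoted hunk concrete, here is a rough userspace sketch, not the patch itself. It assumes a plain unbalanced binary search tree keyed by cpu, where the kernel code uses an rb-tree, and every name in it (struct group, struct cpu_node, tree_insert, tree_lookup) is hypothetical. Groups with an equal cpu key are appended to the list of the existing node, which is the role list_add_tail() plays in the hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an event group; not the kernel's struct perf_event. */
struct group {
	int cpu;              /* -1 models a per-thread (any CPU) group */
	int id;
	struct group *next;   /* next group sharing the same cpu key */
};

/* One tree node per distinct cpu value; groups with that cpu hang off it. */
struct cpu_node {
	int cpu;
	struct group *head, *tail;
	struct cpu_node *left, *right;
};

/* Insert a group: walk by cpu, append to the node's list on an equal key. */
static void tree_insert(struct cpu_node **root, struct group *g)
{
	struct cpu_node **link = root;
	struct cpu_node *n;

	assert(g != NULL);
	g->next = NULL;
	while (*link) {
		n = *link;
		if (g->cpu < n->cpu) {
			link = &n->left;
		} else if (g->cpu > n->cpu) {
			link = &n->right;
		} else {
			/* Same cpu: append at the tail, like list_add_tail(). */
			n->tail->next = g;
			n->tail = g;
			return;
		}
	}

	/* No node for this cpu yet: create one holding just this group. */
	n = calloc(1, sizeof(*n));
	assert(n != NULL);
	n->cpu = g->cpu;
	n->head = n->tail = g;
	*link = n;
}

/* Return the first group for one cpu without visiting other CPUs' groups. */
static struct group *tree_lookup(const struct cpu_node *root, int cpu)
{
	while (root) {
		if (cpu < root->cpu)
			root = root->left;
		else if (cpu > root->cpu)
			root = root->right;
		else
			return root->head;
	}
	return NULL;
}
```

With this shape, finding the groups for one CPU means descending the tree by key instead of scanning every group in the context, which is presumably the kind of operation the tree is meant to make cheap.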

Since these lists are per context, and each task can have a context,
you'd end up with per-task-per-cpu memory, which is something we'd like
to avoid (some archs have very limited per-cpu memory space etc..).
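
A back-of-the-envelope sketch of that scaling, under stated assumptions: the struct layouts and helper names below are illustrative only and do not come from the patch or from the kernel headers. It compares the fixed per-context cost of one list head per CPU plus one for per-thread events against a single tree root. Tree nodes cost memory too, but only for CPUs that actually have events.

```c
#include <stddef.h>

/* Illustrative layouts only; the real kernel structures differ. */
struct list_head_sketch { struct list_head_sketch *next, *prev; };
struct tree_root_sketch { void *node; };

/* Fixed bytes per context with one list per CPU plus one per-thread list. */
static size_t per_cpu_lists_bytes(size_t nr_cpus)
{
	return (nr_cpus + 1) * sizeof(struct list_head_sketch);
}

/* Fixed bytes per context with a single tree root, whatever the CPU count. */
static size_t tree_root_bytes(void)
{
	return sizeof(struct tree_root_sketch);
}

/* Every task can own a context, so the fixed cost multiplies by task count. */
static size_t total_bytes(size_t nr_tasks, size_t bytes_per_context)
{
	return nr_tasks * bytes_per_context;
}
```

For example, total_bytes(1000, per_cpu_lists_bytes(256)) grows with both the task count and the CPU count, while total_bytes(1000, tree_root_bytes()) grows with the task count only. The figures 1000 and 256 are arbitrary illustration values.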

Aw, yeah. Memory consumption does matter in kernel space.


Also, we'd like to have that tree for other reasons, like for instance
that heterogeneous PMU crud ARM has. Also, with a tree we can more easily
do time-based round-robin scheduling,


Oh, and in general multi-PMU stuff, aside from hetero PMU, becomes much
easier.