Re: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi

From: Alexey Budankov
Date: Wed Jun 14 2017 - 06:07:54 EST


On 30.05.2017 11:29, Alexander Shishkin wrote:
Alexey Budankov <alexey.budankov@xxxxxxxxxxxxxxx> writes:

On 29.05.2017 15:03, Alexander Shishkin wrote:
Alexey Budankov <alexey.budankov@xxxxxxxxxxxxxxx> writes:

Here (above the function) you could include a comment describing what
happens when this is called, locking considerations, etc.

I can put the short description from the initial thread message here.
Would it be sufficient?

Sure, this is where API descriptions fit better than in commit messages.



+static int
+perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
+{
+ struct rb_node **node;
+ struct rb_node *parent;
+
+ if (!tree || !event)
+ return 0;

I don't think this should be happening, should it? And either way you
probably don't want to return 0 here, unless you're using !0 for
success.

As you might notice already, currently return codes of the tree API are
not checked all other the implementation. I suggest replacing that int
error code by void and simplify the stuff.

Your call. But I'd still either drop the redundant checks or wrap them
in WARN_ON_ONCE().

Ok. WARN_ON_ONCE() then.




+
+ node = &tree->rb_node;
+ parent = *node;
+
+ while (*node) {
+ struct perf_event *node_event = container_of(*node,
+ struct perf_event, group_node);
+
+ parent = *node;
+
+ if (event->cpu < node_event->cpu) {
+ node = &((*node)->rb_left);

this would be the same as node = &parent->rb_left, right?

Yes, that is right.


Please ask more.

Side note: between commit message, comments and the actual code, in an
ideal situation one doesn't have to 'ask' anything, because everything
is already clear. Not the case here.

node is the leaf node and parent is the parent of the
node at the end of cycle. We need the both to insert a new node into a
tree.

Not sure I understand. You'd still have both.



+ } else if (event->cpu > node_event->cpu) {
+ node = &((*node)->rb_right);
+ } else {
+ list_add_tail(&event->group_list_entry,
+ &node_event->group_list);

So why is this better than simply having per-cpu event lists plus one
for per-thread events?

Good question. Choice of data structure and layout depends on the
operations applied to the data so keeping groups as a tree simplifies
and improves the implementation in terms of scalability and performance.
Please ask more if any.

Please be more specific on how scalability and performance are
improved. In general, try to avoid vagues statements like "this is
better for performance".

Accepted. Peter already provided more specifics on this. Thanks.


Thanks,
--
Alex