Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi

From: Alexey Budankov
Date: Wed Jun 14 2017 - 07:27:22 EST


On 01.06.2017 0:33, David Carrillo-Cisneros wrote:
On Sat, May 27, 2017 at 4:19 AM, Alexey Budankov
<alexey.budankov@xxxxxxxxxxxxxxx> wrote:
Motivation:

The issue manifests as a 4x slowdown when profiling a single-thread STREAM
benchmark on Intel Xeon Phi running RHEL7.2 (Intel MPSS distribution).
Perf profiling is done in per-process mode and involves about 30 core
events. When the benchmark is OpenMP based and runs under profiling in
272 threads, the overhead becomes even more dramatic: 512.144s against
36.192s (with this patch).

How long does it take without any perf monitoring? Could you provide
more details about the benchmark? How many CPUs are being monitored?

STREAM runs for about 10 seconds without Perf monitoring on Intel KNL in 272 threads (272 CPUs are monitored). More on the benchmark is here: https://www.cs.virginia.edu/stream/


SNIP
different from the one executing the handler. Additionally, for every
filtered-out group, group_sched_out() updates tstamp values to the current
interrupt time. This updating work is now done only once by
update_context_time(), called by ctx_sched_out() before the CPU groups
iteration.

I don't see this. e.g.:

It is done here:

@@ -1383,6 +1384,9 @@ static void update_context_time(struct perf_event_context *ctx)

ctx->time += now - ctx->timestamp;
ctx->timestamp = now;
+
+ ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
+ ctx->tstamp_data.stopped = ctx->time;
}

Unscheduled events' tstamps are updated all at once in update_context_time(), called from ctx_sched_out().
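
For reference, a minimal sketch of the layout the hunks above rely on: the timestamps live in a separate structure and the event only keeps a pointer to it. Only the field names below come from the diff; the struct name and comments are assumptions.

struct perf_event_tstamp {
        /* values are in ctx time, as maintained by update_context_time() */
        u64 enabled;    /* when the event was enabled */
        u64 running;    /* reference point used to derive total_time_running */
        u64 stopped;    /* ctx time of the last sched out */
};

/*
 * event->tstamp normally points at the event's own instance; for groups
 * filtered out on this CPU it can be switched to point at ctx->tstamp_data,
 * so the single update in update_context_time() advances the times of all
 * such events at once instead of group_sched_out() touching each group.
 */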

in your patch task_ctx_sched_out calls ctx_sched_out with mux == 0,
that path does the exact same thing as before your patch.
I understand why you want to move the event's times to a different
structure and keep a pointer in the event, but I don't see that you
are avoiding the update of the times of unscheduled events.


static u64 perf_event_time(struct perf_event *event)
@@ -1424,16 +1428,16 @@ static void update_event_times(struct perf_event *event)
else if (ctx->is_active)
run_end = ctx->time;
else
- run_end = event->tstamp_stopped;
+ run_end = event->tstamp->stopped;

- event->total_time_enabled = run_end - event->tstamp_enabled;
+ event->total_time_enabled = run_end - event->tstamp->enabled;

if (event->state == PERF_EVENT_STATE_INACTIVE)
- run_end = event->tstamp_stopped;
+ run_end = event->tstamp->stopped;
else
run_end = perf_event_time(event);

- event->total_time_running = run_end - event->tstamp_running;
+ event->total_time_running = run_end - event->tstamp->running;

FWICT, this is run for ALL events in context with matching CPU.


SNIP
}
@@ -3051,9 +3277,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
* Called with IRQs disabled
*/
static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
- enum event_type_t event_type)
+ enum event_type_t event_type, int mux)
{
- ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
+ ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
}
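
A rough, hypothetical sketch of the two kinds of call sites the new mux
argument is meant to distinguish (caller names and arguments below are
simplifications for illustration, not the literal patch):

/* Context switch path: mux == 0, behaves exactly as before the patch. */
static void sched_out_on_context_switch(struct perf_cpu_context *cpuctx,
                                        struct perf_event_context *ctx)
{
        ctx_sched_out(ctx, cpuctx, EVENT_ALL, 0);
}

/* Multiplexing (rotation) timer path: mux == 1 requests the cheaper,
 * per-CPU iteration over the groups. */
static void sched_out_for_rotation(struct perf_cpu_context *cpuctx)
{
        cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, 1);
}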

static void
@@ -3061,29 +3287,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
struct perf_cpu_context *cpuctx)
{
struct perf_event *event;
-
- list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
- if (event->state <= PERF_EVENT_STATE_OFF)
- continue;
- if (!event_filter_match(event))
- continue;

Could we remove or simplify the tests in event_filter_match() since the
rb-tree already filters by cpu?

Why do you want to remove or simplify event_filter_match()?
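
For reference, event_filter_match() around this kernel version reads roughly
as follows (paraphrased from mainline, not part of the patch); the event->cpu
check is the test the per-CPU rb-tree lookup could in principle make redundant:

static inline int
event_filter_match(struct perf_event *event)
{
        return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
               perf_cgroup_match(event) &&
               pmu_filter_match(event);
}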


-
- /* may need to reset tstamp_enabled */
- if (is_cgroup_event(event))
- perf_cgroup_mark_enabled(event, ctx);
-
- if (group_can_go_on(event, cpuctx, 1))
- group_sched_in(event, cpuctx, ctx);
-
- /*
- * If this pinned group hasn't been scheduled,
- * put it in error state.
- */
- if (event->state == PERF_EVENT_STATE_INACTIVE) {
- update_group_times(event);
- event->state = PERF_EVENT_STATE_ERROR;
- }
- }
+ list_for_each_entry(event, &ctx->pinned_groups, group_entry)
+ ctx_sched_in_pinned_group(ctx, cpuctx, event);
}
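
Assuming ctx_sched_in_pinned_group() simply carries the loop body removed
above, a reconstruction from the deleted lines would read roughly like this
(a sketch, not a quote from the patch):

static void ctx_sched_in_pinned_group(struct perf_event_context *ctx,
                                      struct perf_cpu_context *cpuctx,
                                      struct perf_event *event)
{
        if (event->state <= PERF_EVENT_STATE_OFF)
                return;
        if (!event_filter_match(event))
                return;

        /* may need to reset tstamp_enabled */
        if (is_cgroup_event(event))
                perf_cgroup_mark_enabled(event, ctx);

        if (group_can_go_on(event, cpuctx, 1))
                group_sched_in(event, cpuctx, ctx);

        /* if this pinned group hasn't been scheduled, put it in error state */
        if (event->state == PERF_EVENT_STATE_INACTIVE) {
                update_group_times(event);
                event->state = PERF_EVENT_STATE_ERROR;
        }
}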

static void
@@ -3092,37 +3297,19 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
{
struct perf_event *event;
int can_add_hw = 1;
-
- list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
- /* Ignore events in OFF or ERROR state */
- if (event->state <= PERF_EVENT_STATE_OFF)
- continue;
- /*
- * Listen to the 'cpu' scheduling filter constraint
- * of events:
- */
- if (!event_filter_match(event))
- continue;
same as before.