Fwd: Re: [RFC 0/6] optimize ctx switch with rb-tree

From: Alexey Budankov
Date: Thu May 11 2017 - 08:13:15 EST
-------- Forwarded Message --------
Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree
Date: Thu, 11 May 2017 14:53:48 +0300
From: Alexey Budankov <alexey.budankov@xxxxxxxxxxxxxxx>
Organization: Intel Corp.
To: davidcc@xxxxxxxxxx
CC: alexey.budankov@xxxxxxxxxxxxxxx

On 02.05.2017 23:59, Budankov, Alexey wrote:

Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree

On Wed, Apr 26, 2017 at 3:34 AM, Budankov, Alexey <alexey.budankov@xxxxxxxxx> wrote:
Hi David,

I would like to take over on the patches development relying on your help with reviews.

Sounds good.

Hi,

Sorry for the long reply delay; Russian holidays ran into vacations.

So, I see that the event_sched_out() function (4.11.0-rc6+), in addition to disabling an active (PERF_EVENT_STATE_ACTIVE) event in HW, also updates the tstamp fields of inactive (PERF_EVENT_STATE_INACTIVE) events assigned to "the other" CPUs (CPUs different from the one executing the function):

static void
event_sched_out(struct perf_event *event,
		struct perf_cpu_context *cpuctx,
		struct perf_event_context *ctx)
{
	u64 tstamp = perf_event_time(event);
	u64 delta;

	WARN_ON_ONCE(event->ctx != ctx);
	lockdep_assert_held(&ctx->lock);

	/*
	 * An event which could not be activated because of
	 * filter mismatch still needs to have its timings
	 * maintained, otherwise bogus information is return
	 * via read() for time_enabled, time_running:
	 */
->	if (event->state == PERF_EVENT_STATE_INACTIVE &&
->	    !event_filter_match(event)) {
->		delta = tstamp - event->tstamp_stopped;
->		event->tstamp_running += delta;
->		event->tstamp_stopped = tstamp;
->	}
->
->	if (event->state != PERF_EVENT_STATE_ACTIVE)
->		return;

I suggest moving this update work into the perf_event_task_tick() callback. It runs per-CPU with a 1ms interval by default, so inactive events would get their tstamp fields updated off this critical path:

event_sched_out()
group_sched_out()
ctx_sched_out()
perf_rotate_context()
perf_mux_hrtimer_handler()

This change would shorten the performance-critical processing to active events only, as the number of active events is always HW-limited.


Could you provide me with the cumulative patch set to expedite the ramp up?

This RFC is my latest version. I did not find a good solution for handling failures of PMUs that share contexts, nor for activating/deactivating them.

Some things to keep in mind when dealing with task-contexts are:
1. The number of PMUs is large and growing, iterating over all PMUs may be expensive (see https://lkml.org/lkml/2017/1/18/859 ).
2. event_filter_match in this RFC is only used because I did not find a better way to filter out events with the rb-tree. It would be nice if we didn't have to check event->cpu != -1 && event->cpu == smp_processor_id() and the cgroup stuff for every event in task contexts.

Checking an event for cpu affinity will be avoided, at least for the critical path mentioned above.

3. I used the inactive events list in this RFC as a cheaper alternative to threading the rb-tree, but it has the problem that events removed due to a conflict would be placed at the end of the list even if they didn't run. I cannot recall if that ever happens. Using this list also causes problem (2.); maybe threading the tree is a better alternative?

The simplest rb-tree key for the suggested change could be {event->cpu, event->id}, so that events are organized into per-cpu sub-trees for fast search and enumeration. All software events could be grouped under event->cpu == -1.

4. Making the key for task events {PMU,CPU,last_time_scheduled} (as opposed to {CPU,last_time_scheduled} in the RFC) may simplify sched-in by allowing iteration over all events of the same PMU at once, simplifying activation/deactivation of the PMU and making it easy to move on to the next PMU on pmu::add errors. The problem with this approach is finding only the PMUs with inactive events without traversing a list of all PMUs. Maybe a per-context list of active PMUs would help (see 1.).

cpu-contexts are much simpler and I think work well with what the RFC does (they are per-pmu already).

This thread has Peter and Mark's original discussion of the rb-tree (https://patchwork.kernel.org/patch/9176121/).

Thanks,
David


What do you think?

Thanks,
Alexey