Re: Re: [RFC 0/6] optimize ctx switch with rb-tree

From: David Carrillo-Cisneros
Date: Mon May 15 2017 - 15:23:05 EST


> So, I see that the event_sched_out() function (4.11.0-rc6+), in addition to
> disabling an active (PERF_EVENT_STATE_ACTIVE) event in HW, also updates the
> tstamp fields of inactive (PERF_EVENT_STATE_INACTIVE) events assigned to
> "other" cpus (different from the one executing the function):
>
> static void
> event_sched_out(struct perf_event *event,
>                 struct perf_cpu_context *cpuctx,
>                 struct perf_event_context *ctx)
> {
>         u64 tstamp = perf_event_time(event);
>         u64 delta;
>
>         WARN_ON_ONCE(event->ctx != ctx);
>         lockdep_assert_held(&ctx->lock);
>
>         /*
>          * An event which could not be activated because of
>          * filter mismatch still needs to have its timings
>          * maintained, otherwise bogus information is return
>          * via read() for time_enabled, time_running:
>          */
> ->      if (event->state == PERF_EVENT_STATE_INACTIVE &&
> ->          !event_filter_match(event)) {
> ->              delta = tstamp - event->tstamp_stopped;
> ->              event->tstamp_running += delta;
> ->              event->tstamp_stopped = tstamp;
> ->      }
> ->
> ->      if (event->state != PERF_EVENT_STATE_ACTIVE)
> ->              return;
>
> I suggest moving this updating work into the perf_event_task_tick()
> callback. It runs per-cpu with a 1ms interval by default, so the inactive
> events would get their tstamp fields updated outside of this critical path:
>
> event_sched_out()
> group_sched_out()
> ctx_sched_out()
> perf_rotate_context()
> perf_mux_hrtimer_handler()
>
> This change would restrict the performance-critical processing to active
> events only, and the number of active events is always HW-limited.
>

I can see that updating times for all events not scheduled due to
event_filter_match is problematic, but I don't see much gain in moving
it to a callback, since the unnecessary and cache-unfriendly work
still needs to be performed.

What about trying to eliminate that update altogether? I am thinking
of using the context's timestamp (and/or adding other timekeeping
variables to it) to calculate/update tstamp_running and tstamp_stopped
on event read, so that unscheduled events don't need to update their
times on every context switch. What do you think?
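
Roughly what I have in mind, as a sketch only (the single event->tstamp
field and the event_update_times() helper below are hypothetical, they
don't exist in the current code):

static void event_update_times(struct perf_event *event, u64 now)
{
        /* time elapsed since the event last changed state (hypothetical field) */
        u64 delta = now - event->tstamp;

        if (event->state == PERF_EVENT_STATE_ACTIVE)
                event->total_time_running += delta;
        if (event->state >= PERF_EVENT_STATE_INACTIVE)
                event->total_time_enabled += delta;

        event->tstamp = now;
}

If something like that were called only from read() and on actual state
transitions, event_sched_out() wouldn't need to touch inactive events at
all; the cost would move to the readers and to real state changes.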

>>
>>> Could you provide me with the cumulative patch set to expedite the ramp
>>> up?
>>
>>
>> This RFC is my latest version. I did not have a good solution for handling
>> the failure of PMUs that share contexts, nor for activating/deactivating
>> them.
>>
>> Some things to keep in mind when dealing with task-contexts are:
>> 1. The number of PMUs is large and growing; iterating over all PMUs may
>> be expensive (see https://lkml.org/lkml/2017/1/18/859 ).
>> 2. event_filter_match in this RFC is only used because I did not find a
>> better way to filter out events with the rb-tree. It would be nice if we
>> didn't have to check event->cpu != -1 && event->cpu ==
>> smp_processor_id() and cgroup stuff for every event in task contexts.
>
>
> Checking an event's cpu affinity will be avoided, at least in the
> critical path mentioned above.
>
>> 3. I used the inactive events list in this RFC as a cheaper alternative
>> to threading the rb-tree, but it has the problem that events removed due
>> to a conflict would be placed at the end of the list even if they didn't
>> run. I cannot recall if that ever happens. Using this list also causes
>> problem (2.); maybe threading the tree is a better alternative?
>
>
> The simplest RB-tree key for the change being suggested could be something
> like {event->cpu, event->id}, so events could be organized into per-cpu
> sub-trees for fast search and enumeration. All software events could be
> grouped under event->cpu == -1.

Software events (aka events in "software" task context) may have cpu != -1.

I think you need some timestamp in the key to help you find the next
event to schedule.
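
Purely as an illustration of what I mean (the comparator below and the
use of tstamp_stopped as a secondary key are my assumptions, not
something from the RFC):

/*
 * Hypothetical ordering: group events by cpu (software/any-cpu events
 * under -1), then by the time they were last scheduled out, so the
 * least recently run event in a cpu's subtree is found first.
 */
static bool perf_event_node_less(struct perf_event *ea,
                                 struct perf_event *eb)
{
        if (ea->cpu != eb->cpu)
                return ea->cpu < eb->cpu;

        return ea->tstamp_stopped < eb->tstamp_stopped;
}

That would keep per-cpu subtrees cheap to walk and still let the
scheduler pick the event that has waited the longest.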

Do you have plans to address the problem of events in software context
that fail to schedule (like CQM), or of more than one hardware PMU?

For me, this whole thing turned out to be much more complex than I
initially anticipated (one of the reasons I put my version on hold).
It may be a good idea to iron out event timekeeping before trying
to improve the data structures used in event scheduling (i.e.
pinned/flexible lists vs rb-tree). Just to support my point, look at
how cgroup events keep their time (__update_cgrp_time et al.).
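
IIRC that path boils down to a per-cpu accumulator plus a timestamp that
gets folded in lazily, roughly along these lines (paraphrased from
memory, not a verbatim copy):

static inline void __update_cgrp_time(struct perf_cgroup *cgrp)
{
        struct perf_cgroup_info *info = this_cpu_ptr(cgrp->info);
        u64 now = perf_clock();

        /* fold the time since the last update into the running total */
        info->time += now - info->timestamp;
        info->timestamp = now;
}

and a fair amount of machinery sits around it just to keep those numbers
consistent.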

David