Re: [RFC PATCH v2 0/5] Additional scheduling information in tracepoints

From: Mathieu Desnoyers
Date: Mon Sep 26 2016 - 15:36:50 EST


----- On Sep 26, 2016, at 8:27 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:

> On Fri, Sep 23, 2016 at 12:49:30PM -0400, Julien Desfossez wrote:
>> With this macro, we propose new versions of the sched_switch, sched_waking,
>> sched_process_fork and sched_pi_setprio tracepoint probes that contain more
>> scheduling information and get rid of the "prio" field. We also add the PI
>> information to these tracepoints, so if a process is currently boosted, we show
>> the name and PID of the top waiter. This allows one to quickly see the blocking
>> chain even if some of the trace background is missing.
>
> Urgh.. bigger mess than ever :-(
>
> So I thought the initial idea was to provide a 'blocked-on' tracepoint,
> along with the 'prio-changed' tracepoint, so you can reconstruct
> the entire PI chain.
>
> The only problem with that was initial state; when you start tracing (or
> miss the start of a trace) it's hard (impossible) to know what the
> current state is.
>
> But now you send a patch-set that just adds a metric ton of tracepoints.
>
> This doesn't fix the current mess, it makes it worse :-(

There are actually four problems we try to tackle here with this
patchset:

1) Missing explicit priority change instrumentation

We're covering it by adding a new "sched_update_prio" callsite
and user-visible tracepoint.


2) Missing "blocked-on" information for PI

We're covering it by adding a new user-visible tracepoint to
the sched_pi_setprio callsite. The following fields provide
the blocked-on info:

top_waiter_comm, top_waiter_pid

We chose to add it in a new user-visible tracepoint rather
than the current sched_pi_setprio so the new event would not
expose the internal "prio" task struct field, which is an internal
implementation detail of the scheduler AFAIU, and could
go away eventually.

We could move those fields to the preexisting sched_pi_setprio
event if you prefer, but then we would have to keep the "oldprio"
and "newprio" fields forever.

I would not call this a "blocked-on" tracepoint, because it is
specific to PI. The general "blocking" concept implies blocking
on a resource (e.g. waitqueue), and we only know which PID we
were waiting for when we are later awakened. In the PI case,
we know which PID owns the resource we are blocked on.
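
To illustrate how a trace consumer could use those fields, here is a
minimal sketch that rebuilds a PI blocking chain. It assumes each
event has been reduced to a (pid, top_waiter_pid) pair, where pid is
the boosted task (the lock owner), and that a top_waiter_pid of 0
means the task has no waiter left. Only top_waiter_pid is a field
name from this patchset; the pair layout, the 0 convention and the
function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_blocked_on(events: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Map each top waiter PID to the PID of the boosted lock owner.

    Each event is a (pid, top_waiter_pid) pair. A top_waiter_pid of 0
    is this sketch's convention for "no waiter left" (an assumption).
    """
    blocked_on: Dict[int, int] = {}
    for pid, top_waiter_pid in events:
        if top_waiter_pid:
            blocked_on[top_waiter_pid] = pid
        else:
            # Owner is no longer boosted: forget waiters recorded for it.
            blocked_on = {w: o for w, o in blocked_on.items() if o != pid}
    return blocked_on


def pi_chain(blocked_on: Dict[int, int], pid: int) -> List[int]:
    """Follow waiter -> owner links starting at pid."""
    chain = [pid]
    seen = {pid}
    while chain[-1] in blocked_on:
        owner = blocked_on[chain[-1]]
        if owner in seen:  # guard against inconsistent trace data
            break
        chain.append(owner)
        seen.add(owner)
    return chain


# 30 waits on a lock owned by 20, which waits on a lock owned by 10.
events = [(20, 30), (10, 20)]
print(" -> ".join(str(p) for p in pi_chain(build_blocked_on(events), 30)))
# prints "30 -> 20 -> 10"
```

Because every event carries the waiter, a single event is enough to
recover one link of the chain, even if earlier events were lost.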


3) Missing deadline scheduler instrumentation

We understood that exposing "prio" really does not cover the
deadline scheduler, as is clearly pointed out in your patchset.
We have added deadline scheduler info to a new set of user-visible
tracepoints, which are connected to the pre-existing tracepoint
callsites in the scheduler. We're therefore not "adding" scheduler
tracepoints in the fast-path source-code wise. We have named those
alternative versions with a "_prio" suffix.


4) Missing initial state

The sched_switch_prio tracepoint deals with the problem of missing
initial state: it's a tracepoint that occurs periodically, and we
can therefore get the initial state of a running thread when it
is scheduled. We chose to present it as an alternative tracepoint
from a user POV because we did not want to bloat the current
sched_switch event with lots of extra fields when prio information
is not needed.
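
As a rough sketch of the consumer side, the state of each thread can
be filled in the first time it is seen being scheduled, and then kept
current by the explicit priority change events. The event names
sched_switch_prio and sched_update_prio come from this patchset; the
dictionary layout and the field names (next_pid, next_policy,
next_nice, pid, policy, nice) are assumptions made for illustration,
not the actual event layout.

```python
from typing import Dict, Iterable, Mapping


def track_sched_state(events: Iterable[Mapping[str, object]]) -> Dict[int, dict]:
    """Return the last known scheduling state per PID.

    Field names are hypothetical; see the note above.
    """
    state: Dict[int, dict] = {}
    for ev in events:
        if ev["name"] == "sched_switch_prio":
            # Periodic event: gives the state of the incoming thread
            # even if tracing started after its last priority change.
            state[ev["next_pid"]] = {
                "policy": ev["next_policy"],
                "nice": ev["next_nice"],
            }
        elif ev["name"] == "sched_update_prio":
            # Explicit change: overrides whatever was known before.
            state[ev["pid"]] = {"policy": ev["policy"], "nice": ev["nice"]}
    return state


trace = [
    {"name": "sched_switch_prio", "next_pid": 42,
     "next_policy": "SCHED_OTHER", "next_nice": 0},
    {"name": "sched_update_prio", "pid": 42,
     "policy": "SCHED_OTHER", "nice": -5},
]
state = track_sched_state(trace)
```

A thread that never runs during the trace stays unknown in this
scheme, which is the trade-off of relying on a periodic event rather
than a full state dump.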

Do you recommend that we bring those new extra fields into the
pre-existing tracepoints instead, even considering the extra
bloat?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com