Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints

From: Alexei Starovoitov
Date: Fri Dec 15 2017 - 12:11:10 EST


On 12/14/17 11:39 PM, Peter Zijlstra wrote:
On Thu, Dec 14, 2017 at 07:16:00PM -0800, Alexei Starovoitov wrote:
On 12/14/17 12:49 PM, Peter Zijlstra wrote:
On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
This set of commits attempts to improve three scheduler related
Tracepoints: sched_switch, sched_process_fork, sched_process_exit.

Firstly, these commits add additional flag values, namely preempt,
clone_flags and group_dead to these Tracepoints, to make information
exposed via the Tracepoints more useful and complete.

Secondly, these commits expose task_struct pointers in these
Tracepoints. The task_struct pointers are arguments of the Tracepoints
and currently only used to compute struct field values. But for BPF
programs attached to these Tracepoints, we may want to read additional
task information via the task_struct pointers. This is currently either
impossible, or we have to make assumptions about whether the Tracepoint is
running from previous / parent or next / child, and use current pointer
instead. Exposing the task_struct pointers explicitly makes such use
case easier and more reliable.


NAK

not sure what is the concern here.
Is it first or second part of the above ?

Definitely the second, but also the first. You know I would have ripped
out all scheduler tracepoints if I could have. They're a pain in the
arse.

A lot of people want to add to the tracepoints, with the end result that
they'll end up a big bloated pile of useless crap. The first part is
just the pieces you want added.

As to the second, that's complete crap; that just makes everything
slower for nobody's benefit. If you register a traceprobe you already get
access to these things.

I think your problem is that you use perf to get access to the
tracepoints, which then means you have to do disgusting things like
this.

yeah. Currently bpf progs are called at the end of
perf_trace_##call()
{
.. regular tracepoint copy cruft
perf_trace_run_bpf_submit( &copied args )
}

from bpf pov we'd rather get access to raw args passed into
perf_trace_##call.
Sounds like you're suggesting to let bpf side register its
progs directly via tracepoint_probe_register() ?
That would solve the whole thing really nicely indeed.
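
To make the difference concrete, here is a rough userspace sketch, not kernel
code. Everything in it is a made-up stand-in: struct task for task_struct,
mock_probe_register() for tracepoint_probe_register(), and
mock_trace_sched_switch() for the tracepoint call site. It only illustrates
that a probe registered directly sees the raw arguments, while the perf-style
path sees just the fields that were copied into the record.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct task_struct. */
struct task {
    int pid;
    int prio;
};

/* Fixed-layout record, like the copy the perf path hands to a bpf prog. */
struct switch_record {
    int prev_pid;
    int next_pid;
};

/* Probe signature: private data first, then the raw tracepoint arguments. */
typedef void (*switch_probe_t)(void *data, int preempt,
                               struct task *prev, struct task *next);

static switch_probe_t registered_probe;
static void *registered_data;

/* Mock of tracepoint_probe_register(): a single probe slot, no locking. */
static int mock_probe_register(switch_probe_t probe, void *data)
{
    if (probe == NULL || registered_probe != NULL)
        return -1;
    registered_probe = probe;
    registered_data = data;
    return 0;
}

/* Mock of the matching unregister: frees the single slot. */
static void mock_probe_unregister(void)
{
    registered_probe = NULL;
    registered_data = NULL;
}

/* Mock of the tracepoint call site inside the scheduler. */
static void mock_trace_sched_switch(int preempt,
                                    struct task *prev, struct task *next)
{
    if (registered_probe)
        registered_probe(registered_data, preempt, prev, next);
}

/* Perf-style probe: copies selected fields; a consumer sees only the copy. */
static void perf_style_probe(void *data, int preempt,
                             struct task *prev, struct task *next)
{
    struct switch_record *rec = data;

    (void)preempt; /* not part of the record, so it is lost to the consumer */
    rec->prev_pid = prev->pid;
    rec->next_pid = next->pid;
}

/* What the raw probe chooses to read; any field of prev/next is reachable. */
struct raw_view {
    int preempt;
    int next_prio;
};

/* Raw probe: gets the arguments themselves and reads them directly. */
static void raw_probe(void *data, int preempt,
                      struct task *prev, struct task *next)
{
    struct raw_view *view = data;

    (void)prev;
    view->preempt = preempt;
    view->next_prio = next->prio;
}
```

With perf_style_probe registered, a consumer can only ever see prev_pid and
next_pid. With raw_probe registered, preempt and next->prio are available
without adding fields to the record.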

What would such an API look like?
Something like extending kprobe/uprobe fd-based perf_event_open?
https://www.spinics.net/lists/netdev/msg470567.html
btw, could you please apply that set to the tip tree,
or do you want us to route it via bpf-next -> net-next?

Thanks