Re: [RFC PATCH tip 4/5] use BPF in tracing filters

From: Steven Rostedt
Date: Tue Dec 03 2013 - 20:11:42 EST


On Wed, 04 Dec 2013 09:48:44 +0900
Masami Hiramatsu <masami.hiramatsu.pt@xxxxxxxxxxx> wrote:

> (2013/12/03 13:28), Alexei Starovoitov wrote:
> > Such filters can be written in C and allow safe read-only access to
> > any kernel data structure.
> > Like systemtap but with safety guaranteed by kernel.
> >
> > The user can do:
> > cat bpf_program > /sys/kernel/debug/tracing/.../filter
> > if tracing event is either static or dynamic via kprobe_events.
> >
> > The program can be anything as long as bpf_check() can verify its safety.
> > For example, the user can create kprobe_event on dst_discard()
> > and use logically following code inside BPF filter:
> > skb = (struct sk_buff *)ctx->regs.di;
> > dev = bpf_load_pointer(&skb->dev);
> > to access 'struct net_device'
> > Since its prototype is 'int dst_discard(struct sk_buff *skb);'
> > 'skb' pointer is in 'rdi' register on x86_64
> > bpf_load_pointer() will try to fetch 'dev' field of 'sk_buff'
> > structure and will suppress page-fault if pointer is incorrect.
>
> Hmm, I doubt it is a good way to integrate with ftrace.
> I prefer to use this for replacing current ftrace filter,

I'm not sure how we can do that. Especially since the bpf is very arch
specific, and the current filters work for all archs.

> fetch functions and actions. In that case, we can continue
> to use current interface but much faster to trace.
> Also, we can see what filter/arguments/actions are set
> on each event.

There's also the problem that the current filters work with the results
of what is written to the buffer, not what is passed in by the trace
point, as that isn't even displayed to the user.

For example, sched_switch gets passed struct task_struct *prev and
*next, and from those we save prev_comm, prev_pid, prev_prio, prev_state,
next_comm, next_pid and next_prio. These are exposed to the user
by the format file of the event:

    field:char prev_comm[32]; offset:16; size:16; signed:1;
    field:pid_t prev_pid; offset:32; size:4; signed:1;
    field:int prev_prio; offset:36; size:4; signed:1;
    field:long prev_state; offset:40; size:8; signed:1;
    field:char next_comm[32]; offset:48; size:16; signed:1;
    field:pid_t next_pid; offset:64; size:4; signed:1;
    field:int next_prio; offset:68; size:4; signed:1;

And the filters can check "next_prio > 10" and what not. The bpf
program needs to access next->prio. There's nothing that shows the user
what is passed to the tracepoint, and from that, what structure member
to use from there. The user would be required to look at the source
code of the given kernel. A requirement not needed by the current
implementation.
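
To make the difference concrete, here is a rough userspace sketch in C.
It is not kernel code: struct task, struct sched_switch_entry and all
three functions are hypothetical, simplified stand-ins. It only shows
that the same "next_prio > 10" test can be written against the recorded
entry (what the format file describes) or against the raw tracepoint
arguments (what a bpf program would see).

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the kernel's task_struct. */
struct task {
    char comm[16];
    int  pid;
    int  prio;
};

/* Hypothetical record: what the event writes to the ring buffer and
 * what the format file describes to the user. */
struct sched_switch_entry {
    char prev_comm[16];
    int  prev_pid;
    int  prev_prio;
    long prev_state;
    char next_comm[16];
    int  next_pid;
    int  next_prio;
};

/* Simplified version of the event's assignment step. */
void fill_entry(struct sched_switch_entry *e, const struct task *prev,
                long prev_state, const struct task *next)
{
    memcpy(e->prev_comm, prev->comm, sizeof(e->prev_comm));
    e->prev_pid   = prev->pid;
    e->prev_prio  = prev->prio;
    e->prev_state = prev_state;
    memcpy(e->next_comm, next->comm, sizeof(e->next_comm));
    e->next_pid   = next->pid;
    e->next_prio  = next->prio;
}

/* Current-style filter: "next_prio > 10", evaluated on the record.
 * The field name comes straight from the format file. */
bool filter_on_entry(const struct sched_switch_entry *e)
{
    return e->next_prio > 10;
}

/* bpf-style filter: same test on the raw arguments. The author has to
 * know from the source that the member is next->prio. */
bool filter_on_args(const struct task *prev, const struct task *next)
{
    (void)prev; /* unused in this particular filter */
    return next->prio > 10;
}
```

Both filters give the same answer here, but only the first one can be
written from the format file alone.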

Also, there are results that cannot be trivially converted. Taking a
quick look at some TRACE_EVENT() structures, I found bcache_bio that
has this:

	TP_fast_assign(
		__entry->dev		= bio->bi_bdev->bd_dev;
		__entry->sector		= bio->bi_sector;
		__entry->nr_sector	= bio->bi_size >> 9;
		blk_fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
	),

Where blk_fill_rwbs() fills in __entry->rwbs based on the bi_rw
field. A filter must remain backward compatible with something like:

rwbs == "w" or rwbs =~ '*w*'
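
As a rough illustration of why this is awkward, here is a userspace
sketch in C. The flag bits, fill_rwbs_sketch() and glob_match() are all
hypothetical. This is not how blk_fill_rwbs() or the filter code is
actually implemented, and the letters are only chosen to match the
filter strings above. The point is that the filter compares against a
derived string, which exists only after the assignment step has run.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits, standing in for the bits of bi_rw. */
enum {
    SKETCH_WRITE = 1u << 0,
    SKETCH_SYNC  = 1u << 1,
};

/* Hypothetical stand-in for the helper that derives the rwbs string. */
void fill_rwbs_sketch(char *rwbs, size_t len, unsigned int flags)
{
    size_t i = 0;

    if (len == 0)
        return;
    if (i + 1 < len)
        rwbs[i++] = (flags & SKETCH_WRITE) ? 'w' : 'r';
    if ((flags & SKETCH_SYNC) && i + 1 < len)
        rwbs[i++] = 's';
    rwbs[i] = '\0';
}

/* Minimal glob matcher: '*' matches zero or more characters, every
 * other character matches only itself. */
bool glob_match(const char *pat, const char *str)
{
    if (*pat == '\0')
        return *str == '\0';
    if (*pat == '*')
        return glob_match(pat + 1, str) ||
               (*str != '\0' && glob_match(pat, str + 1));
    return *str == *pat && glob_match(pat + 1, str + 1);
}
```

A filter working on the raw arguments would have to redo the flag
decoding itself to give the same answer as glob_match("*w*", rwbs).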


Now maybe we can make the filter code use some of the bpf if possible,
but to get the result, it still needs to write to the ring buffer, and
discard the entry if the filter does not match. That will not make it
any faster than the original trace, but perhaps faster than the trace +
current filter.
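
The record-then-discard flow looks roughly like the following userspace
sketch in C. struct ring and the ring_*() helpers are hypothetical and
far simpler than the real ring buffer. It only shows that the cost of
filling in the entry is paid before the filter can run.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SLOTS 8

/* Hypothetical recorded entry with a single field. */
struct entry {
    int next_prio;
};

/* Hypothetical, much simplified ring buffer. */
struct ring {
    struct entry slots[SKETCH_SLOTS];
    size_t committed;
};

/* Hand out the next free slot, or NULL if the buffer is full. */
struct entry *ring_reserve(struct ring *r)
{
    if (r->committed >= SKETCH_SLOTS)
        return NULL;
    return &r->slots[r->committed];
}

/* Make the reserved slot visible to readers. */
void ring_commit(struct ring *r)
{
    r->committed++;
}

/* Drop the reserved slot; it will simply be reused. */
void ring_discard(struct ring *r)
{
    (void)r;
}

/* Record first, filter second: returns true if the event was kept. */
bool trace_with_filter(struct ring *r, int next_prio)
{
    struct entry *e = ring_reserve(r);

    if (!e)
        return false;

    e->next_prio = next_prio;   /* the recording work happens here */

    if (e->next_prio > 10) {    /* filter: "next_prio > 10" */
        ring_commit(r);
        return true;
    }
    ring_discard(r);
    return false;
}
```

A filter on the raw arguments could reject the event before
ring_reserve() is ever called, which is where the speed up comes from.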

The speed up that was shown was because we were processing the
parameters of the trace point and not the result. That currently
requires the user to have full access to the source of the kernel they
are tracing.

-- Steve