Re: [PATCH RFC net-next 11/14] tracing: allow eBPF programs to be attached to events

From: Alexei Starovoitov
Date: Wed Jul 02 2014 - 02:14:52 EST


On Tue, Jul 1, 2014 at 10:32 PM, Namhyung Kim <namhyung@xxxxxxxxx> wrote:
> On Fri, 27 Jun 2014 17:06:03 -0700, Alexei Starovoitov wrote:
>> User interface:
>> cat bpf_123 > /sys/kernel/debug/tracing/__event__/filter
>>
>> where 123 is the id of a previously loaded eBPF program.
>> __event__ is a static tracepoint event.
>> (kprobe events will be supported in future patches)
>>
>> eBPF programs can call in-kernel helper functions to:
>> - lookup/update/delete elements in maps
>> - memcmp
>> - trace_printk
>
> ISTR Steve doesn't like to use trace_printk() (at least for production
> kernels) anymore. And I'm not sure it'd work if there's no existing
> trace_printk() on a system.

Yes. I saw the big warning that trace_printk_init_buffers() emits.
The idea here is to use eBPF programs for live kernel debugging.
Instead of adding printk() and recompiling, just write a program,
attach it to some event, and print whatever is interesting.
My only concern about trace_printk() was that it dumps things into the
trace buffers (which is still better than dumping stuff to syslog), but
now (since Andy almost convinced me to switch to an 'fd' based
interface) we can have a seq_printk-like helper that prints into a
special buffer, so that user space does 'read(ufd)' and receives
whatever the program has printed. I think that would be much cleaner.
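
To make the user-space side concrete, here is a rough sketch of what
reading that buffer could look like. It assumes 'ufd' is a readable
file descriptor tied to the program's output buffer; that interface
does not exist yet, and drain_prog_output() is just a name made up
for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Copy everything currently readable from 'ufd' to 'out'.
 * 'ufd' stands in for the proposed per-program fd (an assumption);
 * any readable fd behaves the same way here.
 * Returns the number of bytes copied, or -1 on a read error.
 */
static long drain_prog_output(int ufd, FILE *out)
{
	char buf[4096];
	long total = 0;
	ssize_t n;

	for (;;) {
		n = read(ufd, buf, sizeof(buf));
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (n == 0)
			break;			/* end of output */
		fwrite(buf, 1, (size_t)n, out);
		total += n;
	}
	return total;
}
```

A tool would call drain_prog_output(ufd, stdout) after attaching the
program, instead of parsing the shared trace buffer.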

>> + if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) && \
>> + unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
>> + struct bpf_context __ctx; \
>> + \
>> + populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0); \
>> + trace_filter_call_bpf(ftrace_file->filter, &__ctx); \
>> + return; \
>> + } \
>> + \
>
> Hmm.. But it seems the eBPF prog is not a filter - it'd always drop the
> event. And I think it's better to use a recorded entry rather than args
> as a bpf_context so that tools like perf can manipulate it at compile
> time based on the event format.

Manipulate what at compile time? Entry records of tracepoints are
hard-coded per event. For the verifier it's easier to treat all
tracepoint events as if they received the same 'struct bpf_context'
of N arguments; then the same program can be attached to multiple
tracepoint events at the same time.
I thought about making the verifier specific to _every_ tracepoint
event, but that complicates the user interface, since 'bpf_context'
would then be different for every program. I think args are much easier
to deal with from a C programming point of view, since the program can
go and fetch the same fields that the tracepoint 'fast_assign' macro
does.
Also, skipping buffer allocation and fast_assign gives a sizable
performance boost, since the program accesses only what it needs.
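
As a user-space illustration of the "same context for every event"
idea, something along these lines. The field names arg1..arg6 and the
helpers fire_event() and count_len_prog() are assumptions made for this
sketch; the patch itself only shows populate_bpf_context() being handed
the tracepoint args padded with zeros.

```c
#include <stdint.h>

/* Hypothetical layout: one fixed context shared by all events. */
struct bpf_context {
	uint64_t arg1, arg2, arg3, arg4, arg5, arg6;
};

typedef void (*prog_fn)(const struct bpf_context *ctx);

static uint64_t bytes_seen;

/* A "program" that touches only the one argument it cares about. */
static void count_len_prog(const struct bpf_context *ctx)
{
	bytes_seen += ctx->arg2;
}

/*
 * Stand-in for a tracepoint hit: the raw args go straight into the
 * context, unused slots stay zero, and no entry record is built.
 */
static void fire_event(prog_fn prog, uint64_t a1, uint64_t a2, uint64_t a3)
{
	struct bpf_context ctx = { .arg1 = a1, .arg2 = a2, .arg3 = a3 };

	prog(&ctx);
}
```

Two different events can both call fire_event(count_len_prog, ...) with
their own arguments, and the program does not need to know which event
it came from.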

The return value of the eBPF program is ignored, since I couldn't think
of a use case for it. We can change it to be more 'filter'-like and
interpret the return value as true/false: whether to record this event
or not. Thoughts?
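
A possible shape for the filter-like variant, again only as a
user-space sketch: struct filter_ctx, trace_event(), record_event() and
only_large() are all hypothetical names, and a non-zero return is
assumed to mean "record the event".

```c
#include <stdint.h>

/* Hypothetical two-argument context, enough for the illustration. */
struct filter_ctx {
	uint64_t arg1, arg2;
};

typedef int (*filter_prog_fn)(const struct filter_ctx *ctx);

static unsigned int recorded;

/* Stand-in for writing the event into the ring buffer. */
static void record_event(const struct filter_ctx *ctx)
{
	(void)ctx;
	recorded++;
}

/* Run the attached program, if any; a zero return drops the event. */
static void trace_event(filter_prog_fn prog, uint64_t a1, uint64_t a2)
{
	struct filter_ctx ctx = { .arg1 = a1, .arg2 = a2 };

	if (prog && !prog(&ctx))
		return;
	record_event(&ctx);
}

/* Example program: keep only events whose second arg is >= 4096. */
static int only_large(const struct filter_ctx *ctx)
{
	return ctx->arg2 >= 4096;
}
```

With this shape a program that only wants side effects (maps, printing)
can return 0 and get the current "always drop" behavior.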