Re: [RFC/PATCHSET 0/7] perf record: Implement BPF sample filter (v1)

From: Namhyung Kim
Date: Wed Feb 22 2023 - 14:43:17 EST


Hi Jiri,

On Tue, Feb 21, 2023 at 3:54 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Tue, Feb 14, 2023 at 10:01:41AM -0800, Namhyung Kim wrote:
> > Hi Ian,
> >
> > On Tue, Feb 14, 2023 at 8:58 AM Ian Rogers <irogers@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > There have been requests for more sophisticated perf event sample
> > > > filtering based on the sample data. Recently the kernel added support
> > > > for BPF programs to access perf sample data, and this is the userspace
> > > > part to enable such filtering.
> > > >
> > > > This still has some rough edges and needs more improvements. But
> > > > I'd like to share the current work and get some feedback on the
> > > > direction and ideas for further improvements.
> > > >
> > > > The kernel changes are in the tip.git tree (perf/core branch) for now.
> > > > perf record has --filter option to set filters on the last specified
> > > > event in the command line. It worked only for tracepoints and Intel
> > > > PT events so far. This patchset extends it to have 'bpf:' prefix in
> > > > order to enable the general sample filters using BPF for any events.
> > > >
> > > > A new filter expression parser was added (using flex/bison) to process
> > > > the filter string. Right now, it only accepts very simple expressions
> > > > separated by comma. I'd like to keep the filter expression as simple
> > > > as possible.
> > > >
> > > > It requires samples to satisfy all the filter expressions; otherwise
> > > > it'd drop the sample. IOW filter expressions are connected with
> > > > logical AND operations implicitly.
> > > >
> > > > Essentially the BPF filter expression is:
> > > >
> > > > "bpf:" <term> <operator> <value> ("," <term> <operator> <value>)*
> > > >
> > > > The <term> can be one of:
> > > > ip, id, tid, pid, cpu, time, addr, period, txn, weight, phys_addr,
> > > > code_pgsz, data_pgsz, weight1, weight2, weight3, ins_lat, retire_lat,
> > > > p_stage_cyc, mem_op, mem_lvl, mem_snoop, mem_remote, mem_lock,
> > > > mem_dtlb, mem_blk, mem_hops
> > > >
> > > > The <operator> can be one of:
> > > > ==, !=, >, >=, <, <=, &
> > > >
> > > > The <value> can be one of:
> > > > <number> (for any term)
> > > > na, load, store, pfetch, exec (for mem_op)
> > > > l1, l2, l3, l4, cxl, io, any_cache, lfb, ram, pmem (for mem_lvl)
> > > > na, none, hit, miss, hitm, fwd, peer (for mem_snoop)
> > > > remote (for mem_remote)
> > > > na, locked (for mem_lock)
> > > > na, l1_hit, l1_miss, l2_hit, l2_miss, any_hit, any_miss, walk, fault (for mem_dtlb)
> > > > na, by_data, by_addr (for mem_blk)
> > > > hops0, hops1, hops2, hops3 (for mem_hops)
> > > >
> > > > I plan to improve it with range expressions, e.g. for ip or addr, and
> > > > it should support symbols like the existing addr-filters. Also cgroup
> > > > should understand and convert cgroup names to IDs.
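
To make the semantics quoted above concrete, here is a minimal userspace
sketch in Python of the "bpf:" expression form and the implicit logical AND.
It is only an illustration, not the flex/bison parser or the BPF program from
the patchset. The names parse_filter and sample_passes are hypothetical, only
numeric values are handled (no symbolic values such as load or l1), and
treating '&' as "bitwise AND is non-zero" is an assumption.

```python
import operator
import re

# Operators listed in the RFC. The '&' semantics here are an assumption.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "&": lambda a, b: (a & b) != 0,
}

# Two-character operators come first so ">=" is not read as ">".
EXPR_RE = re.compile(r"^\s*(\w+)\s*(==|!=|>=|<=|>|<|&)\s*(\w+)\s*$")


def parse_filter(text):
    """Split 'bpf:<term> <op> <value>,...' into (term, op, value) tuples."""
    if not text.startswith("bpf:"):
        raise ValueError("missing 'bpf:' prefix")
    exprs = []
    for part in text[len("bpf:"):].split(","):
        m = EXPR_RE.match(part)
        if m is None:
            raise ValueError(f"bad expression: {part!r}")
        term, op, value = m.groups()
        # Base 0 accepts both decimal and 0x-prefixed numbers.
        exprs.append((term, op, int(value, 0)))
    return exprs


def sample_passes(sample, exprs):
    """Keep the sample only if every expression holds (implicit AND)."""
    return all(OPS[op](sample[term], value) for term, op, value in exprs)
```

With this sketch, parse_filter("bpf: period > 1000, cpu == 2") yields two
expressions, and a sample dict such as {"period": 2000, "cpu": 1} is dropped
because the second expression fails.
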
>
> this seems similar to what ftrace is doing in filter_match_preds,
> I checked the code briefly and I wonder if we should be able to write
> that function logic in bpf, assuming that the filter is prepared in
> user space
>
> it might solve the 'part' data problem in generic way.. but I might be
> missing some blocker of course.. just an idea ;-)
>
> could replace the tracepoint filters.. if we actually care

I'm not sure about replacing tracepoint filters. IIRC BPF is optional,
so tracepoints should work without it. From BPF's perspective, it has
its own way of handling tracepoints, so there is no need to deal with
perf or event tracing (ftrace) for that.

From perf's perspective, I think it can use either the existing ftrace
filters or build a new BPF filter for each event. But it cannot use BTF
for perf tracepoint events, at least for now. Certainly it can use RAW
sample data and parse the event format to access the fields, but I'm
not sure it's worth doing that. :)
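
To illustrate what "parse the event format" would involve, here is a rough
userspace sketch in Python. It assumes the usual tracepoint format text with
"field:...; offset:N; size:N; signed:N;" lines and handles only integer
fields. The helper names and the example fields (my_pid, my_count) are made
up for illustration and do not come from a real event.

```python
import re

# One "field:" line of a tracepoint format description.
FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);"
    r"\s*size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)

# Hypothetical format text; real events have their own fields and offsets.
EXAMPLE_FORMAT = """
    field:int my_pid;  offset:8;  size:4;  signed:1;
    field:unsigned long my_count;  offset:16;  size:8;  signed:0;
"""


def parse_format(text):
    """Map field name -> (offset, size, signed) for each field line."""
    fields = {}
    for m in FIELD_RE.finditer(text):
        # The field name is the last token of the C-style declaration.
        name = m.group("decl").split()[-1]
        fields[name] = (
            int(m.group("offset")),
            int(m.group("size")),
            m.group("signed") == "1",
        )
    return fields


def read_int_field(raw, fields, name, byteorder="little"):
    """Extract one integer field from the raw sample bytes."""
    offset, size, signed = fields[name]
    return int.from_bytes(raw[offset:offset + size], byteorder, signed=signed)
```

A filter on a tracepoint field would then compare the value returned by
read_int_field() against the filter value. Doing the same thing inside a BPF
program for every event is the part I'm not sure is worth it.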

Thanks,
Namhyung