Re: [PATCH v3 0/2] kstats: kernel metric collector

From: Luigi Rizzo
Date: Wed Feb 26 2020 - 12:26:37 EST


[this reply also addresses comments from Alexei and Peter]

On Wed, Feb 26, 2020 at 7:00 AM Toke HÃiland-JÃrgensen <toke@xxxxxxxxxx> wrote:
>
> Luigi Rizzo <lrizzo@xxxxxxxxxx> writes:
>
> > This patchset introduces a small library to collect per-cpu samples and
> > accumulate distributions to be exported through debugfs.
> >
> > This v3 series addresses some initial comments (mostly style fixes in the
> > code) and revises commit logs.
>
> Could you please add a proper changelog spanning all versions of the
> patch as you iterate?

Will do (v2->v3 was just a removal of stray Change-Id from the log messages)

> As for the idea itself; picking up this argument you made on v1:
>
> > The tracepoint/kprobe/kretprobe solution is much more expensive --
> > from my measurements, the hooks that invoke the various handlers take
> > ~250ns with hot cache, 1500+ns with cold cache, and tracing an empty
> > function this way reports 90ns with hot cache, 500ns with cold cache.
>
> I think it would be good if you could include an equivalent BPF-based
> implementation of your instrumentation example so people can (a) see the
> difference for themselves and get a better idea of how the approaches
> differ in a concrete case and (b) quantify the difference in performance
> between the two implementations.

At the moment, a bpf version is probably beyond my skills and goals,
but I hope the
following comments can clarify the difference in approach/performance:

- my primary goal, implemented in patch 1/2, is to have this code embedded in
the kernel, _always available_ , even to users without the skills to
hack up the
necessary bpf code, or load a bpf program (which may not be allowed in
certain environments), and eventually replace and possibly improve custom
variants of metric collections which we already have (or wished to,
but don't have
because there wasn't a convenient library to use for them).

- I agree that this code can be recompiled in bpf (using a
BPF_MAP_TYPE_PERCPU_ARRAY for storage, and kstats_record() and
ks_show_entry() should be easy to convert).

- the runtime cost and complexity of hooking bpf code is still a bit
unclear to me.
kretprobe or tracepoints are expensive, I suppose that some lean hook
replace register_kretprobe() may exist and the difference from inline
annotations would be marginal (we'd still need to put in the hooks around
the code we want to time, though, so it wouldn't be a pure bpf solution).
Any pointers to this are welcome; Alexei mentioned fentry/fexit and
bpf trampolines, but I haven't found an example that lets me do something
equivalent to kretprobe (take a timestamp before and one after a function
without explicit instrumentation)

- I still see some huge differences in usability, and this is in my opinion
one very big difference between the two approaches. The systems where data
collection may be of interest are not necessarily accessible to developers
with the skills to write custom bpf code, or load bpf modules
(security policies
may prevent that). One thing is to tell a sysadmin to run
"echo trace foo > /sys/kernel/debug/kstats/_config"
or "watch grep CPUS /sys/kernel/debug/kstats/bar",
another one is to tell them to load a bpf program (or write their own one).

thanks for the feedback
luigi