Re: [PATCH v2 0/5] Statsfs: a new ram-based file sytem for Linux kernel statistics

From: Jonathan Adams
Date: Thu May 07 2020 - 13:46:22 EST


On Mon, May 4, 2020 at 4:05 AM Emanuele Giuseppe Esposito
<eesposit@xxxxxxxxxx> wrote:
...
> Statsfs offers a generic and stable API, allowing any kind of
> directory/file organization and supporting multiple kind of aggregations
> (not only sum, but also average, max, min and count_zero) and data types
> (all unsigned and signed types plus boolean). The implementation, which is
> a generalization of KVMâs debugfs statistics code, takes care of gathering
> and displaying information at run time; users only need to specify the
> values to be included in each source.
>
> Statsfs would also be a different mountpoint from debugfs, and would not
> suffer from limited access due to the security lock down patches. Its main
> function is to display each statistics as a file in the desired folder
> hierarchy defined through the API. Statsfs files can be read, and possibly
> cleared if their file mode allows it.
>
> Statsfs has two main components: the public API defined by
> include/linux/statsfs.h, and the virtual file system which should end up
> in /sys/kernel/stats.

This is good work. As David Rientjes mentioned, I'm currently investigating
a similar project, based on a google-internal debugfs-based FS we call
"metricfs". It's
designed in a slightly different fashion than statsfs here is, and the
statistics exported are
mostly fed into our OpenTelemetry-like system. We're motivated by
wanting an upstreamed solution, so that we can upstream the metrics we
create that are of general interest, and lower the overall rebasing
burden for our tree.

Some feedback on your design as proposed:

- the 8/16/32/64 signed/unsigned integers seems like a wart, and the
built-in support to grab any offset from a structure doesn't seem like
much of an advantage. A simpler interface would be to just support an
"integer" (possibly signed/unsigned) type, which is always 64-bit, and
allow the caller to provide a function pointer to retrieve the value,
with one or two void *s cbargs. Then the framework could provide an
offset-based callback (or callbacks) similar to the existing
functionality, and a similar one for per-CPU based statistics. A
second "clear" callback could be optionally provided to allow for
statistics to be cleared, as in your current proposal.

- A callback-style interface also allows for a lot more flexibility
in sourcing values, and doesn't lock your callers into one way of
storing them. You would, of course, have to be clear about locking
rules etc. for the callbacks.

- Beyond the statistic's type, one *very* useful piece of metadata
for telemetry tools is knowing whether a given statistic is
"cumulative" (an unsigned counter which is only ever increased), as
opposed to a floating value (like "amount of memory used").

I agree with the folks asking for a binary interface to read
statistics, but I also agree that it can be added on later. I'm more
concerned with getting the statistics model and capabilities right
from the beginning, because those are harder to adjust later.

Would you be open to collaborating on the statsfs design? As
background for this discussion, here are some details of how our
metricfs implementation approaches statistics:

1. Each metricfs metric can have one or two string or integer "keys".
If these exist, they expand the metric from a single value into a
multi-dimensional table. For example, we use this to report a hash
table we keep of functions calling "WARN()", in a 'warnings'
statistic:

% cat .../warnings/values
x86_pmu_stop 1
%

Indicates that the x86_pmu_stop() function has had a WARN() fire once
since the system was booted. If multiple functions have fired
WARN()s, they are listed in this table with their own counts. [1] We
also use these to report per-CPU counters on a CPU-by-CPU basis:

% cat .../irq_x86/NMI/values
0 42
1 18
... one line per cpu
%

2. We also export some metadata about each statistic. For example,
the metadata for the NMI counter above looks like:

% cat .../NMI/annotations
DESCRIPTION Non-maskable\ interrupts
CUMULATIVE
% cat .../NMI/fields
cpu value
int int
%

(Describing the statistic, marking it as "cumulative", and saying the
fields are "cpu" and "value", both ints). The metadata doesn't change
much, so having separate files allows the user-space agent to read
them once and then the values multiple times.

3. We have a (very few) statistics where the value itself is a string,
usually for device statuses.

For our use cases, we generally don't both output a statistic and it's
aggregation from the kernel; either we sum up things in the kernel
(e.g. over a bunch of per-cpu or per-memcg counters) and only have the
result statistic, or we expect user-space to sum up the data if it's
interested. The tabular form makes it pretty easy to do so (i.e. you
can use awk(1) to sum all of the per-cpu NMI counters). We don't
generally reset statistics, except as a side effect of removing a
device.

Thanks again for the patchset, and for pointing out that KVM also
needs statistics sent out; it's great that there is interest in this.

Cheers,
- jonathan

P.S. I also have a couple (non-critical) high-level notes:
* It's not clear what tree your patches are against, or their
dependencies; I was able to get them to apply to linux-next master
with a little massaging, but then they failed to compile because
they're built on top of your "libfs: group and simplify linux fs code"
patch series you sent out in late april. Including a git link or at
least a baseline tree and a list of the patch series you rely upon is
helpful for anyone wanting to try out your changes.

* The main reason I was trying to try out your patches was to get a
sense of the set of directories and things the KVM example generates;
while it's apparently the same as the existing KVM debugfs tree, it's
useful to know how this ends up looking on a real system, and I'm not
familiar with the KVM stats. Since this patch is intended slightly
more broadly than just KVM, it might have been useful to include
sample output for those not familiar with how things are today.


[1] We also use this to export various network/storage statistics
on a per-device basis. e.g. network bytes received counts:

% cat .../rx_bytes/values
lo 501360681
eth0 1457631256
...
%