Re: "statsfs" API design

From: Greg Kroah-Hartman
Date: Sat Nov 09 2019 - 10:50:10 EST


On Wed, Nov 06, 2019 at 04:56:25PM +0100, Paolo Bonzini wrote:
> Hi all,
>
> statsfs is a proposal for a new Linux kernel synthetic filesystem, to be
> mounted in /sys/kernel/stats, which exposes subsystem-level statistics
> in sysfs. Reading need not be particularly lightweight, but writing
> must be fast. Therefore, statistics are gathered at a fine-grain level
> in order to avoid locking or atomic operations, and then aggregated by
> statsfs until the desired granularity.

Wait, reading a statistic from userspace can be slow, but writing to it
from userspace has to be fast? Or do you mean the speed is all for
reading/writing the value within the kernel?

> The first user of statsfs would be KVM, which is currently exposing its
> stats in debugfs. However, debugfs access is now limited by the
> security lock down patches, and in addition statsfs aims to be a
> more-or-less stable API, hence the idea of making it a separate
> filesystem and mount point.

Nice, I've had people ask about something like this for a while now.
For the most part they just dump stuff in sysfs instead (see the DRM
patches recently for people attempting to do that for debugfs values as
well.)

> A few people have already expressed interest in this. Christian
> Borntraeger presented on the kvm_stat tool recently at KVM Forum and was
> also thinking about using some high-level API in debugfs. Google has
> KVM patches to gather statistics in a binary format; it may be useful to
> add this kind of functionality (and some kind of introspection similar
> to what tracing does) to statsfs too in the future, but this is
> independent from the kernel API. I'm also CCing Alex Williamson, in
> case VFIO is interested in something similar, and Steven Rostedt because
> apparently he has enough free time to write poetry in addition to code.
>
> There are just two concepts in statsfs, namely "values" (aka files) and
> "sources" (directories).
>
> A value represents a single quantity that is gathered by the statsfs
> client. It could be the number of vmexits of a given kind, the amount
> of memory used by some data structure, the length of the longest hash
> table chain, or anything like that.
>
> Values are described by a struct like this one:
>
> struct statsfs_value {
> const char *name;
> enum stat_type type; /* STAT_TYPE_{BOOL,U64,...} */
> u16 aggr_kind; /* Bitmask with zero or more of
> * STAT_AGGR_{MIN,MAX,SUM,...}
> */
> u16 mode; /* File mode */
> int offset; /* Offset from base address
> * to field containing the value
> */
> };
>
> As you can see, values are basically integers stored somewhere in a
> struct. The statsfs_value struct also includes information on which
> operations (for example sum, min, max, average, count nonzero) it makes
> sense to expose when the values are aggregated.

What can userspace do with that info?

> Sources form the bulk of the statsfs API. They can include two kinds of
> elements:
>
> - values as described above. The common case is to have many values
> with the same base address, which are represented by an array of struct
> statsfs_value
>
> - subordinate sources
>
> Adding a subordinate source has two effects:
>
> - it creates a subdirectory for each subordinate source
>
> - for each value in the subordinate sources which has aggr_kind != 0,
> corresponding values will be created in the parent directory too. If
> multiple subordinate sources are backed by the same array of struct
> statsfs_value, values from all those sources will be aggregated. That
> is, statsfs will compute these from the values of all items in the list
> and show them in the parent directory.
>
> Writable values can only be written with a value of zero. Writing zero
> to an aggregate zeroes all the corresponding values in the subordinate
> sources.
>
> Sources are manipulated with these four functions:
>
> struct statsfs_source *statsfs_source_create(const char *fmt,
> ...);
> void statsfs_source_add_values(struct statsfs_source *source,
> struct statsfs_value *stat,
> int n, void *ptr);
> void statsfs_source_add_subordinate(
> struct statsfs_source *source,
> struct statsfs_source *sub);
> void statsfs_source_remove_subordinate(
> struct statsfs_source *source,
> struct statsfs_source *sub);
>
> Sources are reference counted, and for this reason there is also a pair
> of functions in the usual style:
>
> void statsfs_source_get(struct statsfs_source *);
> void statsfs_source_put(struct statsfs_source *);
>
> Finally,
>
> void statsfs_source_register(struct statsfs_source *source);
>
> lets you create a toplevel statsfs directory.
>
> As a practical example, KVM's usage of debugfs could be replaced by
> something like this:
>
> /* Globals */
> struct statsfs_value vcpu_stats[] = ...;
> struct statsfs_value vm_stats[] = ...;
> static struct statsfs_source *kvm_source;
>
> /* On module creation */
> kvm_source = statsfs_source_create("kvm");
> statsfs_source_register(kvm_source);
>
> /* On VM creation */
> kvm->src = statsfs_source_create("%d-%d\n",
> task_pid_nr(current), fd);
> statsfs_source_add_values(kvm->src, vm_stats,
> ARRAY_SIZE(vm_stats),
> &kvm->stats);
> statsfs_source_add_subordinate(kvm_source, kvm->src);
>
> /* On vCPU creation */
> vcpu_src = statsfs_source_create("vcpu%d\n", vcpu->vcpu_id);
> statsfs_source_add_values(vcpu_src, vcpu_stats,
> ARRAY_SIZE(vcpu_stats),
> &vcpu->stats);
> statsfs_source_add_subordinate(kvm->src, vcpu_src);
> /*
> * No need to keep the vcpu_src around since there's no
> * separate vCPU deletion event; rely on refcount
> * exclusively.
> */
> statsfs_source_put(vcpu_src);
>
> /* On VM deletion */
> statsfs_source_remove_subordinate(kvm_source, kvm->src);
> statsfs_source_put(kvm->src);
>
> /* On KVM exit */
> statsfs_source_put(kvm_source);
>
> How does this look?

Where does the actual values get changed that get reflected in the
filesystem?

I have some old notes somewhere about what people really want when it
comes to a good "statistics" datatype, that I was thinking of building
off of, but that seems independant of what you are doing here, right?
This is just exporting existing values to userspace in a semi-sane way?

Anyway, I like the idea, but what about how this is exposed to
userspace? The criticism of sysfs for statistics is that it is too slow
to open/read/close lots of files and tough to get "at this moment in
time these are all the different values" snapshots easily. How will
this be addressed here?

thanks,

greg k-h