Re: "statsfs" API design

From: Alexey Dobriyan
Date: Sun Nov 10 2019 - 10:34:32 EST


On Sun, Nov 10, 2019 at 10:14:35AM +0100, Greg KH wrote:
> On Sat, Nov 09, 2019 at 09:44:41PM +0300, Alexey Dobriyan wrote:
> > > statsfs is a proposal for a new Linux kernel synthetic filesystem,
> > > to be mounted in /sys/kernel/stats
> >
> > I think /proc experiment teaches pretty convincingly that dressing
> > things into a filesystem can be done but ultimately is a stupid idea.
> > It adds so much overhead for small-to-medium systems.
> >
> > > The first user of statsfs would be KVM, which is currently exposing
> > > its stats in debugfs
> >
> > > Google has KVM patches to gather statistics in a binary format
> >
> > Which is a right thing to do.
>
> It's always "simpler" to just take binary data and suck it in.

Faster too!

> That works for a year or so until another value needs to be supported.
> Or removed. Or features are backported.
>
> The reason text values in individual files work is they are "self
> describable" and "self discoverable".

Untrue. Applications always know what the data means, by definition:

"0x42"

What is this? A 4-char NUL-terminated string? Or the integer 66? Or a
4-byte nonce blob for some kind of crypto algorithm?
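
To make the ambiguity concrete, here is a minimal sketch of the three
readings. The function names are made up for illustration; only the
standard C library is used.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* The same four bytes, '0' 'x' '4' '2', read three different ways. */
static const char raw[4] = { '0', 'x', '4', '2' };

/* Reading 1: a 4-character string (a NUL is appended for the C API). */
size_t as_string_length(void)
{
	char s[5];

	memcpy(s, raw, sizeof(raw));
	s[4] = '\0';
	return strlen(s);
}

/* Reading 2: an integer written in hexadecimal. */
long as_integer(void)
{
	char s[5];

	memcpy(s, raw, sizeof(raw));
	s[4] = '\0';
	return strtol(s, NULL, 0);	/* base 0 honours the "0x" prefix */
}

/* Reading 3: an opaque 4-byte blob, copied verbatim. */
void as_blob(uint8_t out[4])
{
	assert(out != NULL);
	memcpy(out, raw, sizeof(raw));
}
```

Nothing in the bytes themselves says which reading is intended; the
application has to know.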

In the other direction: describe every field of the /proc/*/stat file
without looking at the manpage:

$ cat /proc/self/stat
5349 (cat) R 5342 5349 5342 34826 5349 4210688 91 0 0 0 0 0 0 0 20 0 1 0 864988 9183232 184 18446744073709551615 94352028622848 94352028651936 140733810522864 0 0 0 0 0 0 0 0 0 17 5 0 0 0 0 0 94352030751824 94352030753376 94352060055552 140733810527527 140733810527547 140733810527547 140733810532335 0

> You "know" what the value is and
> that it is supported because the file is there or not. With binary
> values in a single file you do not know any of that.

You _always_ know that.

> So you need some way of describing the data to userspace in order for
> this to work properly over the next 20+ years.
>
> Maybe something like varlink which describes the data coming from the
> kernel in an easy-to-handle format? Or something else, but just using
> blobs does not work over the long-term, sorry.

Text doesn't work either. Why do you think /proc/*/maps has 1 space
character at the end of anon mappings? Because someone fucked up.
Here is how people use /proc in the field:

https://stackoverflow.com/questions/3596781/how-to-detect-if-the-current-process-is-being-run-by-gdb

open /proc/*/status
read
strstr("TracerPid:")
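
Spelled out in C, that pattern looks roughly like the sketch below. The
function names and the 4096-byte buffer are assumptions made for
illustration, not taken from the linked answer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Naive parse: take the first occurrence of "TracerPid:" anywhere in
 * the text and convert what follows. Returns the pid, or -1 if the
 * key is absent.
 */
long parse_tracer_pid(const char *status_text)
{
	const char *p;

	assert(status_text != NULL);
	p = strstr(status_text, "TracerPid:");
	if (!p)
		return -1;
	/* strtol() skips the tab that follows the colon. */
	return strtol(p + strlen("TracerPid:"), NULL, 10);
}

/* open, read, close: three system calls for one integer. */
long read_tracer_pid(void)
{
	char buf[4096];		/* assumed to be large enough */
	ssize_t n;
	int fd;

	fd = open("/proc/self/status", O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n <= 0)
		return -1;
	buf[n] = '\0';
	return parse_tracer_pid(buf);
}
```

Note that the substring search has no idea where a field starts: if
"TracerPid:" ever appeared earlier in the buffer (for example inside a
process name), this code would return the wrong number.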

I'd humbly suggest defining the minimum amount of work for the task:
* some kind of percpu loop to gather stats
* some kind of accumulation code, possibly with min/max/avg
* write clear data
* copy_to_user

and realise that everything else is a waste of electricity, namely,

* pathname allocation (4KB)
* VFS '/' split, lookups ("/sys/kernel/..." means 3+ lookups)
* 192 bytes for each dentry
* 550+ bytes per inode
* 3 system calls per act of gathering statistics
  (userspace will be written in the most stupid way possible,
  without openat() etc)
* userspace snprintf() for pathname
* kernel space snprintf() somewhere
* multiple copying inside kernel (vsnprintf.c)
* general inability for userspace to estimate the amount of data in decimal
  (nobody does that), so nicely sized buffers of 4K or 1K or 16KB (bash)
  will be used, which is a waste.
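
For contrast with the costs listed above, the four-step minimum (percpu
loop, accumulation, write, copy) can be sketched as a userspace
simulation. Everything here is hypothetical: the record layout, the
fixed CPU count and the function names are made up for illustration,
and memcpy() merely stands in for copy_to_user().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS_SKETCH 4	/* hypothetical fixed CPU count */

/* Hypothetical fixed-layout record handed to userspace. */
struct stat_record {
	uint64_t sum;
	uint64_t min;
	uint64_t max;
	uint64_t avg;
};

/* Steps 1 and 2: walk the per-CPU counters and accumulate. */
void gather(const uint64_t percpu[NR_CPUS_SKETCH], struct stat_record *rec)
{
	int cpu;

	assert(percpu != NULL && rec != NULL);
	rec->sum = 0;
	rec->min = UINT64_MAX;
	rec->max = 0;
	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		uint64_t v = percpu[cpu];

		rec->sum += v;
		if (v < rec->min)
			rec->min = v;
		if (v > rec->max)
			rec->max = v;
	}
	rec->avg = rec->sum / NR_CPUS_SKETCH;
}

/*
 * Steps 3 and 4: one copy of the finished record into the caller's
 * buffer. Returns the number of bytes copied, or 0 if the buffer is
 * too small.
 */
size_t emit(const struct stat_record *rec, void *buf, size_t len)
{
	if (len < sizeof(*rec))
		return 0;
	memcpy(buf, rec, sizeof(*rec));
	return sizeof(*rec);
}
```

The caller knows the size up front (sizeof(struct stat_record), 32
bytes here), so there is no pathname to build, no decimal to parse and
no guessing at buffer sizes.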