Re: [PATCH] fs/proc: introduce /proc/stat2 file

From: Daniel Colascione
Date: Wed Nov 07 2018 - 10:42:27 EST


On Wed, Nov 7, 2018 at 10:03 AM, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton
> <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>> On Mon, 29 Oct 2018 23:04:45 +0000 Daniel Colascione <dancol@xxxxxxxxxx> wrote:
>>
>>> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso <dave@xxxxxxxxxxxx> wrote:
>>> > This patch introduces a new /proc/stat2 file that is identical to the
>>> > regular 'stat' except that it zeroes all hard irq statistics. The new
>>> > file is a drop in replacement to stat for users that need performance.
>>>
>>> For a while now, I've been thinking over ways to improve the
>>> performance of collecting various bits of kernel information. I don't
>>> think that a proliferation of special-purpose named bag-of-fields file
>>> variants is the right answer, because even if you add a few info-file
>>> variants, you're still left with a situation where a given file
>>> provides a particular caller with too little or too much information.
>>> I'd much rather move to a model in which userspace *explicitly* tells
>>> the kernel which fields it wants, with the kernel replying with just
>>> those particular fields, maybe in their raw binary representations.
>>> The ASCII-text bag-of-everything files would remain available for
>>> ad-hoc and non-performance critical use, but programs that cared about
>>> performance would have an efficient bypass. One concrete approach is
>>> to let users open up today's proc files and, instead of read(2)ing a
>>> text blob, use an ioctl to retrieve specified and targeted information
>>> of the sort that would normally be encoded in the text blob. Because
>>> callers would open the same file when using either the text or binary
>>> interfaces, little would have to change, and it'd be easy to implement
>>> fallbacks when a particular system doesn't support a particular
>>> fast-path ioctl.
>
> Please. Sysfs, with the one value per file rule, was created exactly
> for the purpose of eliminating these sort of problems with procfs. So
> instead of inventing special purpose interfaces for proc, just make
> the info available in sysfs, if not already available.

First of all, is sysfs even the right place? Some people, for whatever reason,
are extremely particular about the purposes of various virtual
filesystems. "No, sysfs is for exposing kernel objects, not
configuration!" is something I've heard more than once. Who's to say
that sysfs is for exposing /proc/pid/stat, which isn't a "kernel
object" itself? (A process is not its struct task.) More generally,
objections about APIs rooted in arcane kernel-internal considerations
about the purposes of various virtual filesystems --- procfs, sysfs,
debugfs, configfs --- make the userspace API worse, because they
enshrine implementation details (is this thing a kobject or not?) in
public API. If I had my way, we'd have continued putting *everything*
in procfs and just made procfs the "I want stuff from the kernel" API.
Nobody in userspace cares about these filesystem divisions.

Second, slurping from a sysfs-style setup in which there's one file
per piece of information creates massive overhead, because there's
currently no way to open multiple paths with one system call and no
way to read from multiple FDs with one system call. If you want this
kind of setup to work, you need some kind of batched openat-and-read
system call mechanism. I think a simple "get information from this
procfs FD" system call --- something like statx --- is both cleaner
and more efficient. Plus, without a batch operation, there's no way to
achieve atomicity. It's perfectly reasonable for userspace to request
some bits of information about a process and want these bits to be
consistent with each other. Now, a generic batched openat-and-read
would be good to add, but it's not enough, since it would still
have to go through VFS, create struct files, (probably) encode to
ASCII, and so on. Why should any system pay to do that much work when
the fields anyone might want could be obtained with a simple
copy_to_user?
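
To make the shape of that concrete, here is a rough userspace model of
what a statx-like masked query could look like. Everything in it is
hypothetical: the PROCQ_* bits, struct procq_result, struct fake_task and
procq_fill() are names made up for illustration, not an existing or
proposed kernel interface. The two counting helpers just spell out the
arithmetic: a naive one-file-per-value reader pays open + read + close
per field, while a masked query on an already-open FD pays one call.

```c
/* Illustrative sketch only: none of these names exist in the kernel. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical field-selection bits, in the spirit of statx(2)'s mask. */
#define PROCQ_UTIME   (1u << 0)
#define PROCQ_STIME   (1u << 1)
#define PROCQ_RSS     (1u << 2)
#define PROCQ_THREADS (1u << 3)

/* Hypothetical binary reply: only fields flagged in 'mask' are valid. */
struct procq_result {
	uint32_t mask;
	uint32_t num_threads;
	uint64_t utime;
	uint64_t stime;
	uint64_t rss_pages;
};

/* Stand-in for whatever per-process state the kernel would consult. */
struct fake_task {
	uint64_t utime;
	uint64_t stime;
	uint64_t rss_pages;
	uint32_t num_threads;
};

/*
 * Userspace model of the kernel side: look at one snapshot of the task
 * and copy out only the requested fields, with no text encoding.
 */
void procq_fill(const struct fake_task *t, uint32_t want,
		struct procq_result *out)
{
	assert(t != NULL && out != NULL);
	memset(out, 0, sizeof(*out));

	if (want & PROCQ_UTIME) {
		out->utime = t->utime;
		out->mask |= PROCQ_UTIME;
	}
	if (want & PROCQ_STIME) {
		out->stime = t->stime;
		out->mask |= PROCQ_STIME;
	}
	if (want & PROCQ_RSS) {
		out->rss_pages = t->rss_pages;
		out->mask |= PROCQ_RSS;
	}
	if (want & PROCQ_THREADS) {
		out->num_threads = t->num_threads;
		out->mask |= PROCQ_THREADS;
	}
}

/* One file per value, read naively: open + read + close for each field. */
unsigned int syscalls_file_per_value(unsigned int nfields)
{
	return 3 * nfields;
}

/* Masked query on an already-open FD: one call however many fields. */
unsigned int syscalls_masked_query(unsigned int nfields)
{
	(void)nfields;
	return 1;
}
```

A caller that wants only utime and RSS would pass
PROCQ_UTIME | PROCQ_RSS and check the returned mask to see which fields
were actually filled in. Since all the fields come from the same
snapshot in a single call, they are consistent with each other, which
is the atomicity point above.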

Third, and finally, a sysfs-style tree for processes doesn't currently
exist. Would you propose having *two* *different* representations of
the process list as virtual filesystems? That's another pointless
exposure of internal kernel divisions in the user API. We already have
procfs. Let's just make it better.