Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps

From: Andrii Nakryiko
Date: Sat May 04 2024 - 17:51:34 EST


On Sat, May 4, 2024 at 4:24 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> >
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
>
> I don't have anything against adding a binary interface for this. But
> it's somewhat odd to do ioctls based on /proc files. I wonder if there
> isn't a more suitable place for this. prctl()? New vmstat() system call
> using a pidfd/pid as reference? ioctl() on fs/pidfs.c?

I did ioctl() on /proc/<pid>/maps because that's the file that's used
for the same use cases and it can be opened from other processes for
any target PID. I'm open to any suggestions that make more sense, this
v1 is mostly to start the conversation.

prctl() probably doesn't make sense, as according to man page:

prctl() manipulates various aspects of the behavior of the
calling thread or process.

And this facility is most often used from another (profiler or
symbolizer) process.

New syscall feels like an overkill, but if that's the only way, so be it.

I do like the idea of ioctl() on top of pidfd (I assume that's what
you mean by "fs/pidfs.c", right)? This seems most promising. One
question/nuance. If I understand correctly, pidfd won't hold
task_struct (and its mm_struct) reference, right? So if the process
exits, even if I have pidfd, that task is gone and so we won't be able
to query it. Is that right?

If yes, then it's still workable in a lot of situations, but it would
be nice to have an ability to query VMAs (at least for binary's own
text segments) even if the process exits. This is the case for
short-lived processes that profilers capture some stack traces from,
but by the time these stack traces are processed they are gone.

This might be a stupid idea and question, but what if ioctl() on pidfd
itself would create another FD that would represent mm_struct of that
process, and then we have ioctl() on *that* soft-of-mm-struct-fd to
query VMA. Would that work at all? This approach would allow
long-running profiler application to open pidfd and this other "mm fd"
once, cache it, and then just query it. Meanwhile we can epoll() pidfd
itself to know when the process exits so that these mm_structs are not
referenced for longer than necessary.

Is this pushing too far or you think that would work and be acceptable?

But in any case, I think ioctl() on top of pidfd makes total sense for
this, thanks.