Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

From: Daniel Colascione
Date: Tue Mar 19 2019 - 18:52:09 EST


On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@xxxxxxxxxx> wrote:
> So I dislike the idea of allocating new inodes from the procfs super
> block. I would like to avoid pinning the whole pidfd concept exclusively
> to proc. The idea is that the pidfd API will be useable through procfs
> via open("/proc/<pid>") because that is what users expect and really
> wanted to have for a long time. So it makes sense to have this working.
> But it should really be useable without it. That's why translate_pid()
> and pidfd_clone() are on the table. What I'm saying is, once the pidfd
> api is "complete" you should be able to set CONFIG_PROCFS=N - even
> though that's crazy - and still be able to use pidfds. This is also a
> point akpm asked about when I did the pidfd_send_signal work.

I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One
crazy idea that I was discussing with Joel the other day is to just
make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root()
system call that returned, out of thin air and independent of the
mount table, a procfs root directory file descriptor for the caller's
PID namespace and suitable for use with openat(2).
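
As a rough sketch of the calling pattern: get_procfs_root() does not
exist, so the helper below is a hypothetical stand-in that still goes
through the mount table by opening "/proc". Only the openat(2) part
reflects what a caller would do with the returned descriptor. The
function names are made up for illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the proposed get_procfs_root(). The real
 * proposal would not depend on "/proc" being mounted; this fallback does.
 */
static int procfs_root_fallback(void)
{
    return open("/proc", O_PATH | O_DIRECTORY | O_CLOEXEC);
}

/*
 * Read a procfs file named relative to root_fd into buf and
 * NUL-terminate it. Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_proc_file(int root_fd, const char *relpath,
                              char *buf, size_t len)
{
    if (len == 0)
        return -1;

    int fd = openat(root_fd, relpath, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, len - 1);
    close(fd);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}
```

A caller would hold on to the root descriptor and pass paths such as
"self/comm" or "<pid>/status" to read_proc_file(), never spelling out
an absolute "/proc/..." path.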

C'mon: /proc is used by everyone today and almost every program breaks
if it's not around. The string "/proc" is already de facto kernel ABI.
Let's just drop the pretense of /proc being optional and bake it into
the kernel proper, then give programs a way to get to /proc that isn't
tied to any particular mount configuration. This way, we don't need a
translate_pid(), since callers can just use procfs to do the same
thing. (That is, if I understand correctly what translate_pid does.)
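
One way procfs already exposes this kind of information is the
"NSpid:" line in /proc/<pid>/status, which lists the process's PID in
each nested PID namespace, outermost first. The sketch below parses
that line from status text that has already been read. Whether this
covers everything translate_pid() is meant to do is an assumption,
and parse_nspid() is a made-up helper name.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Extract the PIDs listed on the "NSpid:" line of a
 * /proc/<pid>/status buffer. Stores at most max entries in out and
 * returns how many were stored (0 if the line is absent).
 */
static size_t parse_nspid(const char *status, pid_t *out, size_t max)
{
    const char *p = status;

    /* Walk line by line until one starts with "NSpid:". */
    while (p && *p) {
        if (strncmp(p, "NSpid:", 6) == 0)
            break;
        p = strchr(p, '\n');
        if (p)
            p++;
    }
    if (!p || !*p)
        return 0;
    p += 6;

    size_t n = 0;
    while (n < max) {
        /* Skip separators, but never run past the end of this line. */
        while (*p == ' ' || *p == '\t')
            p++;
        if (*p == '\n' || *p == '\0')
            break;

        char *end;
        long v = strtol(p, &end, 10);
        if (end == p)
            break;
        out[n++] = (pid_t)v;
        p = end;
    }
    return n;
}
```

The last entry is the PID as the process sees itself in its own
namespace; the first is the PID in the namespace of the procfs
instance being read.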

We still need a pidfd_clone() for atomicity reasons, but that's a
separate story. My goal is to be able to write a library that
transparently creates and manages a helper child process even in a
"hostile" process environment in which some other uncoordinated thread
is constantly doing a waitpid(-1) (e.g., the JVM).
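
To make the "hostile" environment concrete, here is a minimal sketch
of the failure. For determinism it is single-threaded: the waitpid(-1)
loop stands in for the uncoordinated thread, and the second waitpid()
stands in for the library trying to collect its own helper. The
function name is made up for illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a helper child, let an unrelated waitpid(-1) reap it, then try
 * to wait for it by PID. Returns 0 if the library still got the exit
 * status, the errno from waitpid() if it did not, or -1 if fork failed.
 */
static int lost_child_errno(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(7);               /* helper child: exit immediately */

    /* Stand-in for the other thread: reap whatever child exits. */
    pid_t reaped;
    do {
        reaped = waitpid(-1, NULL, 0);
    } while ((reaped > 0 && reaped != child) ||
             (reaped < 0 && errno == EINTR));

    /* The library now asks for its own child, which is already gone. */
    int status;
    if (waitpid(child, &status, 0) == child)
        return 0;
    return errno;
}
```

The library's wait fails with ECHILD and the exit status is lost. Worse,
once the PID has been reaped it can be reused, so a later kill(child,
...) could hit an unrelated process.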

> So instead of going through proc we should probably do what David has
> been doing in the mount API and come to rely on anon_inode. So
> something like:
>
> fd = anon_inode_getfd("pidfd", &pidfd_fops, file_priv_data, flags);
>
> and stash information such as pid namespace etc. in a pidfd struct or
> something that we then can stash in file->private_data of the new file.
> This also lets us avoid all this open coding done here.
> Another advantage is that anon_inodes is its own kernel-internal
> filesystem.

Sure. That works too.