Re: [PATCH] mm: use only pidfd for process_madvise syscall

From: Christian Brauner
Date: Tue May 19 2020 - 03:45:32 EST


On Fri, May 15, 2020 at 06:20:55PM -0700, Minchan Kim wrote:
> Based on discussion[1], people didn't feel we need to support both
> pid and pidfd for every new coming API[2] so this patch keeps only
> pidfd. This patch also changes flags's type with "unsigned int".
> So finally, the API is as follows,
>
> ssize_t process_madvise(int pidfd, const struct iovec *iovec,
> unsigned long vlen, int advice, unsigned int flags);
>
> DESCRIPTION
> The process_madvise() system call is used to give advice or directions
> to the kernel about the address ranges from external process as well as
> local process. It provides the advice to address ranges of process
> described by iovec and vlen. The goal of such advice is to improve system
> or application performance.
>
> The pidfd selects the process referred to by the PID file descriptor
> specified in pidfd. (See pidfd_open(2) for further information)
>
> The pointer iovec points to an array of iovec structures, defined in
> <sys/uio.h> as:
>
> struct iovec {
> void *iov_base; /* starting address */
> size_t iov_len; /* number of bytes to be advised */
> };
>
> The iovec describes address ranges beginning at address(iov_base)
> and with size length of bytes(iov_len).
>
> The vlen represents the number of elements in iovec.
>
> The advice is indicated in the advice argument, which is one of the
> following at this moment if the target process specified by pidfd
> is external.
>
> MADV_COLD
> MADV_PAGEOUT
> MADV_MERGEABLE
> MADV_UNMERGEABLE
>
> Permission to provide a hint to external process is governed by a
> ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
>
> The process_madvise supports every advice madvise(2) has if target
> process is in same thread group with calling process so user could
> use process_madvise(2) to extend existing madvise(2) to support
> vector address ranges.
>
> RETURN VALUE
> On success, process_madvise() returns the number of bytes advised.
> This return value may be less than the total number of requested
> bytes, if an error occurred. The caller should check return value
> to determine whether a partial advice occurred.
>
> [1] https://lore.kernel.org/linux-mm/20200509124817.xmrvsrq3mla6b76k@wittgenstein/
> [2] https://lore.kernel.org/linux-mm/9d849087-3359-c4ab-fbec-859e8186c509@xxxxxxxxxxxxx/
> Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>

Thanks for the ping, Minchan, and sorry for not replying to this earlier.

Also, sorry that I delayed the patch, but this really seems like a much
cleaner API to me and feels less hackish. In general this patch seems
fine to me, but I have two comments below:

> ---
> mm/madvise.c | 42 +++++++++++++-----------------------------
> 1 file changed, 13 insertions(+), 29 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d3fbbe52d230..35c9b220146a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1229,8 +1229,8 @@ static int process_madvise_vec(struct task_struct *target_task,
> return ret;
> }
>
> -static ssize_t do_process_madvise(int which, pid_t upid, struct iov_iter *iter,
> - int behavior, unsigned long flags)
> +static ssize_t do_process_madvise(int pidfd, struct iov_iter *iter,
> + int behavior, unsigned int flags)
> {
> ssize_t ret;
> struct pid *pid;
> @@ -1241,26 +1241,12 @@ static ssize_t do_process_madvise(int which, pid_t upid, struct iov_iter *iter,
> if (flags != 0)
> return -EINVAL;
>
> - switch (which) {
> - case P_PID:
> - if (upid <= 0)
> - return -EINVAL;
> -
> - pid = find_get_pid(upid);
> - if (!pid)
> - return -ESRCH;
> - break;
> - case P_PIDFD:
> - if (upid < 0)
> - return -EINVAL;
> -
> - pid = pidfd_get_pid(upid);
> - if (IS_ERR(pid))
> - return PTR_ERR(pid);
> - break;
> - default:
> + if (pidfd < 0)
> return -EINVAL;

When a garbage file descriptor is passed, EBADF needs to be returned, not
EINVAL. That's the case with most APIs and also with pidfds, compare:

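#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* Every call below is handed -1 as the file descriptor. */
	if (syscall(__NR_openat, -1, "tmp", O_RDONLY) < 0)
		printf("%m - openat(-1)\n");

	if (syscall(__NR_fcntl, -1, F_GETFL) < 0)
		printf("%m - fcntl(-1)\n");

	if (syscall(__NR_dup, -1) < 0)
		printf("%m - dup(-1)\n");

	if (syscall(__NR_close, -1) < 0)
		printf("%m - close(-1)\n");

	/* pidfd_getfd() takes (pidfd, targetfd, flags); pass flags explicitly. */
	if (syscall(__NR_pidfd_getfd, -1, 0, 0) < 0)
		printf("%m - pidfd_getfd(-1)\n");

	if (syscall(__NR_pidfd_send_signal, -1, 0, NULL, 0) < 0)
		printf("%m - pidfd_send_signal(-1)\n");

	exit(EXIT_SUCCESS);
}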

which all give:

Bad file descriptor - openat(-1)
Bad file descriptor - fcntl(-1)
Bad file descriptor - dup(-1)
Bad file descriptor - close(-1)
Bad file descriptor - pidfd_getfd(-1)
Bad file descriptor - pidfd_send_signal(-1)

In addition, I have one more request/proposal. I really think that
consistent api naming is something we should consider so I'd propose: