Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

From: Al Viro
Date: Sun May 05 2024 - 15:46:28 EST


On Sat, May 04, 2024 at 08:53:47AM -0700, Linus Torvalds wrote:

> poll_wait
> -> __pollwait
> -> get_file (*boom*)
>
> but the boom is very small because the poll_wait() will be undone by
> poll_freewait(), and normally poll/select has held the file count
> elevated.

Not quite. It's not that poll_wait() calls __pollwait(); it calls
whatever callback that caller of ->poll() has set for it.

__pollwait users (select(2) and poll(2), currently) must (and do) make
sure that refcount is elevated; others (and epoll is not the only one)
need to do whatever's right for their callbacks.

I've no problem with having epoll grab a reference, but if we make that
a universal requirement ->poll() instances can rely upon, we'd better
verify that *all* vfs_poll() are OK. And that ought to go into
Documentation/filesystems/porting.rst ("callers of vfs_poll() must
make sure that file is pinned; ->poll() instances may rely upon that,
but they'd better be very careful about grabbing extra references themselves -
it's acceptable for files on internal mounts, but *NOT* for anything on
mountable filesystems. Any instance that does it needs an explicit
comment telling not to reuse that blindly." or something along those
lines).

Excluding epoll, select/poll and callers that have just done fdget() and will
do fdput() after vfs_poll(), we have this:

drivers/vhost/vhost.c:213: mask = vfs_poll(file, &poll->table);
vhost_poll_start(). Might get interesting... Calls working
with vq->kick as file seem to rely upon vq->mutex, but I'll need to
refresh my memories of that code to check if that's all we need - and
then there's vhost_net_enable_vq(), which also needs an audit.

fs/aio.c:1738: mask = vfs_poll(req->file, &pt) & req->events;
fs/aio.c:1932: mask = vfs_poll(req->file, &apt.pt) & req->events;
aio_poll() and aio_poll_wake() resp. req->file here is actually ->ki_filp
of iocb that contains work as iocb->req.work; it get dropped only in
iocb_destroy(), which also frees iocb. Any call that might've run into
req->file not pinned is already in UAF land.

io_uring/poll.c:303: req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events;
io_uring/poll.c:622: mask = vfs_poll(req->file, &ipt->pt) & poll->events;
Should have req->file pinned, but I'll need to RTFS a bit for
details. That, or ask Jens to confirm...

net/9p/trans_fd.c:236: ret = vfs_poll(ts->rd, pt);
net/9p/trans_fd.c:238: ret = (ret & ~EPOLLOUT) | (vfs_poll(ts->wr, pt) & ~EPOLLIN);
p9_fd_poll(); ->rd and ->wr are pinned and won't get dropped until
p9_fd_close(), which frees ts immediately afterwards. IOW, if we risk
being called with ->rd or ->wr not pinned, we are in UAF land already.
Incidentally, what the hell is this in p9_fd_open()?
* It's technically possible for userspace or concurrent mounts to
* modify this flag concurrently, which will likely result in a
* broken filesystem. However, just having bad flags here should
* not crash the kernel or cause any other sort of bug, so mark this
* particular data race as intentional so that tooling (like KCSAN)
* can allow it and detect further problems.
*/
Why not simply fix the race instead? As in
spin_lock(&ts->rd->f_lock);
ts->rd->f_flags |= O_NONBLOCK;
spin_unlock(&ts->rd->f_lock);
and similar for ts->wr? Sigh...