Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

From: Christian Brauner
Date: Mon May 06 2024 - 10:22:18 EST


On Mon, May 06, 2024 at 11:27:04AM +0200, Christian Brauner wrote:
> On Mon, May 06, 2024 at 10:45:35AM +0200, Christian Brauner wrote:
> > > The fact is, it's not dma-buf that is violating any rules. It's epoll.
> >
> > I agree that epoll() not taking a reference on the file is at least
> > unexpected and contradicts the usual code patterns for the sake of
> > performance and that it very likely is the case that most callers of
> > f_op->poll() don't know this.
> >
> > Note, I cleary wrote upthread that I'm ok to do it like you suggested
> > but raised two concerns a) there's currently only one instance of
> > prolonged @file lifetime in f_op->poll() afaict and b) that there's
> > possibly going to be some performance impact on epoll().
> >
> > So it's at least worth discussing what's more important because epoll()
> > is very widely used and it's not that we haven't favored performance
> > before.
> >
> > But you've already said that you aren't concerned with performance on
> > epoll() upthread. So afaict then there's really not a lot more to
> > discuss other than take the patch and see whether we get any complaints.
>
> Two closing thoughts:
>
> (1) I wonder if this won't cause userspace regressions for the semantics
> of epoll because dying files are now silently ignored whereas before
> they'd generated events.
>
> (2) The other part is that this seems to me that epoll() will now
> temporarly pin filesystems opening up the possibility for spurious
> EBUSY errors.
>
> If you register a file descriptor in an epoll instance and then
> close it and umount the filesystem but epoll managed to do an fget()
> on that fd before that close() call then epoll will pin that
> filesystem.
>
> If the f_op->poll() method does something that can take a while
> (blocks on a shared mutex of that subsystem) that umount is very
> likely going to return EBUSY suddenly.
>
> Afaict, before that this wouldn't have been an issue at all and is
> likely more serious than performance.
>
> (One option would be to only do epi_fget() for stuff like
> dma-buf that's never unmounted. That'll cover nearly every
> driver out there. Only "real" filesystems would have to contend with
> @file count going to zero but honestly they also deal with dentry
> lookup under RCU which is way more adventurous than this.)
>
> Maybe I'm barking up the wrong tree though.

Sorry, had to step out for an appointment.

Under the assumption that I'm not entirely off with this - and I really
could be ofc - then one possibility would be that we enable persistence
of files from f_op->poll() for SB_NOUSER filesystems.

That'll catch everything that's relying on anonymous inodes (drm and all
drivers) and init_pseudo() so everything that isn't actually unmountable
(pipefs, pidfs, sockfs, etc.).

So something like the _completely untested_ diff on top of your proposal
above:

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a3f0f868adc4..95968a462544 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1018,8 +1018,24 @@ static struct file *epi_fget(const struct epitem *epi)
static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
int depth)
{
- struct file *file = epi_fget(epi);
+ struct file *file = epi->ffd.file;
__poll_t res;
+ bool unrefd = false;
+
+ /*
+ * Taking a reference for anything that isn't mountable is fine
+ * because we don't have to worry about spurious EBUSY warnings
+ * from umount().
+ *
+ * File count might go to zero in f_op->poll() for mountable
+ * filesystems.
+ */
+ if (file->f_inode->i_sb->s_flags & SB_NOUSER) {
+ unrefd = true;
+ file = epi_fget(epi);
+ } else if (file_count(file) == 0) {
+ file = NULL;
+ }

/*
* We could return EPOLLERR | EPOLLHUP or something,
@@ -1034,7 +1050,9 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
res = vfs_poll(file, pt);
else
res = __ep_eventpoll_poll(file, pt, depth);
- fput(file);
+
+ if (unrefd)
+ fput(file);
return res & epi->event.events;
}

Basically, my worry is that we end up with really annoying to debug
EBUSYs caused by epoll(). I'd really like to avoid that. But again, I
might be wrong and this isn't an issue.