Re: [locks] 6d390e4b5d: will-it-scale.per_process_ops -96.6% regression

From: Jeff Layton
Date: Fri Mar 13 2020 - 21:11:41 EST


On Fri, 2020-03-13 at 09:19 +1100, NeilBrown wrote:
> On Thu, Mar 12 2020, Jeff Layton wrote:
>
> > On Thu, 2020-03-12 at 15:42 +1100, NeilBrown wrote:
> > > On Wed, Mar 11 2020, Linus Torvalds wrote:
> > >
> > > > On Wed, Mar 11, 2020 at 3:22 PM NeilBrown <neilb@xxxxxxx> wrote:
> > > > > We can combine the two ideas - move the list_del_init() later, and still
> > > > > protect it with the wq locks. This avoids holding the lock across the
> > > > > callback, but provides clear atomicity guarantees.
> > > >
> > > > Ugh. Honestly, this is disgusting.
> > > >
> > > > Now you re-take the same lock in immediate succession for the
> > > > non-callback case. It's just hidden.
> > > >
> > > > And it's not like the list_del_init() _needs_ the lock (it's not
> > > > currently called with the lock held).
> > > >
> > > > So that "hold the lock over list_del_init()" seems to be horrendously
> > > > bogus. It's only done as a serialization thing for that optimistic
> > > > case.
> > > >
> > > > And that optimistic case doesn't even *want* that kind of
> > > > serialization. It really just wants a "I'm done" flag.
> > > >
> > > > So no. Don't do this. It's mis-using the lock in several ways.
> > > >
> > > > Linus
> > >
> > > It seems that test_and_set_bit_lock() is the preferred way to handle
> > > flags when memory ordering is important, and I can't see how to use that
> > > well with an "I'm done" flag. I can make it look OK with an "I'm
> > > detaching" flag. Maybe this is better.
> > >
> > > NeilBrown
> > >
> > > From f46db25f328ddf37ca9fbd390c6eb5f50c4bd2e6 Mon Sep 17 00:00:00 2001
> > > From: NeilBrown <neilb@xxxxxxx>
> > > Date: Wed, 11 Mar 2020 07:39:04 +1100
> > > Subject: [PATCH] locks: restore locks_delete_lock optimization
> > >
> > > A recent patch (see Fixes: below) removed an optimization that is
> > > important because it avoids taking a lock in a common case.
> > >
> > > The comment justifying the optimization was correct as far as it went:
> > > if the tests succeeded, then the values would remain stable and the
> > > test result would remain valid even without a lock.
> > >
> > > However, after the test succeeds, the lock can be freed while some other
> > > thread might have only just set ->blocker to NULL (thus allowing the
> > > test to succeed) but has not yet called wake_up() on the wq in the lock.
> > > If the wake_up() happens after the lock is freed, a use-after-free
> > > error occurs.
> > >
> > > This patch restores the optimization and adds a flag to ensure this
> > > use-after-free is avoided. The use happens only when the flag is set,
> > > and the free doesn't happen until the flag has been cleared, or we have
> > > taken blocked_lock_lock.
> > >
> > > Fixes: 6d390e4b5d48 ("locks: fix a potential use-after-free problem when wakeup a waiter")
> > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > ---
> > > fs/locks.c | 44 ++++++++++++++++++++++++++++++++++++++------
> > > include/linux/fs.h | 3 ++-
> > > 2 files changed, 40 insertions(+), 7 deletions(-)
> > >
> >
> > Just a note that I'm traveling at the moment, and won't be able to do
> > much other than comment on this for a few days.
> >
> > > diff --git a/fs/locks.c b/fs/locks.c
> > > index 426b55d333d5..334473004c6c 100644
> > > --- a/fs/locks.c
> > > +++ b/fs/locks.c
> > > @@ -283,7 +283,7 @@ locks_dump_ctx_list(struct list_head *list, char *list_type)
> > > struct file_lock *fl;
> > >
> > > list_for_each_entry(fl, list, fl_list) {
> > > - pr_warn("%s: fl_owner=%p fl_flags=0x%x fl_type=0x%x fl_pid=%u\n", list_type, fl->fl_owner, fl->fl_flags, fl->fl_type, fl->fl_pid);
> > > + pr_warn("%s: fl_owner=%p fl_flags=0x%lx fl_type=0x%x fl_pid=%u\n", list_type, fl->fl_owner, fl->fl_flags, fl->fl_type, fl->fl_pid);
> > > }
> > > }
> > >
> > > @@ -314,7 +314,7 @@ locks_check_ctx_file_list(struct file *filp, struct list_head *list,
> > > list_for_each_entry(fl, list, fl_list)
> > > if (fl->fl_file == filp)
> > > pr_warn("Leaked %s lock on dev=0x%x:0x%x ino=0x%lx "
> > > - " fl_owner=%p fl_flags=0x%x fl_type=0x%x fl_pid=%u\n",
> > > + " fl_owner=%p fl_flags=0x%lx fl_type=0x%x fl_pid=%u\n",
> > > list_type, MAJOR(inode->i_sb->s_dev),
> > > MINOR(inode->i_sb->s_dev), inode->i_ino,
> > > fl->fl_owner, fl->fl_flags, fl->fl_type, fl->fl_pid);
> > > @@ -736,10 +736,13 @@ static void __locks_wake_up_blocks(struct file_lock *blocker)
> > > waiter = list_first_entry(&blocker->fl_blocked_requests,
> > > struct file_lock, fl_blocked_member);
> > > __locks_delete_block(waiter);
> > > - if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
> > > - waiter->fl_lmops->lm_notify(waiter);
> > > - else
> > > - wake_up(&waiter->fl_wait);
> > > + if (!test_and_set_bit_lock(FL_DELETING, &waiter->fl_flags)) {
> > > + if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
> > > + waiter->fl_lmops->lm_notify(waiter);
> > > + else
> > > + wake_up(&waiter->fl_wait);
> > > + clear_bit_unlock(FL_DELETING, &waiter->fl_flags);
> > > + }
> >
> > I *think* this is probably safe.
> >
> > AIUI, when you use atomic bitops on a flag word like this, you should
> > use them for all modifications to ensure that your changes don't get
> > clobbered by another task racing in to do a read/modify/write cycle on
> > the same word.
> >
> > I haven't gone over all of the places where fl_flags is changed, but I
> > don't see any at first glance that do it on a blocked request.
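> >
> > For example (a hypothetical sketch, not code from the patch), a plain
> > RMW on the word could clobber a concurrent atomic bitop:
> >
> > 	/* Task A: non-atomic read-modify-write on the flag word */
> > 	fl->fl_flags |= FL_SLEEP;	/* load, OR, store */
> >
> > 	/* Task B: atomic bitop that runs between A's load and store;
> > 	 * SOME_BIT is a made-up bit number, just for illustration */
> > 	set_bit(SOME_BIT, &fl->fl_flags);
> >
> > 	/* A's store then writes back its stale copy and B's bit is
> > 	 * silently lost, so every concurrent update to the word
> > 	 * needs to use the atomic bitops */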
> >
> > > }
> > > }
> > >
> > > @@ -753,11 +756,40 @@ int locks_delete_block(struct file_lock *waiter)
> > > {
> > > int status = -ENOENT;
> > >
> > > + /*
> > > + * If fl_blocker is NULL, it won't be set again as this thread
> > > + * "owns" the lock and is the only one that might try to claim
> > > + * the lock. So it is safe to test fl_blocker locklessly.
> > > + * Also if fl_blocker is NULL, this waiter is not listed on
> > > + * fl_blocked_requests for some lock, so no other request can
> > > + * be added to the list of fl_blocked_requests for this
> > > + * request. So if fl_blocker is NULL, it is safe to
> > > + * locklessly check if fl_blocked_requests is empty. If both
> > > + * of these checks succeed, there is no need to take the lock.
> > > + *
> > > + * We perform these checks only if we can set FL_DELETING.
> > > + * This ensures that we don't race with __locks_wake_up_blocks()
> > > + * in a way which leads it to call wake_up() *after* we return
> > > + * and the file_lock is freed.
> > > + */
> > > + if (!test_and_set_bit_lock(FL_DELETING, &waiter->fl_flags)) {
> > > + if (waiter->fl_blocker == NULL &&
> > > + list_empty(&waiter->fl_blocked_requests)) {
> > > + /* Already fully unlinked */
> > > + clear_bit_unlock(FL_DELETING, &waiter->fl_flags);
> > > + return status;
> > > + }
> > > + }
> > > +
> > > spin_lock(&blocked_lock_lock);
> > > if (waiter->fl_blocker)
> > > status = 0;
> > > __locks_wake_up_blocks(waiter);
> > > __locks_delete_block(waiter);
> > > + /* This flag might not be set and it is largely irrelevant
> > > + * now, but it seems cleaner to clear it.
> > > + */
> > > + clear_bit(FL_DELETING, &waiter->fl_flags);
> > > spin_unlock(&blocked_lock_lock);
> > > return status;
> > > }
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 3cd4fe6b845e..4db514f29bca 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -1012,6 +1012,7 @@ static inline struct file *get_file(struct file *f)
> > > #define FL_UNLOCK_PENDING 512 /* Lease is being broken */
> > > #define FL_OFDLCK 1024 /* lock is "owned" by struct file */
> > > #define FL_LAYOUT 2048 /* outstanding pNFS layout */
> > > +#define FL_DELETING 32768 /* lock is being disconnected */
> >
> > nit: Why the big gap?
>
> No good reason - it seems like a conceptually different sort of flag so
> I vaguely felt that it would help if it were numerically separate.
>
> > > #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
> > >
> > > @@ -1087,7 +1088,7 @@ struct file_lock {
> > > * ->fl_blocker->fl_blocked_requests
> > > */
> > > fl_owner_t fl_owner;
> > > - unsigned int fl_flags;
> > > + unsigned long fl_flags;
> >
> > This will break kABI, so backporting this to enterprise distro kernels
> > won't be trivial. Not a showstopper, but it might be nice to avoid that
> > if we can.
> >
> > While it's not quite as efficient, we could just do the FL_DELETING
> > manipulation under the flc->flc_lock. That's per-inode, so it should be
> > safe to do it that way.
>
> If we are going to use a spinlock, I'd much rather not add a flag bit,
> but instead use the blocked_member list_head.
>

If we do want to go that route though, we'll probably need to make
variants of locks_delete_block() that can be called both with and
without the flc_lock held. Most of the fs/locks.c callers hold the
flc_lock when they call it -- most callers elsewhere don't.
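
Something like this, perhaps (a rough sketch with hypothetical names,
untested):

	/* for callers that already hold the flc_lock */
	static int locks_delete_block_locked(struct file_lock *waiter)
	{
		int status = -ENOENT;

		/* blocked_lock_lock nests inside the flc_lock */
		spin_lock(&blocked_lock_lock);
		if (waiter->fl_blocker)
			status = 0;
		__locks_wake_up_blocks(waiter);
		__locks_delete_block(waiter);
		spin_unlock(&blocked_lock_lock);
		return status;
	}

	/* for callers that don't hold the flc_lock; assumes i_flctx
	 * has already been set up for this inode */
	int locks_delete_block(struct file_lock *waiter)
	{
		struct file_lock_context *ctx =
			file_inode(waiter->fl_file)->i_flctx;
		int status;

		spin_lock(&ctx->flc_lock);
		status = locks_delete_block_locked(waiter);
		spin_unlock(&ctx->flc_lock);
		return status;
	}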

> I'm almost tempted to suggest adding
> smp_list_del_init_release() and smp_list_empty_careful_acquire()
> so that list membership can be used as a barrier. I'm not sure I'm
> game though.
>

Those do sound quite handy to have, but I'm not sure it's really
required. We could also just go back to considering the patch that
Linus sent originally, along with changing all of the
wait_event_interruptible calls to use
list_empty(&fl->fl_blocked_member) instead of !fl->fl_blocker as the
condition. (See attached)
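
FWIW, helpers like the ones you describe might look something like
this (untested, just to sketch the idea -- neither exists in the tree
today):

	static inline void smp_list_del_init_release(struct list_head *entry)
	{
		__list_del_entry(entry);
		entry->prev = entry;
		/* order the unlink before making "empty" observable */
		smp_store_release(&entry->next, entry);
	}

	static inline int smp_list_empty_careful_acquire(const struct list_head *head)
	{
		/* pairs with the release store above */
		struct list_head *next = smp_load_acquire(&head->next);

		return (next == head) && (next == head->prev);
	}

The release/acquire pair would let a waiter treat "off the list" as
the authoritative "I'm done" signal without taking blocked_lock_lock.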

--
Jeff Layton <jlayton@xxxxxxxxxx>
From 32477da99f429d204f97afef297bbc3c198bb360 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Mon, 9 Mar 2020 14:35:43 -0400
Subject: [PATCH] locks: reinstate locks_delete_lock optimization

There is a measurable performance impact in some synthetic tests from
commit 6d390e4b5d48 (locks: fix a potential use-after-free problem when
wakeup a waiter). Fix the race condition instead by clearing the
fl_blocker pointer after the wakeup and by using smp_load_acquire and
smp_store_release to handle the access.

This means that we can no longer use the clearing of fl_blocker as the
wait condition, so switch over to checking whether the fl_blocked_member
list is empty.

[ jlayton: wait on the fl_blocked_member list entry to go empty instead
of the fl_blocker pointer to clear. ]

Cc: yangerkun <yangerkun@xxxxxxxxxx>
Cc: NeilBrown <neilb@xxxxxxx>
Fixes: 6d390e4b5d48 (locks: fix a potential use-after-free problem when wakeup a waiter)
Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
---
fs/cifs/file.c | 3 ++-
fs/locks.c | 43 +++++++++++++++++++++++++++++++++++++------
2 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 3b942ecdd4be..8f9d849a0012 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1169,7 +1169,8 @@ cifs_posix_lock_set(struct file *file, struct file_lock *flock)
rc = posix_lock_file(file, flock, NULL);
up_write(&cinode->lock_sem);
if (rc == FILE_LOCK_DEFERRED) {
- rc = wait_event_interruptible(flock->fl_wait, !flock->fl_blocker);
+ rc = wait_event_interruptible(flock->fl_wait,
+ list_empty(&flock->fl_blocked_member));
if (!rc)
goto try_again;
locks_delete_block(flock);
diff --git a/fs/locks.c b/fs/locks.c
index 426b55d333d5..e78d37c73df5 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -725,7 +725,6 @@ static void __locks_delete_block(struct file_lock *waiter)
{
locks_delete_global_blocked(waiter);
list_del_init(&waiter->fl_blocked_member);
- waiter->fl_blocker = NULL;
}

static void __locks_wake_up_blocks(struct file_lock *blocker)
@@ -740,6 +739,12 @@ static void __locks_wake_up_blocks(struct file_lock *blocker)
waiter->fl_lmops->lm_notify(waiter);
else
wake_up(&waiter->fl_wait);
+
+ /*
+ * Tell the world we're done with it - see comment at
+ * top of locks_delete_block().
+ */
+ smp_store_release(&waiter->fl_blocker, NULL);
}
}

@@ -753,11 +758,32 @@ int locks_delete_block(struct file_lock *waiter)
{
int status = -ENOENT;

+ /*
+ * If fl_blocker is NULL, it won't be set again as this thread
+ * "owns" the lock and is the only one that might try to claim
+ * the lock. So it is safe to test fl_blocker locklessly.
+ * Also if fl_blocker is NULL, this waiter is not listed on
+ * fl_blocked_requests for some lock, so no other request can
+ * be added to the list of fl_blocked_requests for this
+ * request. So if fl_blocker is NULL, it is safe to
+ * locklessly check if fl_blocked_requests is empty. If both
+ * of these checks succeed, there is no need to take the lock.
+ */
+ if (!smp_load_acquire(&waiter->fl_blocker) &&
+ list_empty(&waiter->fl_blocked_requests))
+ return status;
+
spin_lock(&blocked_lock_lock);
if (waiter->fl_blocker)
status = 0;
__locks_wake_up_blocks(waiter);
__locks_delete_block(waiter);
+
+ /*
+ * Tell the world we're done with it - see comment at top
+ * of this function
+ */
+ smp_store_release(&waiter->fl_blocker, NULL);
spin_unlock(&blocked_lock_lock);
return status;
}
@@ -1350,7 +1376,8 @@ static int posix_lock_inode_wait(struct inode *inode, struct file_lock *fl)
error = posix_lock_inode(inode, fl, NULL);
if (error != FILE_LOCK_DEFERRED)
break;
- error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
+ error = wait_event_interruptible(fl->fl_wait,
+ list_empty(&fl->fl_blocked_member));
if (error)
break;
}
@@ -1435,7 +1462,8 @@ int locks_mandatory_area(struct inode *inode, struct file *filp, loff_t start,
error = posix_lock_inode(inode, &fl, NULL);
if (error != FILE_LOCK_DEFERRED)
break;
- error = wait_event_interruptible(fl.fl_wait, !fl.fl_blocker);
+ error = wait_event_interruptible(fl.fl_wait,
+ list_empty(&fl.fl_blocked_member));
if (!error) {
/*
* If we've been sleeping someone might have
@@ -1638,7 +1666,8 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)

locks_dispose_list(&dispose);
error = wait_event_interruptible_timeout(new_fl->fl_wait,
- !new_fl->fl_blocker, break_time);
+ list_empty(&new_fl->fl_blocked_member),
+ break_time);

percpu_down_read(&file_rwsem);
spin_lock(&ctx->flc_lock);
@@ -2122,7 +2151,8 @@ static int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
error = flock_lock_inode(inode, fl);
if (error != FILE_LOCK_DEFERRED)
break;
- error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
+ error = wait_event_interruptible(fl->fl_wait,
+ list_empty(&fl->fl_blocked_member));
if (error)
break;
}
@@ -2399,7 +2429,8 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
error = vfs_lock_file(filp, cmd, fl, NULL);
if (error != FILE_LOCK_DEFERRED)
break;
- error = wait_event_interruptible(fl->fl_wait, !fl->fl_blocker);
+ error = wait_event_interruptible(fl->fl_wait,
+ list_empty(&fl->fl_blocked_member));
if (error)
break;
}
--
2.24.1