Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation

From: Dave Chinner
Date: Sat Jul 27 2013 - 02:21:42 EST


On Fri, Jul 26, 2013 at 09:05:24PM -0700, Paul E. McKenney wrote:
> On Sat, Jul 27, 2013 at 12:57:03PM +1000, Dave Chinner wrote:
> > On Fri, Jul 26, 2013 at 04:28:52PM -0700, Paul E. McKenney wrote:
> > > Dave Jones reported RCU stalls, overly long hrtimer interrupts, and
> > > amazingly long NMI handlers from a trinity-induced workload involving
> > > lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369).
> > > There are any number of things that one might do to make sync() behave
> > > better under high levels of contention, but it is also the case that
> > > multiple concurrent sync() system calls can be satisfied by a single
> > > sys_sync() invocation.
> > >
> > > Given that this situation is reminiscent of rcu_barrier(), this commit
> > > applies the rcu_barrier() approach to sys_sync(). This approach uses
> > > a global mutex and a sequence counter. The mutex is held across the
> > > sync() operation, which eliminates contention between concurrent sync()
> > > operations.
> > >
> > > The counter is incremented at the beginning and end of
> > > each sync() operation, so that it is odd while a sync() operation is in
> > > progress and even otherwise, just like sequence locks.
> > >
> > > The code that used to be in sys_sync() is now in do_sync(), and sys_sync()
> > > now handles the concurrency. The sys_sync() function first takes a
> > > snapshot of the counter, then acquires the mutex, and then takes another
> > > snapshot of the counter. If the values of the two snapshots indicate that
> > > a full do_sync() executed during the mutex acquisition, the sys_sync()
> > > function releases the mutex and returns ("Our work is done!"). Otherwise,
> > > sys_sync() increments the counter, invokes do_sync(), and increments
> > > the counter again.
> > >
> > > This approach allows a single call to do_sync() to satisfy an arbitrarily
> > > large number of sync() system calls, which should eliminate issues due
> > > to large numbers of concurrent invocations of the sync() system call.
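
For anyone following along, the proposed scheme boils down to roughly
the following. This is a userspace mock-up of the description above,
not the actual patch; the names (sync_seq, sync_mutex, do_sync) are
illustrative only.

#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t sync_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_ulong sync_seq;	/* odd while a do_sync() is in flight */

static void do_sync(void)
{
	/* stand-in for the real writeback work */
}

void sync_syscall(void)
{
	/*
	 * Snapshot before taking the mutex, rounded up to even so an
	 * already in-flight do_sync() (odd count) doesn't count as
	 * having covered this caller's dirty data.
	 */
	unsigned long snap = atomic_load(&sync_seq);

	snap += snap & 1;
	pthread_mutex_lock(&sync_mutex);
	if (atomic_load(&sync_seq) - snap >= 2) {
		/* a full do_sync() began and ended after our snapshot */
		pthread_mutex_unlock(&sync_mutex);
		return;
	}
	atomic_fetch_add(&sync_seq, 1);		/* now odd: in progress */
	do_sync();
	atomic_fetch_add(&sync_seq, 1);		/* even again: complete */
	pthread_mutex_unlock(&sync_mutex);
}
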
> >
> > This is not addressing the problem that is causing issues during
> > sync. Indeed, it only puts a bandaid over the currently observed
> > trigger.
> >
> > I suspect that this will significantly slow down concurrent
> > sync operations, as it serialises sync across all superblocks
> > rather than serialising per-superblock as is currently done.
> > That per-superblock serialisation is where all the lock
> > contention problems are. And it's not sync alone that causes the
> > contention problems - it has to be combined with other concurrent
> > workloads that add or remove inodes from the inode cache at the
> > same time.
>
> Seems like something along the lines of the wakeup_flusher_threads()
> call currently at the start of sys_sync() would address this.

That's the first thing sync() currently does, but it doesn't solve
the problem because that only triggers WB_SYNC_NONE writeback, i.e.
it's not data integrity writeback, so if there's any contention it
will skip pages and inodes. So we still need sync itself to dispatch
the data integrity IO.
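
Roughly, and simplified from memory rather than quoted verbatim from
fs/sync.c, the current ordering looks like this:

SYSCALL_DEFINE0(sync)
{
	int wait = 1;

	/* async kick of the flusher threads: WB_SYNC_NONE, best
	 * effort, allowed to skip busy pages and inodes */
	wakeup_flusher_threads(0, WB_REASON_SYNC);

	/* the data integrity part: WB_SYNC_ALL writeback of each
	 * superblock's inodes, then ->sync_fs(), waiting on both */
	iterate_supers(sync_inodes_one_sb, NULL);
	iterate_supers(sync_fs_one_sb, &wait);

	return 0;
}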

Further, the flusher work has to be scheduled before anything gets
done, so if the system is CPU loaded there's a good chance that the
flusher thread won't get onto a CPU until sync actually queues the
data integrity sync work to it and then waits for that work to
complete...

> > I have patches to address that by removing the source
> > of the lock contention completely, and not just for the sys_sync
> > trigger. Those patches make the problems with concurrent
> > sys_sync operation go away completely for me, not to mention
> > significantly improving performance for 8+ thread metadata
> > workloads on XFS.
> >
> > IOWs, I don't see that concurrent sys_sync operation is a problem at
> > all, and it is actively desirable for systems that have multiple
> > busy filesystems as it allows concurrent dispatch of IO across those
> > multiple filesystems. Serialising all sys_sync work might stop the
> > contention problems, but it will also slow down concurrent sync
> > operations on busy systems as it only allows one thread to dispatch
> > and wait for IO at a time.
> >
> > So, let's not slap a bandaid over a symptom - let's address the
> > cause of the lock contention properly....
>
> Hmmm...
>
> Could you please send your patches over to Dave Jones right now? I am
> getting quite tired of getting RCU CPU stall warning complaints from
> him that turn out to be due to highly contended sync() system calls.

Then ignore them until the code is ready - it'll be 3.12 before the
fixes are merged anyway, because the lock contention fix requires
infrastructure (the per-node list infrastructure) that is currently
in mmotm and queued for 3.12 to fix a whole bunch of other, more
critical VFS lock contention problems. Seeing as a new mmotm went
out last week, I should have the patches ready for review early
next week.

FWIW, we (as in XFS filesystem testers) regularly run tests that
have hundreds of concurrent sys_sync() calls running at the same
time. For example, xfstests::xfs/297 runs 1000 fsstress processes
while freezing and unfreezing the filesystem, and that usually shows
hundreds of threads running sys_sync() concurrently after a short
amount of runtime. So it's pretty clear that what Dave is seeing is
not necessarily representative of what happens when there are lots
of sys_sync() calls running concurrently.

BTW, concurrent syncfs() calls are going to have exactly the same
problem as concurrent sync() calls, as is any other operation that
results in a walk of the per-superblock inodes list.
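
The walk in question is basically this (paraphrasing wait_sb_inodes()
in fs/fs-writeback.c from memory, not verbatim):

	struct inode *inode;

	spin_lock(&inode_sb_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		/* drop the list lock, wait for writeback on this
		 * inode's pages, then retake the list lock ... */
	}
	spin_unlock(&inode_sb_list_lock);

Every sync()/syncfs() caller ends up in that loop, and
inode_sb_list_lock is the same lock that inode instantiation and
eviction need, which is where the contention comes from.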

Cheers,

Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx