Re: [PATCH v2] writeback: avoid race when update bandwidth

From: Dave Chinner
Date: Wed Jun 13 2012 - 22:06:12 EST


On Wed, Jun 13, 2012 at 08:14:34PM +0800, Fengguang Wu wrote:
> On Wed, Jun 13, 2012 at 01:59:20PM +1000, Dave Chinner wrote:
> > On Tue, Jun 12, 2012 at 07:52:19PM +0800, Fengguang Wu wrote:
> > > On Tue, Jun 12, 2012 at 07:46:01PM +0800, Wanpeng Li wrote:
> > > > From: Wanpeng Li <liwp@xxxxxxxxxxxxxxxxxx>
> > > >
> > > > "V1 -> V2"
> > > > * remove dirty_lock
> > > >
> > > > Since bdi->wb.list_lock is used to protect the b_* lists,
> > > > so the flushers who call wb_writeback to writeback pages will
> > > > stuck when bandwidth update policy holds this lock. In order
> > > > to avoid this race we can introduce a new bandwidth_lock who
> > > > is responsible for protecting bandwidth update policy.
> > > >
> > > > Signed-off-by: Wanpeng Li <liwp.linux@xxxxxxxxx>
> > >
> > > Applied with a new title "writeback: use a standalone lock for
> > > updating write bandwidth". "race" is sensitive because it often
> > > refers to some locking error.
> >
> > Fengguang - can we get some evidence that this is a contended lock
> > before changing the scope of it? All of the previous "breaking up
> > global locks" have been done based on lock contention data, so
> > moving back to a global lock for this needs to have the same
> > analysis provided...
>
> Good point. Attached is the lockstat for the case "10 disks each runs
> 100 dd dirtier tasks":
>
> lkp-ne02/JBOD-10HDD-thresh=4G/xfs-100dd-1-3.2.0-rc5

(nothing attached)

> The wb->list_lock contention is much better than I expected, which is
> good. What stand out are
> waittime-total
> - &rq->lock by double_rq_lock() 6738952.13
> - clockevents_lock by clockevents_notify() 2155554.37
> - mapping->tree_lock by test_clear_page_writeback() 931550.13
> - sb_lock by grab_super_passive() 918815.87
> - &zone->lru_lock by pagevec_lru_move_fn() 912681.05
>
> - sysfs_mutex by sysfs_permission() 24029975.20 # mutex
> - ip->i_lock by xfs_ilock() 18428284.10 # mrlock

The wait time is not really an indication of contention problems.
Large wait time is usually an indication that the lock is being used
a lot.

What matters is the number of contentions vs the number of
acquisitions, and the number of those contentions that bounced the
lock. If the number of contentions is >= 0.5% of the acquisitions,
then the lock can be considered hot and needing some work. If I look
here:

http://lists.linux.hp.com/~enw/ext4/3.2/3.2-full-lockstats.2/ffsb_fsscale.xfs.large_file_creates_threads=192/profiling/iteration.1/lock_stat

Which is a 192 thread concurrent write on a 48-core machine, the
wb.list_lock shows 5,532 acquistions for the entire test, while the
mapping tree lock took 440 million!. So your test isn't really one
that shows wb.list_lock contention. The 192-thread mailserver
workload from the same machine:

http://lists.linux.hp.com/~enw/ext4/3.2/3.2-full-lockstats.2/ffsb_fsscale.xfs.mail_server_threads=192/profiling/iteration.1/lock_stat

Shows about 7.1m acquisitions of the wb.list_lock, but only 28,000
contentions. So it isn't really contended enough to justify
replacing it with a global lock.

FWIW, the third most contended lock on that workload is the XFS
delayed write queue lock - 25M acquisitions for 600k contentions - a
rate of about 2% which means quite severe contention. That lock no
longer exists in 3.5 - Christoph completely reworked the delayed
write buffer support to remove the global list and lock because it
was showing up in profiles like this...

Indeed, that profile shows that XFS owns 7 of the 10 most contended
locks, and 3 of them have had significant work done to reduce the
contention since 3.2 as a result of recent profile results like this.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/