Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning

From: Chris Mason
Date: Wed Jan 07 2009 - 12:21:08 EST


On Wed, 2009-01-07 at 08:25 -0800, Linus Torvalds wrote:
>
> On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> >
> > Change mutex contention behaviour such that it will sometimes busy wait on
> > acquisition - moving its behaviour closer to that of spinlocks.
>
> Ok, this one looks _almost_ ok.
>
> The only problem is that I think you've lost the UP case.
>
> In UP, you shouldn't have the code to spin, and the "spin_or_schedule()"
> should fall back to just the schedule case.
>
> It might also be worthwhile to try to not set the owner, and re-organize
> that a bit (by making it an inline function that sets the owner only for
> CONFIG_SMP or lockdep/debug).
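
For illustration only, here is a rough userspace sketch of the shape Linus
describes above: an inline helper that records the owner only when it is
needed, and a spin decision that compiles down to "just schedule" on UP.
This is not code from the patch. DEMO_SMP and DEMO_DEBUG_LOCKS are
hypothetical stand-ins for CONFIG_SMP and the lockdep/debug options, and
struct demo_mutex, struct task and both helpers are made-up names.

```c
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's real definitions. */
struct task {
    int id;
};

struct demo_mutex {
    int count;
#if defined(DEMO_SMP) || defined(DEMO_DEBUG_LOCKS)
    struct task *owner;   /* only present when somebody will read it */
#endif
};

/* Record the owner only for SMP or debug builds; a no-op otherwise. */
static inline void demo_mutex_set_owner(struct demo_mutex *lock,
                                        struct task *t)
{
#if defined(DEMO_SMP) || defined(DEMO_DEBUG_LOCKS)
    lock->owner = t;
#else
    (void)lock;
    (void)t;
#endif
}

/*
 * On UP there is no other CPU that could release the lock while we spin,
 * so the spin path is compiled out and the caller always schedules.
 */
static inline int demo_mutex_can_spin(void)
{
#ifdef DEMO_SMP
    return 1;
#else
    return 0;
#endif
}
```

Built without DEMO_SMP, demo_mutex_can_spin() returns 0 and
demo_mutex_set_owner() does nothing, which is the UP fallback.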

So far I haven't found any btrfs benchmarks where this is slower than
mutexes without any spinning. But, it isn't quite as fast as the btrfs
spin.

I'm using three different benchmarks, and they hammer on different
things. All against btrfs and a single sata drive.

* dbench -t 30 50, which means run dbench with 50 clients for 30 seconds. It is a
broad workload that hammers on lots of code, but it also tends to go
faster when things are less fair. These numbers are stable across runs.
Some IO is done at the very start of the run, but the bulk of the run is
CPU bound in various btrfs btrees.

Plain mutex: dbench reports 240MB/s
Simple spin: dbench reports 560MB/s
Peter's v4: dbench reports 388MB/s

* 50 procs creating 10,000 files (4k each), one dir per proc. The
result is 50 dirs and each dir has 10,000 files. This is mostly CPU
bound for the procs, but pdflush and friends are doing lots of IO.

Plain mutex: avg: 115 files/s, avg system time for each proc: 1.6s
Simple spin: avg: 152 files/s, avg system time for each proc: 2s
Peter's v4: avg: 130 files/s, avg system time for each proc: 2.9s

I would have expected Peter's patch to use less system time than my
spin. If I change his patch to limit the spin to 512 iterations (same
as my code), the system time goes back down to 1.7s, but the files/s
doesn't improve.
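
To make the 512-iteration cap concrete, here is a minimal userspace sketch
of a bounded spin followed by a blocking acquire. It is not the btrfs code
and not Peter's patch: SPIN_LIMIT and bounded_spin_lock() are hypothetical
names, and it simply polls with pthread_mutex_trylock() without looking at
the lock owner at all.

```c
#include <pthread.h>
#include <stdbool.h>

/* Assumption: same cap as the 512 iterations mentioned above. */
#define SPIN_LIMIT 512

/*
 * Poll the lock up to SPIN_LIMIT times; if it never becomes free, fall
 * back to a normal blocking acquire. The lock is held on return either
 * way. Returns true if it was taken while spinning, false if the caller
 * had to go through the blocking path.
 */
static bool bounded_spin_lock(pthread_mutex_t *lock)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(lock) == 0)
            return true;
    }
    pthread_mutex_lock(lock);
    return false;
}
```

The cap bounds how much CPU a waiter can burn before sleeping, which is
the knob that moved the system time in the numbers above.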

* Parallel stat: the last benchmark is the most interesting, since it
really hammers on the btree locking speed. I take the directory tree
from the file creation run (50 dirs, 10,000 files each) and have 50 procs
running stat on all the files in parallel.

Before the run I clear the inode cache but leave the page cache. So,
everything is hot in the btree but there are no inodes in cache.

Plain mutex: 9.488s real, 8.6s sys
Simple spin: 3.8s real, 13.8s sys
Peter's v4: 7.9s real, 8.5s sys
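
For a quick side-by-side, the numbers from the three benchmarks can be
turned into speedups over the plain mutex. The sketch below only does that
arithmetic on the figures reported in this mail; struct result, speedup()
and print_speedups() are hypothetical helper names.

```c
#include <stdio.h>

struct result {
    const char *name;
    double plain, spin, v4;
    int lower_is_better;   /* 1 for elapsed time, 0 for throughput */
};

/* Figures copied from the three benchmarks in this mail. */
static const struct result results[] = {
    { "dbench (MB/s)",            240.0, 560.0, 388.0, 0 },
    { "file creation (files/s)",  115.0, 152.0, 130.0, 0 },
    { "parallel stat (real s)",   9.488,   3.8,   7.9, 1 },
};

/* Ratio > 1 means "better than baseline" for both kinds of metric. */
static double speedup(double baseline, double value, int lower_is_better)
{
    return lower_is_better ? baseline / value : value / baseline;
}

static void print_speedups(void)
{
    size_t n = sizeof(results) / sizeof(results[0]);

    for (size_t i = 0; i < n; i++) {
        const struct result *r = &results[i];

        printf("%-26s spin %.2fx  v4 %.2fx\n", r->name,
               speedup(r->plain, r->spin, r->lower_is_better),
               speedup(r->plain, r->v4, r->lower_is_better));
    }
}
```

Calling print_speedups() prints one line per benchmark with both ratios
relative to the plain mutex.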

-chris


