Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

From: Mike Waychison
Date: Wed Jan 21 2009 - 12:28:47 EST

Andi Kleen wrote:
On Tue, Jan 20, 2009 at 10:22:00PM -0800, Mike Waychison wrote:
Andi Kleen wrote:
Mike Waychison <mikew@xxxxxxxxxx> writes:

livelock on dcache_lock/inode_lock (specifically in atomic_dec_and_lock())
I'm not sure how something can livelock in atomic_dec_and_lock which
doesn't take a spinlock itself? Are you saying you run into NUMA memory
unfairness here? Or did I misparse you?
By atomic_dec_and_lock, I really meant to say _atomic_dec_and_lock().

Ok. So it's basically just the lock that is taken?

In theory one could likely provide an x86 specific dec-and_lock that
might perform better and doesn't lock if the count is still > 0, but that would only help if the reference count is still > 0. Is that a common
situation in your test?

This is precisely what _atomic_dec_and_lock does. It only locks if it looks like the counter will be hitting 0 with this dec.

It is a common case for the count to be > 1, however the troubles we see happen with there are a large number of final dputs (and thus also final iputs)

It takes the spinlock if the cmpxchg hidden inside atomic_dec_unless fails.

There are likely NUMA unfairness issues at play, but it's not the main worry at this point.

This patchset is an attempt to try and reduce the locking overheads associated
with final dput() and final iput(). This is done by batching dentries and
inodes into per-process queues and processing them in 'parallel' to consolidate
some of the locking.
I was wondering what this does to the latencies when dput/iput
is only done for very objects. Does it increase costs then
very objects?


"is only done for very few objects". Somnhow the few got lost.
Basically latency in the unloaded case.

I always worry when people do complicated things for the high
load case how the more usual "do it for a single object" workload

Hmm. I'm not sure. The inline latency should fair out to be better than before on average (common case is to disable pre-emption, enqueue on the postponed list and re-enable pre-emption, which is fast). It's the clearing of the queues that may take a little longer, though the locking overhead should be the same. The trouble is that the queue itself may not be cache hot and that would add a bit of latency.

Would you know of a good way to detect this latency for fewer objects?

As a high level comment it seems like a lot of work to work
around global locks, like the inode_lock, where it might be better to just split the lock up? Mind you I don't have a clear proposal
how to do that, but surely it's doable somehow.

Perhaps.. the only plausible way I can think this would be doable would be to rework the global resources (like the global inode_unused LRU list

One simple way would be to just use multiple lists with an own lock each. I doubt that would impact the LRU behaviour very much.

and deal with inode state transitions), but even then, some sort of consistency needs to happen at the super_block level,

The sb could also look at multiple lists?

Maybe (hashing on inode nr perhaps?).

The other trouble is that dcache_lock which for the most part stands in from of the inode_lock. While developing this patchset, I was originally chasing dcache_lock contention issues. Once I relieved the contention via deferred batching and parallel processing of dput(), I was surprised to see that performance didn't get any better because now I was mostly contending on inode_lock :\

The dcache_lock again could maybe be broken out to be per super_block, but I have no idea how to make that work with mountpoint traversals. Making this guy become a set of smaller locks seems difficult at best.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at