Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

From: Andi Kleen
Date: Wed Jan 21 2009 - 03:33:19 EST

Next message: Paul Mundt: "Re: [PATCH] dma: fix up broken comparison in dma_alloc_from_coherent"
Previous message: Nick Piggin: "Re: [PATCH] cpuset: fix allocating page cache/slab object on the unallowed node when memory spread is set"
In reply to: Mike Waychison: "Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention"
Next in thread: Mike Waychison: "Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention"

On Tue, Jan 20, 2009 at 10:22:00PM -0800, Mike Waychison wrote:
> Andi Kleen wrote:
> >Mike Waychison <mikew@xxxxxxxxxx> writes:
> >
> >>livelock on dcache_lock/inode_lock (specifically in
> >>atomic_dec_and_lock())
> >
> >I'm not sure how something can livelock in atomic_dec_and_lock which
> >doesn't take a spinlock itself? Are you saying you run into NUMA memory
> >unfairness here? Or did I misparse you?
>
> By atomic_dec_and_lock, I really meant to say _atomic_dec_and_lock().

Ok. So it's basically just the lock that is taken?

In theory one could likely provide an x86-specific dec_and_lock that
might perform better and doesn't lock if the count is still > 0, but that
would only help if the reference count is still > 0. Is that a common
situation in your test?
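
To make the fast path/slow path split concrete, here is a minimal
userspace sketch of the dec-and-lock pattern. It assumes C11 atomics and
a pthread mutex standing in for atomic_t and the spinlock; the names
dec_unless_one() and dec_and_lock_sketch() are made up for illustration
and are not kernel API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decrement *count unless it is 1 (or lower).
 * Returns true if the decrement happened without reaching zero.
 */
static bool dec_unless_one(atomic_int *count)
{
    int old = atomic_load(count);

    while (old > 1) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old - 1))
            return true;
    }
    return false;
}

/*
 * Returns true with *lock held if this call dropped the count to zero.
 * Returns false with the lock not held otherwise.
 */
static bool dec_and_lock_sketch(atomic_int *count, pthread_mutex_t *lock)
{
    /* Fast path: count stays above zero, the lock is never touched. */
    if (dec_unless_one(count))
        return false;

    /* Slow path: we may be dropping the last reference. */
    pthread_mutex_lock(lock);
    if (atomic_fetch_sub(count, 1) == 1)
        return true;            /* caller must unlock */
    pthread_mutex_unlock(lock);
    return false;
}
```

The fast path only helps when most calls leave the count above zero. If
most calls are final puts, every one of them still ends up on the lock.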

> It takes the spinlock if the cmpxchg hidden inside atomic_dec_unless fails.
>
> There are likely NUMA unfairness issues at play, but it's not the main
> worry at this point.
>
> >
> >>This patchset is an attempt to try and reduce the locking overheads
> >>associated
> >>with final dput() and final iput(). This is done by batching dentries and
> >>inodes into per-process queues and processing them in 'parallel' to
> >>consolidate
> >>some of the locking.
> >
> >I was wondering what this does to the latencies when dput/iput
> >is only done for very objects. Does it increase costs then
> >significantly?
>
> very objects?

Sorry.

"is only done for very few objects". Somehow the "few" got lost.
Basically latency in the unloaded case.

I always worry when people do complicated things for the high
load case how the more usual "do it for a single object" workload
fares.

>
> >
> >As a high level comment it seems like a lot of work to work
> >around global locks, like the inode_lock, where it might be better to
> >just split the lock up? Mind you I don't have a clear proposal
> >how to do that, but surely it's doable somehow.
> >
>
> Perhaps.. the only plausible way I can think this would be doable would
> be to rework the global resources (like the global inode_unused LRU list

One simple way would be to just use multiple lists, each with its own
lock. I doubt that would impact the LRU behaviour very much.

> and deal with inode state transitions), but even then, some sort of
> consistency needs to happen at the super_block level,

The sb could also look at multiple lists?
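
A rough userspace sketch of the multiple-lists idea, only to show the
locking structure. Everything here is an assumption for illustration:
struct obj stands in for an inode, NR_UNUSED_LISTS and the address hash
are arbitrary, and pthread mutexes stand in for spinlocks. Each list
keeps its own oldest-first order, so the LRU is only approximate across
lists.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NR_UNUSED_LISTS 4   /* hypothetical; would likely scale with CPUs */

struct obj {
    struct obj *next;
};

struct unused_list {
    pthread_mutex_t lock;   /* protects only this list */
    struct obj *head;       /* oldest entry */
    struct obj *tail;       /* newest entry */
    size_t nr;
};

static struct unused_list unused_lists[NR_UNUSED_LISTS];

static void unused_lists_init(void)
{
    for (size_t i = 0; i < NR_UNUSED_LISTS; i++) {
        pthread_mutex_init(&unused_lists[i].lock, NULL);
        unused_lists[i].head = NULL;
        unused_lists[i].tail = NULL;
        unused_lists[i].nr = 0;
    }
}

/* Pick a list from the object's address; any cheap stable hash would do. */
static struct unused_list *unused_list_for(const struct obj *o)
{
    return &unused_lists[((uintptr_t)o >> 4) % NR_UNUSED_LISTS];
}

/* Append to the tail of the object's list, taking only that list's lock. */
static void unused_add(struct obj *o)
{
    struct unused_list *l = unused_list_for(o);

    pthread_mutex_lock(&l->lock);
    o->next = NULL;
    if (l->tail)
        l->tail->next = o;
    else
        l->head = o;
    l->tail = o;
    l->nr++;
    pthread_mutex_unlock(&l->lock);
}

/* Remove up to 'want' of the oldest objects, visiting each list in turn. */
static size_t unused_prune(size_t want)
{
    size_t freed = 0;

    for (size_t i = 0; i < NR_UNUSED_LISTS && freed < want; i++) {
        struct unused_list *l = &unused_lists[i];

        pthread_mutex_lock(&l->lock);
        while (l->head && freed < want) {
            l->head = l->head->next;
            if (!l->head)
                l->tail = NULL;
            l->nr--;
            freed++;
        }
        pthread_mutex_unlock(&l->lock);
    }
    return freed;
}
```

A walker such as unused_prune() never holds more than one list lock at a
time, so adders on other lists are not blocked while it runs.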

-Andi
--
ak@xxxxxxxxxxxxxxx -- Speaking for myself only.