Re: [BUG] Lockdep recursive locking in kmem_cache_free

From: Ravikiran G Thirumalai
Date: Fri Jul 28 2006 - 13:08:10 EST


On Fri, Jul 28, 2006 at 07:53:56AM -0700, Christoph Lameter wrote:
> On Fri, 28 Jul 2006, Pekka Enberg wrote:
>
> > > [ 57.976447] [<ffffffff802542fc>] __lock_acquire+0x8cc/0xcb0
> > > [ 57.976562] [<ffffffff80254a02>] lock_acquire+0x52/0x70
> > > [ 57.976675] [<ffffffff8028f201>] kmem_cache_free+0x141/0x210
> > > [ 57.976790] [<ffffffff804a6b74>] _spin_lock+0x34/0x50
> > > [ 57.976903] [<ffffffff8028f201>] kmem_cache_free+0x141/0x210
> > > [ 57.977018] [<ffffffff8028f388>] slab_destroy+0xb8/0xf0
>
> Huh? _spin_lock calls kmem_cache_free?
>
> > cache_reap
> > reap_alien (grabs l3->alien[node]->lock)
> > __drain_alien_cache
> > free_block
> > slab_destroy (slab management off slab)
> > kmem_cache_free
> > __cache_free
> > cache_free_alien (recursive attempt on l3->alien[node] lock)
> >
> > Christoph?
>
> This should not happen. __drain_alien_cache frees node local elements
> thus cache_free_alien should not be called. However, if the slab
> management was allocated on a different node from the slab data then we
> may have an issue. However, both slab management and the slab data are
> allocated on the same node (with alloc_pages_node() and kmalloc_node).

cache_free_alien could get called, but there is no recursion here:

1. reap_alien tries to drop remote objects, freed on the local node (A) but
belonging to a remote node (B), into the shared array cache of the remote
node (B is chosen by the node rotor). To do this it takes the local alien
cache lock (A) and calls __drain_alien_cache. Say the remote objects come
from a slab cache X.
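
For reference, step 1 looks roughly like the sketch below. This is
paraphrased from my reading of the 2.6.x slab code, not the exact mm/slab.c
source; helper and field names (reap_node, spin_trylock_irq, the array_cache
members) are from memory and approximate.

/*
 * Runs on the local node A as part of cache_reap().
 * cachep is the slab cache X, l3 is X's kmem_list3 on A.
 */
static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3)
{
	int node = __get_cpu_var(reap_node);	/* remote node B, per the node rotor */

	if (l3->alien) {
		struct array_cache *ac = l3->alien[node];

		/* First alien lock: A's alien cache for node B (cache X). */
		if (ac && ac->avail && spin_trylock_irq(&ac->lock)) {
			__drain_alien_cache(cachep, ac, node);
			spin_unlock_irq(&ac->lock);
		}
	}
}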

2. __drain_alien_cache takes the remote node's l3 lock (B), transfers as many
objects as the shared array cache of the remote node can hold, and calls
free_block to free the remaining objects that could not be dropped into the
shared array cache of B. Note that free_block is now being called on (A) to
free objects belonging to (B).
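
Roughly, the drain itself looks like this (again a sketch, same caveats as
above; transfer_objects/list_lock names are approximate):

/*
 * Still running on node A; 'node' is the remote node B.
 */
static void __drain_alien_cache(struct kmem_cache *cachep,
				struct array_cache *ac, int node)
{
	struct kmem_list3 *rl3 = cachep->nodelists[node];

	if (ac->avail) {
		spin_lock(&rl3->list_lock);	/* remote node's l3 lock (B) */

		/* Drop as many objects as possible into B's shared array cache. */
		if (rl3->shared)
			transfer_objects(rl3->shared, ac, ac->limit);

		/* Whatever did not fit goes back to B's slab lists; this is
		 * where slab_destroy() can end up being called from A. */
		free_block(cachep, ac->entry, ac->avail, node);
		ac->avail = 0;
		spin_unlock(&rl3->list_lock);
	}
}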

3. If free_block empties a slab belonging to B, it calls slab_destroy, which
calls kmem_cache_free on the (off-slab) slab management; that goes through
__cache_free and hence cache_free_alien(). Since we are running on A but
freeing an object local to B, the node check in cache_free_alien does not
short-circuit, and the alien path *does* get executed: A writes the object
into its local alien cache corresponding to B, taking that alien cache's
lock. The slab management comes from a slab cache Y, so the alien cache
locked here belongs to Y, while the one already held belongs to X. There
would be real recursion only if X and Y were the same cache, which is not a
possibility at all, since the off-slab management for a slab cache cannot
come from that same slab cache. Lockdep, however, only sees two acquisitions
of the same lock class, so this looks like a false positive from lockdep.
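
The relevant part of cache_free_alien() for step 3, again as a rough sketch
(the virt_to_slab()/numa_node_id() details are approximate, not the exact
source):

/*
 * Here cachep is the slab management cache Y, and we run on node A.
 */
static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
{
	int nodeid = virt_to_slab(objp)->nodeid;	/* B */
	struct kmem_list3 *l3;
	struct array_cache *alien;

	/* If the object were local to the freeing node, we would bail out
	 * here and never touch the alien caches ... */
	if (likely(nodeid == numa_node_id()))
		return 0;

	/* ... but we run on A and the object belongs to B, so we take A's
	 * alien cache lock for B again -- same lock class as the one held
	 * by reap_alien(), but a different lock, since this is cache Y. */
	l3 = cachep->nodelists[numa_node_id()];
	if (l3->alien && l3->alien[nodeid]) {
		alien = l3->alien[nodeid];
		spin_lock(&alien->lock);
		if (unlikely(alien->avail == alien->limit))
			__drain_alien_cache(cachep, alien, nodeid);
		alien->entry[alien->avail++] = objp;
		spin_unlock(&alien->lock);
	} else {
		spin_lock(&cachep->nodelists[nodeid]->list_lock);
		free_block(cachep, &objp, 1, nodeid);
		spin_unlock(&cachep->nodelists[nodeid]->list_lock);
	}
	return 1;
}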

tglx, does the machine boot without lockdep? If yes, then this is a false
positive IMO.

Thanks,
Kiran