Re: linux-next: slab shrinkers: BUG at mm/list_lru.c:92

From: Glauber Costa
Date: Mon Jun 17 2013 - 18:30:19 EST


On Mon, Jun 17, 2013 at 02:35:08PM -0700, Andrew Morton wrote:
> On Mon, 17 Jun 2013 19:14:12 +0400 Glauber Costa <glommer@xxxxxxxxx> wrote:
>
> > > I managed to trigger:
> > > [ 1015.776029] kernel BUG at mm/list_lru.c:92!
> > > [ 1015.776029] invalid opcode: 0000 [#1] SMP
> > > with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203
> > > on top.
> > >
> > > This is obviously BUG_ON(nlru->nr_items < 0) and
> > > ffffffff81122d0b: 48 85 c0 test %rax,%rax
> > > ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12)
> > > ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 <list_lru_walk_node+0x110>
> > > ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12)
> > > ffffffff81122d1f: 78 7b js ffffffff81122d9c <list_lru_walk_node+0x10c>
> > > [...]
> > > ffffffff81122d9c: 0f 0b ud2
> > >
> > > RAX is -1UL.
> > Yes, fearing that kind of imbalance, we decided to keep the counter signed and BUG on a
> > negative value, instead of making it unsigned.
> >
> > >
> > > I assume that the current backtrace is of no use and it would most
> > > probably be some shrinker which doesn't behave.
> > >
> > There are currently 3 users of list_lru in tree: dentries, inodes and xfs.
> > Assuming you are not using xfs, we are left with dentries and inodes.
> >
> > The first thing to do is to find which one of them is misbehaving. You can try finding
> > this out from the address of the list_lru and where it lies in the superblock.
> >
> > Once we know which of them is misbehaving, we'll have to figure out why.
>
> The trace says shrink_slab_node->super_cache_scan->prune_icache_sb. So
> it's inodes?
>
Assuming there is no memory corruption of any sort going on, let's check the code.
nr_items is only manipulated in 3 places:

1) list_lru_add, where it is increased
2) list_lru_del, where it is decreased in case the user has voluntarily removed the
element from the list
3) list_lru_walk_node, where it is decreased when an element is removed during shrink.
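To make the three update sites concrete, here is a minimal single-threaded userspace model of one list_lru node, assuming a kernel-style intrusive list. The names (`struct lru`, `lru_add`, `lru_del`, `lru_walk_isolate`) are hypothetical, and the per-node spinlock is omitted: the point is only that each operation moves one item and adjusts `nr_items` by exactly one.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal intrusive circular list, modeled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *item, struct list_head *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

static void list_del_init(struct list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	INIT_LIST_HEAD(item);
}

/* Hypothetical model of one list_lru node (the real one also has a spinlock). */
struct lru {
	struct list_head list;
	long nr_items;	/* signed: an underflow shows up as a negative value */
};

static void lru_init(struct lru *l)
{
	INIT_LIST_HEAD(&l->list);
	l->nr_items = 0;
}

/* 1) add: only links and counts an item that is not already on a list */
static bool lru_add(struct lru *l, struct list_head *item)
{
	if (!list_empty(item))
		return false;
	list_add(item, &l->list);
	l->nr_items++;
	return true;
}

/* 2) del: like list_lru_del, treats !list_empty(item) as "on this LRU" */
static bool lru_del(struct lru *l, struct list_head *item)
{
	if (list_empty(item))
		return false;
	list_del_init(item);
	l->nr_items--;
	return true;
}

/* 3) walk: isolate every item onto a caller-owned dispose list */
static void lru_walk_isolate(struct lru *l, struct list_head *dispose)
{
	while (!list_empty(&l->list)) {
		struct list_head *item = l->list.next;

		list_del_init(item);
		list_add(item, dispose);
		l->nr_items--;
	}
}
```

In this model every decrement is paired with an earlier increment, as long as `lru_del` is only ever called on items that really are on the LRU.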

All three excerpts seem to be correctly locked, so a negative count like this indicates an
imbalance: either the element was never added to the list, or it was added, removed, and we
didn't notice it. (Again, your backing storage is not XFS, is it? If it is, we have another
user to look at.)

I will assume that Andrew is correct and this is inode-related. list_lru_del reads as follows:
spin_lock(&nlru->lock);
if (!list_empty(item)) { ... }

So one possibility is that we are manipulating this list outside this lock somewhere. Looking
at inode.c: we always manipulate the LRU inside the lock, but an inode whose i_lru is on a list
is not necessarily on the LRU. Since list_lru_del only checks !list_empty(item), it cannot tell
the LRU apart from any other list. Could the element be on the dispose list while someone calls
list_lru_del on it, unlinking it and decrementing nr_items a second time?

The callers of inode_lru_list_del are iput_final, evict_inodes and invalidate_inodes.
Both evict_inodes and invalidate_inodes follow this pattern:

inode->i_state |= I_FREEING;
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);

IOW, they remove the element from the LRU and add it to the dispose list.
Both of them also bail out if they see I_FREEING already set, so they are safe
against each other, because the flag is only tested and set under the lock.
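The I_FREEING handshake between the two scanners can be sketched in the same style. This is a hypothetical single-threaded model: `struct inode_model`, `try_dispose` and the numeric value of `I_FREEING` are made up for illustration, and in the kernel the test and the flag update both happen under `inode->i_lock`, which is what actually makes the scanners safe against each other.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal intrusive circular list, modeled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *item, struct list_head *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

static void list_del_init(struct list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	INIT_LIST_HEAD(item);
}

/* Hypothetical stand-in for one list_lru node; the spinlock is omitted. */
struct lru {
	struct list_head list;
	long nr_items;
};

/* Same test as list_lru_del: any non-empty item is assumed to be on the LRU. */
static void lru_del(struct lru *l, struct list_head *item)
{
	if (!list_empty(item)) {
		list_del_init(item);
		l->nr_items--;
	}
}

#define I_FREEING 0x1u	/* hypothetical bit value, not the kernel's */

/* Hypothetical reduced inode: just the state word and the LRU linkage. */
struct inode_model {
	unsigned int i_state;
	struct list_head i_lru;
};

/* One scanner's step (hypothetical helper); the kernel holds i_lock here. */
static bool try_dispose(struct lru *l, struct inode_model *inode,
			struct list_head *dispose)
{
	if (inode->i_state & I_FREEING)
		return false;		/* another scanner already claimed it */
	inode->i_state |= I_FREEING;
	lru_del(l, &inode->i_lru);
	list_add(&inode->i_lru, dispose);
	return true;
}
```

The second scanner sees the flag and leaves both the counter and the first scanner's dispose list alone.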

But what about iput_final? It seems to me that if iput_final runs at the same time as either
of the other two, this *could* happen (maybe there is some extra protection that can be seen
from Australia but not from here. Dave?)
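The suspected sequence can be replayed in a single-threaded userspace model, assuming a kernel-style intrusive list and a hypothetical `struct lru` standing in for one list_lru node. `evict_to_dispose` follows the evict_inodes/invalidate_inodes pattern quoted above, and `iput_final_old_check` mirrors the `if (!list_empty(&inode->i_lru))` test that the patch removes from iput_final; both helper names are made up. Locks are omitted and the interleaving is simply written out in order.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal intrusive circular list, modeled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *item, struct list_head *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

static void list_del_init(struct list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	INIT_LIST_HEAD(item);
}

/* Hypothetical stand-in for one list_lru node; the spinlock is omitted. */
struct lru {
	struct list_head list;
	long nr_items;
};

static void lru_add(struct lru *l, struct list_head *item)
{
	if (list_empty(item)) {
		list_add(item, &l->list);
		l->nr_items++;
	}
}

/* Same test as list_lru_del: any non-empty item is assumed to be on the LRU. */
static void lru_del(struct lru *l, struct list_head *item)
{
	if (!list_empty(item)) {
		list_del_init(item);
		l->nr_items--;
	}
}

/* Hypothetical helper: evict_inodes()-style move off the LRU onto dispose. */
static void evict_to_dispose(struct lru *l, struct list_head *i_lru,
			     struct list_head *dispose)
{
	lru_del(l, i_lru);
	list_add(i_lru, dispose);
}

/* Hypothetical helper: iput_final() before the patch, list_empty() check only. */
static void iput_final_old_check(struct lru *l, struct list_head *i_lru)
{
	if (!list_empty(i_lru))
		lru_del(l, i_lru);
}
```

Running add, then `evict_to_dispose`, then `iput_final_old_check` on the same item leaves `nr_items` at -1, the kind of imbalance the signed counter is there to catch, and the inode has also been silently unlinked from the dispose list.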

Right now this is my best theory.

I am attaching a patch that should make a difference if I am right.

diff --git a/fs/inode.c b/fs/inode.c
index 00b804e..c46c92e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -419,6 +419,8 @@ void inode_add_lru(struct inode *inode)
 
 static void inode_lru_list_del(struct inode *inode)
 {
+	if (inode->i_state & I_FREEING)
+		return;
 
 	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_dec(nr_unused);
@@ -1381,9 +1383,8 @@ static void iput_final(struct inode *inode)
 		inode->i_state &= ~I_WILL_FREE;
 	}
 
+	inode_lru_list_del(inode);
 	inode->i_state |= I_FREEING;
-	if (!list_empty(&inode->i_lru))
-		inode_lru_list_del(inode);
 	spin_unlock(&inode->i_lock);
 
 	evict(inode);