Re: 2.6.14-rc1-git-now still dying in mm/slab - this time line 1849
From: Alok Kataria
Date: Tue Sep 20 2005 - 03:32:37 EST
Hi,
Attached is a patch which, in the cache_reap() code path, stores numa_node_id()
in a local variable after disabling interrupts.
I was not able to reproduce the bug Petr reported, but if the cache-reap
threads are indeed being scheduled across CPUs, which may be the problem
here, then this should fix it.
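Roughly, the reordering inside cache_reap() looks like this (a sketch of the
idea only, not the patch text itself): sample the node id only after
interrupts are off, so the thread cannot migrate between choosing the node
and taking that node's list_lock.

	local_irq_disable();
	nodeid = numa_node_id();		/* stable: no migration with irqs off */
	l3 = searchp->nodelists[nodeid];

	spin_lock(&l3->list_lock);		/* was spin_lock_irq() before */
	drain_array_locked(searchp, ac_data(searchp), 0, nodeid);
	...
	spin_unlock(&l3->list_lock);
	local_irq_enable();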
Andrew, I also have a patch which fixes the CPU_DOWN code path; I will send
it to you later.
Thanks & Regards,
Alok.
On Tue, 2005-09-20 at 10:46, Andrew Morton wrote:
> Christoph Lameter <clameter@xxxxxxxxxxxx> wrote:
> >
> > On Mon, 19 Sep 2005, Andrew Morton wrote:
> >
> > > list_for_each(walk, &cache_chain) {
> > > kmem_cache_t *searchp;
> > > struct list_head* p;
> > > int tofree;
> > > struct slab *slabp;
> > >
> > > searchp = list_entry(walk, kmem_cache_t, next);
> > >
> > > if (searchp->flags & SLAB_NO_REAP)
> > > goto next;
> > >
> > > check_irq_on();
> > >
> > > l3 = searchp->nodelists[numa_node_id()];
> > > if (l3->alien)
> > > drain_alien_cache(searchp, l3);
> > > ->preempt here
> > > spin_lock_irq(&l3->list_lock);
> > >
> > > drain_array_locked(searchp, ac_data(searchp), 0,
> > > numa_node_id());
> > > ->oops, wrong node.
> >
> > This is called from keventd which exists per processor. Hmmm... This looks
> > as if it can change processors after all
>
> Well no, it would be a big bug if a keventd thread were to change CPUs.
>
> It's OK to rely upon the pinnedness of keventd I guess - a comment would be
> nice.
>
> > but the slab allocator depends on
> > it running on the right processor. So does the page allocator. sigh. What
> > is the point of having per processor workqueues if they do not stay on
> > the assigned processor?
>
> They do. I don't believe that preemption is the source of this BUG.
> (Petr, does CONFIG_PREEMPT=n fix it?)
>
> > The fast fix for this case is to get the node number once and then use it
> > consistently.
>
> If one is writing preempt-safe code then one should disable preemption
> before copying the current CPU number into a local variable.
>
> > But we really need to audit the slab and page allocator for
> > additional cases like this or disable preempt and check for the right
> > processor in cache_reap().
>
> numa_node_id() must use smp_processor_id(), not raw_smp_processor_id().
> Then all the runtime squawks need to be audited and fixed, or switched to
> (new) raw_numa_node_id() if it is verified that a CPU/node switch at any
> time is OK.
>
--
There are two ways of constructing a software design. One way is to make
it so simple that there are obviously no deficiencies. And the other way
is to make it so complicated that there are no obvious deficiencies.
Signed-off-by: Alok N Kataria <alokk@xxxxxxxxxxxxxx>
Index: linux-2.6.13/mm/slab.c
===================================================================
--- linux-2.6.13.orig/mm/slab.c 2005-09-13 20:56:33.040284000 +0530
+++ linux-2.6.13/mm/slab.c 2005-09-20 13:22:08.328464250 +0530
@@ -3273,7 +3273,7 @@
 	list_for_each(walk, &cache_chain) {
 		kmem_cache_t *searchp;
 		struct list_head* p;
-		int tofree;
+		int tofree, nodeid;
 		struct slab *slabp;
 
 		searchp = list_entry(walk, kmem_cache_t, next);
@@ -3283,13 +3283,19 @@
 
 		check_irq_on();
 
-		l3 = searchp->nodelists[numa_node_id()];
+		nodeid = numa_node_id();
+		l3 = searchp->nodelists[nodeid];
 		if (l3->alien)
 			drain_alien_cache(searchp, l3);
-		spin_lock_irq(&l3->list_lock);
+
+		local_irq_disable();
+		nodeid = numa_node_id();
+		l3 = searchp->nodelists[nodeid];
+
+		spin_lock(&l3->list_lock);
 
 		drain_array_locked(searchp, ac_data(searchp), 0,
-				numa_node_id());
+				nodeid);
 
 		if (time_after(l3->next_reap, jiffies))
 			goto next_unlock;
@@ -3298,7 +3304,7 @@
 
 		if (l3->shared)
 			drain_array_locked(searchp, l3->shared, 0,
-				numa_node_id());
+				nodeid);
 
 		if (l3->free_touched) {
 			l3->free_touched = 0;
@@ -3327,7 +3333,8 @@
 			spin_lock_irq(&l3->list_lock);
 		} while(--tofree > 0);
 next_unlock:
-		spin_unlock_irq(&l3->list_lock);
+		spin_unlock(&l3->list_lock);
+		local_irq_enable();
 next:
 		cond_resched();
 	}