Re: [PATCH 00/49] Automatic NUMA Balancing v10

From: Ingo Molnar
Date: Mon Dec 17 2012 - 05:33:48 EST

* Mel Gorman <mgorman@xxxxxxx> wrote:

> > > [...] Holding PTL across task_numa_fault is bad, but not
> > > the bad we're looking for.
> >
> > No, holding the PTL across task_numa_fault() is fine,
> > because this bit got reworked in my tree rather
> > significantly, see:
> >
> > 6030a23a1c66 sched: Move the NUMA placement logic to a
> > worklet
> >
> > and followup patches.
> I believe I see your point. After that patch is applied
> task_numa_fault() is a relatively small function and is no
> longer calling task_numa_placement. Sure, PTL is held longer
> than necessary but not enough to cause real scalability
> issues.

Yes - my motivation for that was three-fold:

1) to push rebalancing into process context and thus make it
essentially lockless and also potentially preemptable.

2) enable the flip-tasks logic, which relies on taking a
balancing decision and acting on it immediately. If you are
in process context then this is doable. If you are in a
balancing irq context then not so much.

3) to simplify the 2M-emu loop was extra dressing on the cake:
instead of taking and dropping the PTL 512 times (possibly
interleaving two threads on the same pmd, both of them
taking/dropping the same set of locks?), it only takes the
ptl once.

I'll revive this aspect, it has many positives.

> > > If the bug is indeed here, it's not obvious. I don't know
> > > why I'm triggering it or why it only triggers for specjbb
> > > as I cannot imagine what the JVM would be doing that is
> > > that weird or that would not have triggered before. Maybe
> > > we both suffer this type of problem but that numacores
> > > rate of migration is able to trigger it.
> >
> > Agreed.
> I spent some more time on this today and the bug is *really*
> hard to trigger or at least I have been unable to trigger it
> today. This begs the question why it triggered three times in
> relatively quick succession separated by a few hours when
> testing numacore on Dec 9th. Other tests ran between the
> failures. The first failure results were discarded. I deleted
> them to see if the same test reproduced it a second time (it
> did).
> Of the three times this bug triggered in the last week, two
> were unclear where they crashed but one showed that the bug
> was triggered by the JVMs garbage collector. That at least is
> a corner case and might explain why it's hard to trigger.
> I feel extremely bad about how I reported this because even
> though we differ in how we handle faults, I really cannot see
> any difference that would explain this and I've looked long
> enough. Triggering this by the kernel would *have* to be
> something like missing a cache or TLB flush after page tables
> have been modified or during migration but in most way that
> matters we share that logic. Where we differ, it shouldn't
> matter.

Don't worry, I really think you reported a genuine bug, even if
it's hard to hit.

> FWIW, numacore pulled yesterday completed the same tests
> without any error this time but none of the commits since Dec
> 9th would account for fixing it.

Correct. I think chances are that it's still latent. Either
fixed in your version of the code, which will be hard to
reconstruct - or it's an active upstream bug.

I'd not blame it on the JVM for a good while - JVMs are one of
the most abused pieces of code on the planet, literally running
millions of applications on thousands of kernel variants.

Could you try the patch below on latest upstream with
CONFIG_NUMA_BALANCING=y, it increases migration bandwidth
10-fold - does it make it easier to trigger the bug on the now
upstream NUMA-balancing feature?

It will kill throughput on a number of your tests, but it should
make all the NUMA-specific activities during the JVM test a lot
more frequent.



diff --git a/mm/migrate.c b/mm/migrate.c
index 32efd80..8699e8f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1511,7 +1511,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
static unsigned int migrate_interval_millisecs __read_mostly = 100;
static unsigned int pteupdate_interval_millisecs __read_mostly = 1000;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
+static unsigned int ratelimit_pages __read_mostly = 1280 << (20 - PAGE_SHIFT);

/* Returns true if NUMA migration is currently rate limited */
bool migrate_ratelimited(int node)
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at