Re: [PATCH 0/18] Basic scheduler support for automatic NUMA balancing V5

From: Peter Zijlstra
Date: Wed Jul 31 2013 - 06:48:30 EST


On Wed, Jul 31, 2013 at 11:30:52AM +0100, Mel Gorman wrote:
> I'm not sure I understand your point. The scan rate is decreased again if
> the page is found to be properly placed in the future. It's in the next
> hunk you modify, although the "reset periodically" comment is now out of date.

Yeah, it's because of the next hunk. I figured that if we don't lower it,
we shouldn't raise it either.

> > @@ -1167,10 +1171,20 @@ void task_numa_fault(int last_nidpid, in
> >  	/*
> >  	 * If pages are properly placed (did not migrate) then scan slower.
> >  	 * This is reset periodically in case of phase changes
> > -	 */
> > -	if (!migrated)
> > +	 *
> > +	 * APZ: it seems to me that one can get a ton of !migrated faults;
> > +	 * consider the scenario where two threads fight over a shared memory
> > +	 * segment. We'll win half the faults, half of that will be local, half
> > +	 * of that will be remote. This means we'll see 1/4-th of the total
> > +	 * memory being !migrated. Using a fixed increment will completely
> > +	 * flatten the scan speed for a sufficiently large workload. Another
> > +	 * scenario is due to that migration rate limit.
> > +	 *
> > +	if (!migrated) {
> >  		p->numa_scan_period = min(p->numa_scan_period_max,
> >  			p->numa_scan_period + jiffies_to_msecs(10));
> > +	}
> > +	 */
>
> FWIW, I'm also not happy with how the scan rate is reduced, but I did not
> come up with a better alternative that was not fragile and did not depend
> on gathering too much state. Granted, I also have not been treating it as
> a high priority problem.

Right, so what Ingo did is have the scan rate depend on the convergence.
What exactly did you dislike about that?

We could define the convergence c as the faults inside the interleave
mask versus the total faults, and then run at: min + (1 - c) * (max - min).
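
Roughly, a sketch of that could look like the below; purely illustrative,
the field and function names do not exist in this series, and the min/max
periods just stand in for p->numa_scan_period_{min,max}:

static unsigned int numa_scan_period_from_convergence(unsigned long interleave_faults,
                                                      unsigned long total_faults,
                                                      unsigned int period_min,
                                                      unsigned int period_max)
{
        /* c, as a percentage: fraction of faults inside the interleave mask */
        unsigned long c = total_faults ? (100UL * interleave_faults) / total_faults : 0;

        /* period = min + (1 - c) * (max - min), done in integer percent */
        return period_min + ((100 - c) * (period_max - period_min)) / 100;
}

A fully converged task (c = 1) then scans at the minimum rate, and a task
with no faults inside the interleave mask scans at the maximum rate.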

> > +#if 0
> >  	/*
> >  	 * We do not care about task placement until a task runs on a node
> >  	 * other than the first one used by the address space. This is
> >  	 * largely because migrations are driven by what CPU the task
> >  	 * is running on. If it's never scheduled on another node, it'll
> >  	 * not migrate so why bother trapping the fault.
> > +	 *
> > +	 * APZ: seems like a bad idea for pure shared memory workloads.
> >  	 */
> >  	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
> >  		mm->first_nid = numa_node_id();
>
> At some point in the past, scan starts were based on waiting a fixed
> interval, but that seemed like a hack designed to avoid hurting kernel
> compile benchmarks. I'll give it more thought and see if I can think of a
> better alternative that is based on an event, but not this event.

Ah, well, the reasoning there was that all this NUMA business is
'expensive', so we'd better only bother with tasks that persist long
enough for it to pay off.

In that regard it makes perfect sense to wait a fixed amount of runtime
before we start scanning.

So it was not a pure hack to make kbuild work again... that it did was
good, though.
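
For what it's worth, the runtime-gated start amounts to something like
the following; the helper name and the one-second threshold are only
illustrative, sum_exec_runtime is the usual per-task runtime counter:

static bool numa_scan_has_paid_off(struct task_struct *p)
{
        /* Total runtime accumulated by this task, in nanoseconds. */
        u64 runtime = p->se.sum_exec_runtime;

        /* Illustrative threshold: roughly one second of runtime. */
        u64 threshold = 1000 * NSEC_PER_MSEC;

        return runtime >= threshold;
}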

> > @@ -1254,9 +1272,14 @@ void task_numa_work(struct callback_head
> >  	 * Do not set pte_numa if the current running node is rate-limited.
> >  	 * This loses statistics on the fault but if we are unwilling to
> >  	 * migrate to this node, it is less likely we can do useful work
> > -	 */
> > +	 *
> > +	 * APZ: seems like a bad idea; even if this node can't migrate anymore
> > +	 * other nodes might and we want up-to-date information to do balance
> > +	 * decisions.
> > +	 *
> >  	if (migrate_ratelimited(numa_node_id()))
> >  		return;
> > +	 */
> >
>
> Ingo also disliked this but I wanted to avoid a situation where the
> workload suffered because of a corner case where the interconnect was
> filled with migration traffic.

Right, but you already rate-limit the actual migrations; this should
leave enough bandwidth to allow the non-migrating scanning.

I think it's important we keep up-to-date information if we're going to
do placement based on it.

On that rate limit: it looks to be a hard-coded number unrelated to the
actual hardware. I think we should at the very least make it a
configurable number, and preferably scale it with the SLIT info, or
alternatively actually measure the node-to-node bandwidth.
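
Something along these lines, say, where the page budget shrinks with the
SLIT distance; node_distance() and LOCAL_DISTANCE are the usual topology
helpers, while base_ratelimit_pages is a made-up stand-in for the current
hard-coded constant:

static unsigned long numa_migrate_ratelimit_pages(int src_nid, int dst_nid,
                                                  unsigned long base_ratelimit_pages)
{
        int dist = node_distance(src_nid, dst_nid);

        /* Don't let a broken SLIT push the budget above the baseline. */
        if (dist < LOCAL_DISTANCE)
                dist = LOCAL_DISTANCE;

        /* Scale the page budget down as the node-to-node distance goes up. */
        return (base_ratelimit_pages * LOCAL_DISTANCE) / dist;
}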
