Re: [PATCH 0/18] Basic scheduler support for automatic NUMA balancing V5

From: Mel Gorman
Date: Wed Jul 31 2013 - 12:11:56 EST


On Wed, Jul 31, 2013 at 05:30:18PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 31, 2013 at 12:57:19PM +0100, Mel Gorman wrote:
>
> > > Right, so what Ingo did is have the scan rate depend on the convergence.
> > > What exactly did you dislike about that?
> > >
> >
> > It depended entirely on properly detecting whether we are converged or
> > not. As things like false sharing detection within THP are still not
> > there, I was worried that it was too easy to make the wrong decision
> > here and keep it pinned at the maximum scan rate.
> >
> > > We could define the convergence as all the faults inside the interleave
> > > mask vs the total faults, and then run at: min + (1 - c)*(max-min).
> > >
> >
> > And when we have such things properly in place then I think we can kick
> > away the current crutch.
>
> OK, so I'll go write that patch I suppose ;-)
>
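
For reference, that formula maps convergence onto the scan rate along
these lines. An illustrative user-space sketch with made-up names, not
kernel code:

#include <assert.h>

/*
 * c is the fraction of recent NUMA hinting faults that fell inside the
 * task's interleave mask; 1.0 means fully converged. A fully converged
 * task runs at the minimum scan rate, an unconverged one at the maximum.
 */
static unsigned long scan_rate(double c, unsigned long min, unsigned long max)
{
	assert(min <= max);
	if (c < 0.0)
		c = 0.0;
	if (c > 1.0)
		c = 1.0;

	/* rate = min + (1 - c) * (max - min) */
	return min + (unsigned long)((1.0 - c) * (max - min));
}
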
> > > Ah, well the reasoning on that was that all this NUMA business is
> > > 'expensive' so we'd better only bother with tasks that persist long
> > > enough for it to pay off.
> > >
> >
> > Which is fair enough, but tasks that lasted *just* longer than the
> > interval still got punished. A process running on a slightly slower CPU
> > would get hurt, making it a difficult bug report to digest.
> >
> > > In that regard it makes perfect sense to wait a fixed amount of runtime
> > > before we start scanning.
> > >
> > > So it was not a pure hack to make kbuild work again... what it did
> > > was good though.
> > >
> >
> > Maybe we should reintroduce the delay then, but I really would prefer
> > that it were triggered by some sort of event.
>
> Humm:
>
> kernel/sched/fair.c:
>
> /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
> unsigned int sysctl_numa_balancing_scan_delay = 1000;
>
>
> kernel/sched/core.c:__sched_fork():
>
> numa_scan_period = sysctl_numa_balancing_scan_delay
>
>
> It seems it's still there, no need to resuscitate.
>

Yes, reverting 5bca23035391928c4c7301835accca3551b96cc2 effectively restores
the behaviour you are looking for. It just seems very crude. Then again,
I also should not have left the scan delay on top of the first_nid
check.
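
To spell out the mechanism for anyone following along (a simplified
sketch from memory, not the verbatim fair.c code; the constants are
illustrative): __sched_fork() seeds numa_scan_period with the delay, so
the first scan is only queued after scan_delay ms of runtime, and the
period then drops to the normal minimum.

#define NSEC_PER_MSEC		1000000ULL
#define SCAN_DELAY_MS		1000	/* sysctl_numa_balancing_scan_delay */
#define SCAN_PERIOD_MIN_MS	100	/* normal minimum scan period */

struct task {
	unsigned long long runtime_ns;	/* se.sum_exec_runtime */
	unsigned long long node_stamp;	/* runtime at last scan */
	unsigned int numa_scan_period;	/* ms, seeded with the delay */
};

/* Called from the tick: nothing is queued until the task has
 * accumulated scan_delay ms of runtime. */
static int should_queue_scan(struct task *p)
{
	unsigned long long period =
		(unsigned long long)p->numa_scan_period * NSEC_PER_MSEC;

	if (p->runtime_ns - p->node_stamp <= period)
		return 0;

	/* First scan after the initial delay; run at the normal rate
	 * from here on. */
	if (!p->node_stamp)
		p->numa_scan_period = SCAN_PERIOD_MIN_MS;
	p->node_stamp = p->runtime_ns;
	return 1;
}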

> I share your preference for a clear event, although nothing really comes
> to mind. The entire multi-process space seems devoid of useful triggers.
>

RSS was another option but it felt as arbitrary as a plain delay.

Should I revert 5bca23035391928c4c7301835accca3551b96cc2 with an
explanation that it potentially is completely useless in the purely
multi-process shared case?
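
If we did go the RSS route, I imagine it would have looked something
like this. Purely hypothetical sketch; every name in it is invented and
nothing like it exists in the tree:

/* Gate the start of NUMA scanning on process RSS instead of a plain
 * runtime delay; small processes are unlikely to repay the cost of
 * scanning. */
struct mm_stats {
	unsigned long rss_pages;
};

static const unsigned long rss_threshold_pages = 32768; /* 128MB at 4K pages */

static int numa_scan_worthwhile(const struct mm_stats *mm)
{
	return mm->rss_pages >= rss_threshold_pages;
}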

> > > On that rate-limit, this looks to be a hard-coded number unrelated to
> > > the actual hardware.
> >
> > Guesstimate.
> >
> > > I think we should at the very least make it a
> > > configurable number and preferably scale the number with the SLIT info.
> > > Or alternatively actually measure the node to node bandwidth.
> > >
> >
> > Ideally we should just kick it away once scan rate limiting works
> > properly. Let's not make it a tunable just yet so we can avoid having
> > to deprecate it later.
>
> I'm not seeing how the rate-limit as per the convergence is going to
> help here.

It should reduce the potential number of NUMA hinting faults that can be
incurred. However, I accept your point because even then it does not
directly avoid a large number of migration events.

> Suppose we migrate the task to another node and it's going to
> stay there. Then our convergence is going down to 0 (all our memory is
> remote) so we end up at the max scan rate migrating every single page
> ASAP.
>
> This would completely and utterly saturate any interconnect.
>

Good point, and we'd arrive back at rate limiting the migration in an
attempt to avoid it.
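
The crude version of that is a per-node window, along these lines
(illustrative sketch with made-up names and values, not the exact code
in the series):

/* Allow at most RATELIMIT_PAGES of NUMA migrations into a node per
 * window; anything over budget is simply not migrated. */
struct node_ratelimit {
	unsigned long long window_end_ms;
	unsigned long nr_pages;		/* pages migrated this window */
};

#define RATELIMIT_WINDOW_MS	1000
#define RATELIMIT_PAGES		(256 << (20 - 12))	/* ~256MB/sec at 4K pages */

static int migrate_allowed(struct node_ratelimit *rl,
			   unsigned long long now_ms, unsigned long nr_pages)
{
	if (now_ms > rl->window_end_ms) {
		rl->window_end_ms = now_ms + RATELIMIT_WINDOW_MS;
		rl->nr_pages = 0;
	}
	if (rl->nr_pages + nr_pages > RATELIMIT_PAGES)
		return 0;	/* over budget, skip the migration */
	rl->nr_pages += nr_pages;
	return 1;
}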

> Also, in the case we don't have a fully connected system the memory
> transfers will need multiple hops, which greatly complicates the entire
> accounting trick :-)
>

Also unfortunately true. The larger the machine, the more likely this
becomes.
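
If the limit were ever scaled with the SLIT info, presumably it would
weight by node_distance(). A hypothetical sketch, with node_distance()
stubbed out for illustration and an invented scaling policy:

#define LOCAL_DISTANCE	10	/* SLIT distance of a node to itself */

/* Stub SLIT table; the kernel's node_distance() would be used here. */
static const int slit[2][2] = { { 10, 20 }, { 20, 10 } };

static int node_distance(int from, int to)
{
	return slit[from][to];
}

/* Scale the migration budget inversely with the SLIT distance so that
 * distant (and likely multi-hop) transfers get a smaller share of the
 * interconnect. */
static unsigned long scaled_ratelimit(unsigned long base_pages,
				      int src_nid, int dst_nid)
{
	return base_pages * LOCAL_DISTANCE / node_distance(src_nid, dst_nid);
}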

--
Mel Gorman
SUSE Labs