Re: [PATCH 4/4] sched/numa: Adjust imb_numa_nr to a better approximation of memory channels

From: Mel Gorman
Date: Wed May 18 2022 - 13:06:38 EST


On Wed, May 18, 2022 at 04:05:03PM +0200, Peter Zijlstra wrote:
> On Wed, May 18, 2022 at 12:15:39PM +0100, Mel Gorman wrote:
>
> > I'm not aware of how it can be done in-kernel on a cross-architectural
> > basis. Reading through the arch manual, it states how many channels are
> > in a given processor family and it's available during memory check errors
> > (apparently via the EDAC driver). It's sometimes available via PMUs but
> > I couldn't find a place where it's generically available for topology.c
> > that would work on all x86-64 machines, let alone every other architecture.
>
> So provided it is something we want (below) we can always start an arch
> interface and fill it out where needed.
>

It could start with a function returning a fixed value that architectures
can override, but discovering and wiring it all up could be a deep
rabbit hole. The most straightforward approach would be based on CPU
family and model, but that would be time-consuming to maintain, and it
gets fuzzy for something like PowerKVM where the channel details are
hidden.
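
To make that concrete, here is a minimal sketch of the "fixed value that
architectures can override" idea. The function name and the default of 4
channels are made up, this is not proposing an actual interface:

  /* kernel/sched/topology.c: hypothetical generic fallback */
  #include <linux/compiler.h>

  int __weak arch_node_mem_channels(int node)
  {
          /* Pure guess when the architecture knows nothing better. */
          return 4;
  }

An architecture would then supply a strong definition of
arch_node_mem_channels(), e.g. keyed off CPU family/model or filled in
from SMBIOS later in boot, and imb_numa_nr would only ever see a
best-effort channel count.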

> > It's not even clear if SMBIOS was parsed in early boot whether
>
> We can always rebuild topology / update variables slightly later in
> boot.
>
> > it's a
> > good idea. It could result in different imbalance thresholds for each
> > NUMA domain or weird corner cases where asymmetric NUMA node populations
> > would result in run-to-run variance that is difficult to analyse.
>
> Yeah, maybe. OTOH having a magic value that's guesstimated based on
> hardware of the day is something that'll go bad at any moment as well.
>
> I'm not too worried about run-to-run since people don't typically change
> DIMM population over a reboot, but yes, there's always going to be
> corner cases. Same with a fixed value though, that's also going to be
> wrong.
>

By run-to-run, I mean running the same workload in a loop without
rebooting between runs. If nodes are populated differently, there will
be some run-to-run variance based purely on which node the workload
starts on, because each node would have a different "allowed imbalance"
threshold.

I'm running the tests to recheck exactly how much impact this patch has
on the peak performance. It takes a few hours, so I won't have anything
until tomorrow.

Initially "get peak performance" and "stabilise run-to-run variances"
were my objectives. This series only aimed at the peak performance for a
finish as allowed NUMA imbalance was not the sole cause of the problem.
I still haven't spent time figuring out why c6f886546cb8 ("sched/fair:
Trigger the update of blocked load on newly idle cpu") made such a big
difference to variability.

--
Mel Gorman
SUSE Labs