Re: [PATCH] nohz: nohz idle balancing per node

From: Mel Gorman
Date: Fri Jul 02 2021 - 04:49:59 EST


On Fri, Jul 02, 2021 at 09:33:45AM +1000, Nicholas Piggin wrote:
> Excerpts from Mel Gorman's message of July 1, 2021 11:11 pm:
> > On Thu, Jul 01, 2021 at 12:18:18PM +0200, Peter Zijlstra wrote:
> >> On Thu, Jul 01, 2021 at 03:53:23PM +1000, Nicholas Piggin wrote:
> >> > Currently a single nohz idle CPU is designated to perform balancing on
> >> > behalf of all other nohz idle CPUs in the system. Implement a per node
> >> > nohz balancer to minimize cross-node memory accesses and runqueue lock
> >> > acquisitions.
> >> >
> >> > On a 4 node system, this improves performance by 9.3% on a 'pgbench -N'
> >> > with 32 clients/jobs (which is about where throughput maxes out due to
> >> > IO and contention in postgres).
> >>
> >> Hmm, Suresh tried something like this around 2010 and then we ran into
> >> trouble that when once node went completely idle and another node was
> >> fully busy, the completely idle node would not run ILB and the node
> >> would forever stay idle.
> >>
> >
> > An effect like that *might* be visible at
> > https://beta.suse.com/private/mgorman/melt/v5.13/3-perf-test/sched/sched-nohznuma-v1r1/html/network-tbench/hardy2/
> > at the CPU usage heatmaps ordered by topology at the very bottom of
> > the page.
> >
> > The heatmap covers all client counts so there are "blocks" of activity for
> > each client count tested. The third block is for 8 thread counts so a node
> > is not fully busy yet.
>
> I'm not sure what I'm looking at. Where are these blocks? Along the x
> axis?
>

The X axis is time. Each row is a CPU with a vertical line colored based on
the utilisation (white for idle, green for low utilisation, red for higher
utilisation). Along the Y axis, for a 2-socket machine, the top half is
one node, the bottom half is the second node. Each "pair" of rows, where
pairs are indicated on the left with the CPU number, are SMT siblings.

The "blocks" along the X axis each represent 3 minutes running the
benchmark for a given client count, which is why the pattern changes as
it starts at 1 client and increases the client count over time.
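
As a rough illustration of the row ordering described above, here is a
minimal sketch. The topology table is invented for the example (2 nodes,
2 cores per node, 2 SMT threads per core) and does not correspond to any
of the test machines; rows_by_topology() is a hypothetical helper, not
part of any reporting tool.

```python
# Hypothetical topology: cpu number -> (node, core within node).
# The numbering is made up for illustration only.
topology = {
    0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1),
    4: (0, 0), 5: (0, 1), 6: (1, 0), 7: (1, 1),
}


def rows_by_topology(topology):
    """Order CPUs so each node is contiguous and SMT siblings are adjacent."""
    return sorted(
        topology,
        key=lambda cpu: (topology[cpu][0], topology[cpu][1], cpu),
    )


rows = rows_by_topology(topology)
top_half = rows[: len(rows) // 2]      # rows for the first node
bottom_half = rows[len(rows) // 2 :]   # rows for the second node
```

With this ordering, a node that goes completely idle would show up as a
contiguous band of idle rows rather than being scattered through the plot.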

> > However, with the vanilla kernel, there is some
> > load on each node but with the patch all the load is on one node. This
> > did not happen on the two other test machines so the observation is not
> > reliable and could be a total coincidence.
>
> tbench is pretty finicky so it could be.
>

It's even likely. It's also not triggering the situation Peter described --
"once node went completely idle and another node was fully busy", tbench
doesn't do this except by accident.

> >
> > That said, there were some gains but large losses depending on the client
> > count across the 3 machines for tbench which is a concern. Other results,
> > like pgbench mentioned in the changelog, will not complete until tomorrow
> > to see if it is a general pattern or tbench-specific.
> >
> > https://beta.suse.com/private/mgorman/melt/v5.13/3-perf-test/sched/sched-nohznuma-v1r1/html/network-tbench/bing2/
> > https://beta.suse.com/private/mgorman/melt/v5.13/3-perf-test/sched/sched-nohznuma-v1r1/html/network-tbench/hardy2/
> > https://beta.suse.com/private/mgorman/melt/v5.13/3-perf-test/sched/sched-nohznuma-v1r1/html/network-tbench/marvin2/
>
> All 2-node.

Yes, I only use a limited set of machines initially. It's only when
2-node passes that a series may get evaluated on large machines, zen*
generations etc.

> How many runs does it do at each client count? There's a big
> regression at one client with one of them, but the other two have small
> gains.
>

tbench can be finicky so I treat it with caution even though I generally
use it as a sniff test. Only one iteration is run per thread count. What is
graphed is the reported throughput over time which is a dangerous metric
because each datapoint is average throughput since the test started. A more
robust metric would be to run the benchmark multiple times taking the final
throughput for each iteration. There are a few reasons why I didn't do that:

1. If the reported throughput is highly variable over time, that is an
   interesting result in itself as it can imply that steady progress is
   not being made.

2. Looking at how the workload behaves during an iteration is useful.

3. It takes an unreasonable amount of time to run it multiple times and
   the full mix of scheduler tests I ran takes almost a day per tested
   kernel. For example, these tests are not even complete yet and
   probably won't be until late tonight.
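
To make the concern about "average throughput since the test started"
concrete, here is a minimal sketch. It assumes equal-length reporting
intervals, and the numbers are invented, not taken from any of the runs
above. The function names are hypothetical and are not part of tbench.

```python
def running_average(interval_rates):
    """Average throughput since the start, as reported after each interval."""
    averages = []
    total = 0.0
    for count, rate in enumerate(interval_rates, start=1):
        total += rate
        averages.append(total / count)
    return averages


def per_interval(averages):
    """Recover per-interval throughput from a series of running averages."""
    rates = []
    previous_total = 0.0
    for count, average in enumerate(averages, start=1):
        total = average * count
        rates.append(total - previous_total)
        previous_total = total
    return rates


# Hypothetical per-interval throughput (MB/sec) with a two-interval stall.
rates = [100.0, 100.0, 100.0, 0.0, 0.0, 100.0, 100.0, 100.0]
reported = running_average(rates)
recovered = per_interval(reported)
```

In this made-up series the workload makes no progress at all for two
intervals, yet the reported value never drops below 60 and the final
figure is 75. The stall is smoothed out, and the smoothing gets stronger
the later in the run it happens. Differencing the running averages, as
per_interval() does, recovers the per-interval view.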

I'm well aware of the limitations of testing tbench like this. dbench for
example is a similarly designed workload except dbench gets patched to
report the time taken to process the loadfile once and uses that as a
primary metric instead of estimated throughput.

--
Mel Gorman
SUSE Labs