Re: [PATCH 1/3] sched: remove select_idle_core() for scalability

From: Subhra Mazumdar
Date: Mon Apr 30 2018 - 19:36:33 EST

On 04/25/2018 10:49 AM, Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 02:45:50PM -0700, Subhra Mazumdar wrote:
>> So what you said makes sense in theory but is not borne out by real
>> world results. This indicates that threads of these benchmarks care more
>> about running immediately on any idle cpu rather than spending time to
>> find a fully idle core to run on.
> But you only ran on Intel, which enumerates siblings far apart in the
> cpuid space. Which is not something we should rely on.

>>> So by only doing a linear scan on CPU number you will actually fill
>>> cores instead of equally spreading across cores. Worse still, by
>>> limiting the scan to _4_ you only barely even get onto a next core for
>>> SMT4 hardware, never mind SMT8.
>> Again this doesn't matter for the benchmarks I ran. Most are happy to
>> make the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the
>> fact that the scan window is rotated over all cpus, so idle cpus will
>> be found soon.
> You've not been reading well. The Intel machine you tested this on most
> likely doesn't suffer that problem because of the way it happens to
> iterate SMT threads.

> How does Sparc iterate its SMT siblings in cpuid space?
SPARC enumerates the siblings of a core sequentially. Whether the
non-sequential enumeration on x86 is what the improvements depend on
still needs to be confirmed by testing, but I don't have a SPARC test
system handy right now.
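
To make the enumeration point concrete, here is a toy user-space sketch
(the constants and the two mapping functions below are invented for
illustration, not taken from any actual topology code) of where a
limited linear scan over cpu ids lands under the two styles:

/* Toy illustration: 4 cores, SMT2, 8 logical cpus. */
#include <stdio.h>

#define NR_CORES 4
#define SMT      2

/* x86-style: siblings far apart; cpu i and cpu i + NR_CORES share a core */
static int core_x86(int cpu)   { return cpu % NR_CORES; }

/* SPARC-style: siblings adjacent; cpus 2i and 2i + 1 share a core */
static int core_sparc(int cpu) { return cpu / SMT; }

int main(void)
{
        /* a linear scan limited to the first 4 cpu ids ... */
        for (int cpu = 0; cpu < 4; cpu++)
                printf("cpu %d -> x86 core %d, sparc core %d\n",
                       cpu, core_x86(cpu), core_sparc(cpu));
        /*
         * ... lands on 4 distinct cores with the x86-style mapping
         * (spreads across cores) but only on cores 0 and 1 with the
         * SPARC-style mapping (packs cores), which is the concern.
         */
        return 0;
}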

> Also, your benchmarks chose an unfortunate nr of threads vs topology.
> The 2^n thing chosen never hits the 100% core case (6,22 resp.).

>>> So while I'm not averse to limiting the empty core search, I do feel
>>> it is important to have. Overloading cores when you don't have to is
>>> not good.
>> Can we have a config or a way for enabling/disabling select_idle_core?
> I like Rohit's suggestion of folding select_idle_core and
> select_idle_cpu much better, then it stays SMT aware.

> Something like the completely untested patch below.
I tried both of the patches you suggested: first the merge of
select_idle_core and select_idle_cpu, then the new way of calculating
avg_idle, and finally the two combined. I ran the following benchmarks
for each. The merge-only patch gives improvements similar to my
original patch for the Uperf and Oracle DB tests, but it regresses on
hackbench. If we can fix that regression I am OK with it, and I can do
a run of the other benchmarks after that.

I also noticed a possible bug later in the merge code. Shouldn't it be:

if (busy < best_busy) {
        best_busy = busy;
        best_cpu = first_idle;
}

Unfortunately I noticed it only after all the runs.
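
To spell out why both assignments belong inside the if, here is a
standalone toy version of the scan's bookkeeping (the topology and the
idle[] state below are made up for illustration; only the
first_idle/best_busy/best_cpu logic mirrors my reading of the merge
code):

/*
 * Toy model of the merged scan's bookkeeping: walk cores, count busy
 * siblings per core, remember the first idle cpu of the least-busy
 * core seen so far. Topology and idle state are invented.
 */
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CORES 4
#define SMT      2

static bool idle[NR_CORES][SMT] = {
        { false, false },       /* core 0: fully busy */
        { false, true  },       /* core 1: one idle sibling */
        { false, true  },       /* core 2: one idle sibling */
        { false, false },       /* core 3: fully busy */
};

static int cpu_id(int core, int thr) { return core * SMT + thr; }

int main(void)
{
        int best_busy = INT_MAX, best_cpu = -1;

        for (int core = 0; core < NR_CORES; core++) {
                int busy = 0, first_idle = -1;

                for (int thr = 0; thr < SMT; thr++) {
                        if (!idle[core][thr])
                                busy++;
                        else if (first_idle < 0)
                                first_idle = cpu_id(core, thr);
                }

                if (!busy) {            /* fully idle core wins outright */
                        best_cpu = first_idle;
                        break;
                }

                if (first_idle >= 0 && busy < best_busy) {
                        best_busy = busy;       /* both updates belong */
                        best_cpu = first_idle;  /* inside the if */
                }
        }

        printf("picked cpu %d\n", best_cpu);    /* cpu 3: core 1, thread 1 */
        return 0;
}

The point of the fix is that best_cpu must only advance together with
best_busy, so the pick keeps tracking the least-busy core scanned.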

merge:

Hackbench process on a 2-socket, 44-core, 88-thread Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch             %stdev
1       0.5742    21.13   0.5099 (11.2%)     2.24
2       0.5776     7.87   0.5385 (6.77%)     3.38
4       0.9578     1.12   1.0626 (-10.94%)   1.35
8       1.7018     1.35   1.8615 (-9.38%)    0.73
16      2.9955     1.36   3.2424 (-8.24%)    0.66
32      5.4354     0.59   5.749  (-5.77%)    0.55

Uperf pingpong on a 2-socket, 44-core, 88-thread Intel x86 machine with
message size = 8k (higher is better):
threads  baseline  %stdev  patch             %stdev
8         49.47     0.35    49.98 (1.03%)     1.36
16        95.28     0.77    97.46 (2.29%)     0.11
32       156.77     1.17   167.03 (6.54%)     1.98
48       193.24     0.22   230.96 (19.52%)    2.44
64       216.21     9.33   299.55 (38.54%)    4
128      379.62    10.29   357.87 (-5.73%)    0.85

Oracle DB on a 2-socket, 44-core, 88-thread Intel x86 machine
(normalized, higher is better):
users  baseline  %stdev  patch             %stdev
20     1          1.35   0.9919 (-0.81%)    0.14
40     1          0.42   0.9959 (-0.41%)    0.72
60     1          1.54   0.9872 (-1.28%)    1.27
80     1          0.58   0.9925 (-0.75%)    0.5
100    1          0.77   1.0145 (1.45%)     1.29
120    1          0.35   1.0136 (1.36%)     1.15
140    1          0.19   1.0404 (4.04%)     0.91
160    1          0.09   1.0317 (3.17%)     1.41
180    1          0.99   1.0322 (3.22%)     0.51
200    1          1.03   1.0245 (2.45%)     0.95
220    1          1.69   1.0296 (2.96%)     2.83

new avg_idle:

Hackbench process on a 2-socket, 44-core, 88-thread Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch             %stdev
1       0.5742    21.13   0.5241 (8.73%)     8.26
2       0.5776     7.87   0.5436 (5.89%)     8.53
4       0.9578     1.12   0.989  (-3.26%)    1.9
8       1.7018     1.35   1.7568 (-3.23%)    1.22
16      2.9955     1.36   3.1119 (-3.89%)    0.92
32      5.4354     0.59   5.5889 (-2.82%)    0.64

Uperf pingpong on a 2-socket, 44-core, 88-thread Intel x86 machine with
message size = 8k (higher is better):
threads  baseline  %stdev  patch              %stdev
8         49.47     0.35    48.11 (-2.75%)     0.29
16        95.28     0.77    93.67 (-1.68%)     0.68
32       156.77     1.17   158.28 (0.96%)      0.29
48       193.24     0.22   190.04 (-1.66%)     0.34
64       216.21     9.33   189.45 (-12.38%)    2.05
128      379.62    10.29   326.59 (-13.97%)   13.07

Oracle DB on a 2-socket, 44-core, 88-thread Intel x86 machine
(normalized, higher is better):
users  baseline  %stdev  patch             %stdev
20     1          1.35   1.0026 (0.26%)     0.25
40     1          0.42   0.9857 (-1.43%)    1.47
60     1          1.54   0.9903 (-0.97%)    0.99
80     1          0.58   0.9968 (-0.32%)    1.19
100    1          0.77   0.9933 (-0.67%)    0.53
120    1          0.35   0.9919 (-0.81%)    0.9
140    1          0.19   0.9915 (-0.85%)    0.36
160    1          0.09   0.9811 (-1.89%)    1.21
180    1          0.99   1.0002 (0.02%)     0.87
200    1          1.03   1.0037 (0.37%)     2.5
220    1          1.69   0.998  (-0.2%)     0.8

merge + new avg_idle:

Hackbench process on a 2-socket, 44-core, 88-thread Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch              %stdev
1       0.5742    21.13   0.6522 (-13.58%)   12.53
2       0.5776     7.87   0.7593 (-31.46%)    2.7
4       0.9578     1.12   1.0952 (-14.35%)    1.08
8       1.7018     1.35   1.8722 (-10.01%)    0.68
16      2.9955     1.36   3.2987 (-10.12%)    0.58
32      5.4354     0.59   5.7751 (-6.25%)     0.46

Uperf pingpong on a 2-socket, 44-core, 88-thread Intel x86 machine with
message size = 8k (higher is better):
threads  baseline  %stdev  patch             %stdev
8         49.47     0.35    51.29 (3.69%)     0.86
16        95.28     0.77    98.95 (3.85%)     0.41
32       156.77     1.17   165.76 (5.74%)     0.26
48       193.24     0.22   234.25 (21.22%)    0.63
64       216.21     9.33   306.87 (41.93%)    2.11
128      379.62    10.29   355.93 (-6.24%)    8.28

Oracle DB on a 2-socket, 44-core, 88-thread Intel x86 machine
(normalized, higher is better):
users  baseline  %stdev  patch             %stdev
20     1          1.35   1.0085 (0.85%)     0.72
40     1          0.42   1.0017 (0.17%)     0.3
60     1          1.54   0.9974 (-0.26%)    1.18
80     1          0.58   1.0115 (1.15%)     0.93
100    1          0.77   0.9959 (-0.41%)    1.21
120    1          0.35   1.0034 (0.34%)     0.72
140    1          0.19   1.0123 (1.23%)     0.93
160    1          0.09   1.0057 (0.57%)     0.65
180    1          0.99   1.0195 (1.95%)     0.99
200    1          1.03   1.0474 (4.74%)     0.55
220    1          1.69   1.0392 (3.92%)     0.36