Group Imbalance bug - performance drop of up to 10x

From: Jirka Hladky
Date: Mon Feb 06 2017 - 18:38:03 EST


Hello,

we observe that the group imbalance bug can cause a performance
degradation of up to 10x on a server with 4 NUMA nodes.

I have opened Bug 194231
https://bugzilla.kernel.org/show_bug.cgi?id=194231
for this issue.

The problem was first described in this paper

http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

in chapter 3.1. The scheduler does not balance load correctly on a
server with 4 NUMA nodes in the following scenario:
* there are three independent ssh connections
* the first two ssh connections each run a single-threaded,
CPU-intensive workload
* the last ssh connection runs a multi-threaded application that
requires almost all cores in the system.

We have used
* stress --cpu 1 as the single-threaded, CPU-intensive workload
http://people.seas.harvard.edu/~apw/stress/
and
* the lu.C.x benchmark from the NAS Parallel Benchmarks suite as the
multi-threaded application
https://www.nas.nasa.gov/publications/npb.html

Version-Release number of selected component (if applicable):
Reproduced on

kernel 4.10.0-0.rc6


How reproducible:

It requires a server with at least 2 NUMA nodes. The problem gets worse
on a server with 4 NUMA nodes.


Steps to Reproduce:
1. Start 3 ssh connections to the server.
2. In the first two ssh connections, run stress --cpu 1
3. In the third ssh connection, run the lu.C.x benchmark with the number
of threads equal to the number of CPUs in the system minus 4.
4. Run either Intel's numatop
echo "N" | numatop -d log >/dev/null 2>&1 &
or mpstat -P ALL 5 and check the load distribution across the NUMA
nodes. The mpstat output can be processed by the mpstat2node.py utility
to aggregate data across NUMA nodes
https://github.com/jhladka/MPSTAT2NODE/blob/master/mpstat2node.py

mpstat -P ALL 5 | mpstat2node.py --lscpu <(lscpu)

5. Compare the results against the same workload started from ONE ssh
session (all processes are in one group)
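
The per-node aggregation used in step 4 can be sketched roughly as
follows. This is not the code of mpstat2node.py itself; the function
name aggregate_by_node and the two dictionaries are hypothetical, and
the sample numbers are made up purely for illustration.

```python
from collections import defaultdict


def aggregate_by_node(cpu_usr, cpu_to_node):
    """Average per-CPU %usr values over the CPUs of each NUMA node.

    cpu_usr:     dict mapping CPU id -> %usr (e.g. taken from mpstat -P ALL)
    cpu_to_node: dict mapping CPU id -> NUMA node id (e.g. taken from lscpu)
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cpu, usr in cpu_usr.items():
        node = cpu_to_node[cpu]
        totals[node] += usr
        counts[node] += 1
    # One average per node, ordered by node id.
    return {node: totals[node] / counts[node] for node in sorted(totals)}


# Made-up sample: 4 CPUs spread over 2 NUMA nodes.
usage = {0: 100.0, 1: 50.0, 2: 0.0, 3: 10.0}
topology = {0: 0, 1: 0, 2: 1, 3: 1}
print(aggregate_by_node(usage, topology))
```

A large gap between the per-node averages, as in the made-up sample, is
the symptom to look for.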


Actual results:

Uneven load across NUMA nodes:
Average:  NODE   %usr  %idle
Average:   all  66.12  33.51
Average:     0  37.97  61.74
Average:     1  31.67  68.15
Average:     2  97.50   1.98
Average:     3  97.33   2.19

Please note that although there are 62 CPU-intensive threads on this
64-CPU system, NUMA nodes #0 and #1 are underutilized.

The real runtime of the lu.C.x benchmark went up from 114 seconds
to 846 seconds!
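
To put the reported numbers side by side, here is a small arithmetic
sketch. It only reuses the values quoted in this report; the variable
names are illustrative. It assumes that 114 seconds is the runtime of
the single-session case, that all four nodes have the same number of
CPUs, and that each CPU-intensive thread could keep one CPU fully busy.

```python
# %usr per NUMA node and for the whole system, copied from the report.
node_usr = {0: 37.97, 1: 31.67, 2: 97.50, 3: 97.33}
all_usr = 66.12

cpus = 64            # CPUs in the system
busy_threads = 62    # CPU-intensive threads (2x stress + lu.C.x threads)

baseline_s = 114     # assumed: runtime when started from one ssh session
degraded_s = 846     # runtime when started from three ssh sessions

# Sanity check: with equal-sized nodes, the mean over the nodes
# should match the "all" row.
mean_usr = sum(node_usr.values()) / len(node_usr)

# Upper bound on utilization if every thread kept one CPU fully busy.
ideal_usr = 100.0 * busy_threads / cpus

# Gap between the busiest and the least busy node, in percentage points.
spread = max(node_usr.values()) - min(node_usr.values())

slowdown = degraded_s / baseline_s

print(f"mean over nodes: {mean_usr:.2f} (reported all: {all_usr})")
print(f"ideal utilization: {ideal_usr:.1f}")
print(f"node spread: {spread:.2f}")
print(f"slowdown: {slowdown:.2f}x")
```

For this run the slowdown works out to roughly 7.4x, and the observed
system-wide %usr of 66.12 is far below the roughly 96.9 that 62 busy
threads on 64 CPUs could reach.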

Expected results:

The load is evenly balanced across all NUMA nodes. The real runtime of
the lu.C.x benchmark is the same regardless of whether the jobs were
started from one ssh session or from multiple ssh sessions.

Additional info:

See
https://github.com/jplozi/wastedcores/blob/master/patches/group_imbalance_linux_4.1.patch
for a proposed patch against kernel 4.1.

I will upload a reproducer to the bug report
https://bugzilla.kernel.org/show_bug.cgi?id=194231

Thanks a lot!
Jirka