[RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

From: Srikar Dronamraju
Date: Tue Jul 30 2013 - 03:48:52 EST

Next message: Srikar Dronamraju: "[RFC PATCH 01/10] sched: Introduce per node numa weights"
Previous message: Michal Hocko: "Re: [PATCH resend] drop_caches: add some documentation and infomessage"
Next in thread: Srikar Dronamraju: "[RFC PATCH 01/10] sched: Introduce per node numa weights"

Here is an approach that looks to consolidate workloads across nodes.
In the runs below this improves performance in most cases. Again, I would
assume this work is complementary to Mel's work on numa faulting.

Here are the advantages of this approach.
1. Provides excellent consolidation of tasks.
In my experiments, the better the task consolidation, the better the
memory layout, which in turn results in better performance.

2. Provides good improvement in most cases, but there are some regressions.

3. Looks to extend the load balancer, especially when the cpus are idling.

Here is the outline of the approach.

- Every process has a per node array where we store the weight of all
its tasks running on that node. This array gets updated on task
enqueue/dequeue.

- Added a 2 pass mechanism (somewhat taken from numacore but not
exactly) while choosing tasks to move across nodes.

In the first pass, choose only tasks that are ideal to be moved.
While choosing a task, look at the per node process arrays to see if
moving the task helps.
If the first pass fails to move a task, any task can be chosen on the
second pass.

- If the regular load balancer (rebalance_domain()) fails to balance the
load (or finds no imbalance) and there is an idle cpu, use that cpu to
consolidate tasks to the nodes by using the information in the per
node process arrays.

Every idle cpu, if it doesn't have tasks queued after load balance,
- will walk through the cpus in its node and check if there are buddy
tasks that are not part of the node but ideally should have been
part of this node.
- To make sure that we don't pull all buddy tasks and create an
imbalance, we look at the load on the node, pinned tasks and the
process's contribution to the load for this node.
- Each cpu looks at the node which has the least number of buddy tasks
running and tries to pull the tasks from such nodes.

- Once it finds the cpu from which to pull the tasks, it triggers
active balancing. This type of active balancing triggers just one
pass, i.e. it only fetches tasks that increase numa locality.
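
The bookkeeping and the 2 pass selection in the outline above can be
sketched roughly as below. This is only an illustration, not code from
the patches: MAX_NODES, struct proc_numa, struct task, numa_enqueue(),
numa_dequeue(), move_helps() and pick_task() are all made-up names, and
the "does moving help" test (the process has more weight on the
destination node than it would keep on the source node) is an assumed
heuristic.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8	/* hypothetical bound, for illustration only */

/* Hypothetical per process array: weight of its tasks on each node. */
struct proc_numa {
	unsigned long weight[MAX_NODES];
};

/* Hypothetical task: owning process, own weight, current node (-1 if none). */
struct task {
	struct proc_numa *proc;
	unsigned long weight;
	int node;
};

/* Account the task on the node it is being enqueued on. */
void numa_enqueue(struct task *t, int node)
{
	assert(node >= 0 && node < MAX_NODES);
	t->proc->weight[node] += t->weight;
	t->node = node;
}

/* Remove the task's weight from the node it was queued on. */
void numa_dequeue(struct task *t)
{
	assert(t->node >= 0 && t->node < MAX_NODES);
	assert(t->proc->weight[t->node] >= t->weight);
	t->proc->weight[t->node] -= t->weight;
	t->node = -1;
}

/*
 * Assumed first pass test: moving t to dst helps if its process has more
 * weight on dst than it would still have on the source node without t.
 */
int move_helps(const struct task *t, int dst)
{
	const unsigned long *w = t->proc->weight;

	assert(dst >= 0 && dst < MAX_NODES);
	assert(t->node >= 0 && t->node < MAX_NODES);
	assert(w[t->node] >= t->weight);
	if (dst == t->node)
		return 0;
	return w[dst] > w[t->node] - t->weight;
}

/*
 * Pass 1: return the first candidate whose move helps.
 * Pass 2: if none qualifies, any candidate will do (here, the first one).
 */
struct task *pick_task(struct task **cand, size_t n, int dst)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (move_helps(cand[i], dst))
			return cand[i];
	return n ? cand[0] : NULL;
}
```

A caller would invoke numa_enqueue()/numa_dequeue() wherever a task is
queued or removed, and pick_task() with the candidate tasks of the busy
runqueue and the node of the pulling cpu. The one pass active balancing
described above would correspond to stopping after pass 1.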

Here are the results of specjbb runs on a 2 node machine.
Specjbb was run on 3 vms.
In the fit case, one vm was sized to fit within one node.
In the no-fit case, one vm was bigger than the node size.

-----------------------------------------------------------------------------------------------
|kernel         |               nofit               |                fit                |   vm|
|kernel         |      noksm      |       ksm       |      noksm      |       ksm       |   vm|
|kernel         |   nothp|     thp|   nothp|     thp|   nothp|     thp|   nothp|     thp|   vm|
-----------------------------------------------------------------------------------------------
|v3.9           |  136056|  189423|  135359|  186722|  136983|  191669|  136728|  184253| vm_1|
|v3.9           |   66041|   84779|   64564|   86645|   67426|   84427|   63657|   85043| vm_2|
|v3.9           |   67322|   83301|   63731|   85394|   65015|   85156|   63838|   84199| vm_3|
-----------------------------------------------------------------------------------------------
|v3.9 + Mel(v5) |  133170|  177883|  136385|  176716|  140650|  174535|  132811|  190120| vm_1|
|v3.9 + Mel(v5) |   65021|   81707|   62876|   81826|   63635|   84943|   58313|   78997| vm_2|
|v3.9 + Mel(v5) |   61915|   82198|   60106|   81723|   64222|   81123|   59559|   78299| vm_3|
| % change      |   -2.12|   -6.09|    0.76|   -5.36|    2.68|   -8.94|   -2.86|    3.18| vm_1|
| % change      |   -1.54|   -3.62|   -2.61|   -5.56|   -5.62|    0.61|   -8.39|   -7.11| vm_2|
| % change      |   -8.03|   -1.32|   -5.69|   -4.30|   -1.22|   -4.74|   -6.70|   -7.01| vm_3|
-----------------------------------------------------------------------------------------------
|v3.9 + this    |  136766|  189704|  148642|  180723|  147474|  184711|  139270|  186768| vm_1|
|v3.9 + this    |   72742|   86980|   67561|   91659|   69781|   87741|   65989|   83508| vm_2|
|v3.9 + this    |   66075|   90591|   66135|   90059|   67942|   87229|   66100|   85908| vm_3|
| % change      |    0.52|    0.15|    9.81|   -3.21|    7.66|   -3.63|    1.86|    1.36| vm_1|
| % change      |   10.15|    2.60|    4.64|    5.79|    3.49|    3.93|    3.66|   -1.80| vm_2|
| % change      |   -1.85|    8.75|    3.77|    5.46|    4.50|    2.43|    3.54|    2.03| vm_3|
-----------------------------------------------------------------------------------------------
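
The "% change" rows in the specjbb table appear to be the relative
difference against the v3.9 row of the same vm and column, with a
positive value meaning higher throughput. A small sketch of that
arithmetic follows; the helper name pct_change() is made up here and the
formula is inferred from the numbers, not stated anywhere in the patches.

```c
#include <assert.h>

/*
 * Relative change of a patched kernel's score against the baseline,
 * in percent. Positive means the patched kernel scored higher.
 */
double pct_change(double base, double patched)
{
	assert(base > 0.0);
	return (patched - base) / base * 100.0;
}
```

For example, vm_2 in the nofit/noksm/nothp column goes from 66041 to
72742 with this patchset, which this formula reports as roughly a 10.15%
gain, matching the table.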

Autonuma benchmark results on a 2 node machine:
KernelVersion: 3.9.0
Testcase: Min Max Avg StdDev
numa01: 118.98 122.37 120.96 1.17
numa01_THREAD_ALLOC: 279.84 284.49 282.53 1.65
numa02: 36.84 37.68 37.09 0.31
numa02_SMT: 44.67 48.39 47.32 1.38

KernelVersion: 3.9.0 + Mel's v5
Testcase: Min Max Avg StdDev %Change
numa01: 115.02 123.08 120.83 3.04 0.11%
numa01_THREAD_ALLOC: 268.59 298.47 281.15 11.16 0.46%
numa02: 36.31 37.34 36.68 0.43 1.10%
numa02_SMT: 43.18 43.43 43.29 0.08 9.28%

KernelVersion: 3.9.0 + this patchset
Testcase: Min Max Avg StdDev %Change
numa01: 103.46 112.31 106.44 3.10 12.93%
numa01_THREAD_ALLOC: 277.51 289.81 283.88 4.98 -0.47%
numa02: 36.72 40.81 38.42 1.85 -3.26%
numa02_SMT: 56.50 60.00 58.08 1.23 -17.93%

KernelVersion: 3.9.0(HT)
Testcase: Min Max Avg StdDev
numa01: 241.23 244.46 242.94 1.31
numa01_THREAD_ALLOC: 301.95 307.39 305.04 2.20
numa02: 41.31 43.92 42.98 1.02
numa02_SMT: 37.02 37.58 37.44 0.21

KernelVersion: 3.9.0 + Mel's v5 (HT)
Testcase: Min Max Avg StdDev %Change
numa01: 238.42 242.62 241.60 1.60 0.55%
numa01_THREAD_ALLOC: 285.01 298.23 291.54 5.37 4.53%
numa02: 38.08 38.16 38.11 0.03 12.76%
numa02_SMT: 36.20 36.64 36.36 0.17 2.95%

KernelVersion: 3.9.0 + this patchset(HT)
Testcase: Min Max Avg StdDev %Change
numa01: 175.17 189.61 181.90 5.26 32.19%
numa01_THREAD_ALLOC: 285.79 365.26 305.27 30.35 -0.06%
numa02: 38.26 38.97 38.50 0.25 11.50%
numa02_SMT: 44.66 49.22 46.22 1.60 -17.84%

Autonuma benchmark results on a 4 node machine:
# dmidecode | grep 'Product Name:'
Product Name: System x3750 M4 -[8722C1A]-
# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 65468 MB
node 0 free: 63890 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 65536 MB
node 1 free: 64033 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 65536 MB
node 2 free: 64236 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 65536 MB
node 3 free: 64162 MB
node distances:
node 0 1 2 3
0: 10 11 11 12
1: 11 10 12 11
2: 11 12 10 11
3: 12 11 11 10

KernelVersion: 3.9.0
Testcase: Min Max Avg StdDev
numa01: 581.35 761.95 681.23 80.97
numa01_THREAD_ALLOC: 140.39 164.45 150.34 7.98
numa02: 18.47 20.12 19.25 0.65
numa02_SMT: 16.40 25.30 21.06 2.86

KernelVersion: 3.9.0 + Mel's v5 patchset
Testcase: Min Max Avg StdDev %Change
numa01: 733.15 767.99 748.88 14.51 -8.81%
numa01_THREAD_ALLOC: 154.18 169.13 160.48 5.76 -6.00%
numa02: 19.09 22.15 21.02 1.03 -7.99%
numa02_SMT: 23.01 25.53 23.98 0.83 -11.44%

KernelVersion: 3.9.0 + this patchset
Testcase: Min Max Avg StdDev %Change
numa01: 409.64 457.91 444.55 17.66 51.69%
numa01_THREAD_ALLOC: 158.10 174.89 169.32 5.84 -10.85%
numa02: 18.89 22.36 19.98 1.29 -3.26%
numa02_SMT: 23.33 27.87 25.02 1.68 -14.21%

KernelVersion: 3.9.0 (HT)
Testcase: Min Max Avg StdDev
numa01: 567.62 752.06 620.26 66.72
numa01_THREAD_ALLOC: 145.84 172.44 160.73 10.34
numa02: 18.11 20.06 19.10 0.67
numa02_SMT: 17.59 22.83 19.94 2.17

KernelVersion: 3.9.0 + Mel's v5 patchset (HT)
Testcase: Min Max Avg StdDev %Change
numa01: 741.13 753.91 748.10 4.51 -16.96%
numa01_THREAD_ALLOC: 153.57 162.45 158.22 3.18 1.55%
numa02: 19.15 20.96 20.04 0.64 -4.48%
numa02_SMT: 22.57 25.92 23.87 1.15 -15.16%

KernelVersion: 3.9.0 + this patchset (HT)
Testcase: Min Max Avg StdDev %Change
numa01: 418.46 457.77 436.00 12.81 40.25%
numa01_THREAD_ALLOC: 156.21 169.79 163.75 4.37 -1.78%
numa02: 18.41 20.18 19.06 0.60 0.20%
numa02_SMT: 22.72 27.24 25.29 1.76 -19.64%

Autonuma results on an 8 node machine:

# dmidecode | grep 'Product Name:'
Product Name: IBM x3950-[88722RZ]-

# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32510 MB
node 0 free: 31475 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32512 MB
node 1 free: 31709 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32512 MB
node 2 free: 31737 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32512 MB
node 3 free: 31736 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32512 MB
node 4 free: 31739 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32512 MB
node 5 free: 31639 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 65280 MB
node 6 free: 63836 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 65280 MB
node 7 free: 64043 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 20 20 20 20 20 20 20
1: 20 10 20 20 20 20 20 20
2: 20 20 10 20 20 20 20 20
3: 20 20 20 10 20 20 20 20
4: 20 20 20 20 10 20 20 20
5: 20 20 20 20 20 10 20 20
6: 20 20 20 20 20 20 10 20
7: 20 20 20 20 20 20 20 10

KernelVersion: 3.9.0
Testcase: Min Max Avg StdDev
numa01: 1796.11 1848.89 1812.39 19.35
numa02: 55.05 62.32 58.30 2.37

KernelVersion: 3.9.0-mel_numa_balancing+()
Testcase: Min Max Avg StdDev %Change
numa01: 1758.01 1929.12 1853.15 77.15 -2.11%
numa02: 50.96 53.63 52.12 0.90 11.52%

KernelVersion: 3.9.0-numa_balancing_v39+()
Testcase: Min Max Avg StdDev %Change
numa01: 1081.66 1939.94 1500.01 350.20 16.10%
numa02: 35.32 43.92 38.64 3.35 44.76%

TODOs:
1. Use task loads for numa weights
2. Use numa faults as secondary key while moving threads

Andrea Arcangeli (1):
x86, mm: Prevent gcc to re-read the pagetables

Srikar Dronamraju (9):
sched: Introduce per node numa weights
sched: Use numa weights while migrating tasks
sched: Select a better task to pull across node using iterations
sched: Move active_load_balance_cpu_stop to a new helper function
sched: Extend idle balancing to look for consolidation of tasks
sched: Limit migrations from a node
sched: Pass hint to active balancer about the task to be chosen
sched: Prevent a task from migrating immediately after an active balance
sched: Choose a runqueue that has lesser local affinity tasks

arch/x86/mm/gup.c | 23 ++-
fs/exec.c | 6 +
include/linux/mm_types.h | 2 +
include/linux/sched.h | 4 +
kernel/fork.c | 11 +-
kernel/sched/core.c | 2 +
kernel/sched/fair.c | 443 ++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 4 +
mm/memory.c | 2 +-
9 files changed, 475 insertions(+), 22 deletions(-)

