[PATCH 0/27] Basic scheduler support for automatic NUMA balancing V6

From: Mel Gorman
Date: Thu Aug 08 2013 - 10:08:32 EST


This is another revision of the scheduler patches for NUMA balancing. Peter
and Rik, note that the grouping patches, the cpunid conversion and the
task swapping patches are missing as I ran into trouble while testing
them. They are rebased and available in the linux-balancenuma.git tree to
save you the bother of having to rebase them yourselves.

Changelog since V6
o Various TLB flush optimisations
o Comment updates
o Sanitise task_numa_fault callsites for consistent semantics
o Revert some of the scanning adaptation stuff
o Revert patch that defers scanning until task schedules on another node
o Start delayed scanning properly
o Avoid the same task always performing the PTE scan
o Continue PTE scanning even if migration is rate limited

Changelog since V5
o Add __GFP_NOWARN for numa hinting fault count
o Use is_huge_zero_page
o Favour moving tasks towards nodes with higher faults
o Optionally resist moving tasks towards nodes with lower faults
o Scan shared THP pages

Changelog since V4
o Added code that avoids overloading preferred nodes
o Swap tasks if nodes are overloaded and the swap does not impair locality

Changelog since V3
o Correct detection of unset last nid/pid information
o Dropped nr_preferred_running and replaced it with Peter's load balancing
o Pass in correct node information for THP hinting faults
o Pressure tasks sharing a THP page to move towards same node
o Do not set pmd_numa if false sharing is detected

Changelog since V2
o Reshuffle to match Peter's implied preference for layout
o Reshuffle to move private/shared split towards end of series to make it
easier to evaluate the impact
o Use PID information to identify private accesses
o Set the floor for PTE scanning based on virtual address space scan rates
instead of time
o Some locking improvements
o Do not preempt pinned tasks unless they are kernel threads

Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
preferred node
o Laughably basic accounting of a compute overloaded node when selecting
the preferred node.
o Applied review comments

This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

This is still far from complete and there are known performance gaps
between this series and manual binding (when that is possible). As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up.
Unfortunately, in some cases performance may be worse. When that happens it
will have to be judged whether the system overhead is lower and, if so,
whether this is still an acceptable direction as a stepping stone to
something better.

Patches 1-2 add sysctl documentation and comment fixlets

Patch 3 corrects a THP NUMA hint fault accounting bug

Patch 4 avoids trying to migrate the THP zero page

Patch 5 sanitizes task_numa_fault callsites to have consistent semantics and
always record the fault based on the correct location of the page.

Patch 6 avoids the same task being selected to perform the PTE scan within
a shared address space.
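
The idea can be sketched in a few lines (an invented userspace illustration,
not the code in the patch): after a thread performs the scan for a shared
address space, its own next-scan stamp is pushed out further than the mm-wide
one so that a sibling thread is likely to do the next pass.

  /*
   * Invented illustration only: penalise the thread that just scanned so
   * another thread sharing the mm is likely to perform the next scan.
   */
  struct mm_sketch   { unsigned long numa_next_scan; };  /* shared by threads */
  struct task_sketch { unsigned long node_stamp; };      /* per thread */

  static void finish_pte_scan(struct task_sketch *p, struct mm_sketch *mm,
                              unsigned long now, unsigned long scan_period)
  {
          mm->numa_next_scan = now + scan_period;    /* when the mm scans next */
          p->node_stamp = now + 2 * scan_period;     /* this thread backs off more */
  }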

Patch 7 continues PTE scanning even if migration is rate limited

Patch 8 notes that delaying the PTE scan until a task is scheduled on an
alternative node misses the case where the task is only accessing
shared memory on a partially loaded machine and reverts a patch.

Patch 9 initialises numa_next_scan properly so that PTE scanning is delayed
when a process starts.

Patch 10 slows the scanning rate if the task is idle

Patch 11 sets the scan rate proportional to the size of the task being
scanned.
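
The intent can be sketched as below (userspace illustration with made-up
constants and helper, not the kernel code): the scan period grows with the
size of the address space so that roughly the same amount of memory is
covered per scan window regardless of task size, clamped between a minimum
and maximum period.

  #define SCAN_PERIOD_MIN_MS   1000UL   /* fastest allowed rescan */
  #define SCAN_PERIOD_MAX_MS  60000UL   /* slowest allowed rescan */
  #define SCAN_SIZE_MB          256UL   /* target coverage per window */

  /* Scan period in milliseconds for a task with task_size_mb of mapped memory. */
  static unsigned long task_scan_period_ms(unsigned long task_size_mb)
  {
          unsigned long windows = (task_size_mb + SCAN_SIZE_MB - 1) / SCAN_SIZE_MB;
          unsigned long period = SCAN_PERIOD_MIN_MS * windows;

          if (period < SCAN_PERIOD_MIN_MS)
                  period = SCAN_PERIOD_MIN_MS;
          if (period > SCAN_PERIOD_MAX_MS)
                  period = SCAN_PERIOD_MAX_MS;
          return period;
  }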

Patch 12 is a minor adjustment to scan rate

Patches 13-14 avoid TLB flushes during the PTE scan if no updates are made
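
The idea, roughly (invented stand-in types, not the kernel code): count how
many PTEs the scan actually changed and only pay for the TLB flush when that
count is non-zero.

  #include <stdbool.h>
  #include <stddef.h>

  struct pte_sketch { bool present; bool numa_hint; };  /* stand-in for a real PTE */

  static void flush_tlb_range_stub(void) { }            /* stands in for a TLB flush */

  static void scan_range(struct pte_sketch *ptes, size_t nr)
  {
          size_t i, nr_updated = 0;

          for (i = 0; i < nr; i++) {
                  if (ptes[i].present && !ptes[i].numa_hint) {
                          ptes[i].numa_hint = true;     /* arm a hinting fault */
                          nr_updated++;
                  }
          }

          /* The point of patches 13-14: no updates, no flush. */
          if (nr_updated)
                  flush_tlb_range_stub();
  }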

Patch 15 tracks NUMA hinting faults per-task and per-node
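
Conceptually this is just a per-task array indexed by node id; a sketch with
placeholder sizes is below, the real code is more involved.

  #define NR_NODES_SKETCH 8       /* placeholder for the number of NUMA nodes */

  struct numa_faults_sketch {
          unsigned long faults[NR_NODES_SKETCH];
  };

  /* Called from the hinting fault path with the node the page was on. */
  static void record_numa_fault(struct numa_faults_sketch *nf, int nid, int pages)
  {
          nf->faults[nid] += pages;
  }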

Patches 16-20 select a preferred node at the end of a PTE scan based on which
node incurred the highest number of NUMA faults. When the balancer is
comparing two CPUs it will prefer to locate tasks on their preferred node.
When the preferred node is first selected, the task is rescheduled onto it
if it is not running on that node already. This avoids waiting for the
scheduler to move the task slowly.
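
The selection itself is simple in principle; something like the sketch below
(illustrative only), with the reschedule happening if the chosen node differs
from the one the task is currently running on.

  /* Return the node with the most recorded faults, or -1 if none were seen. */
  static int select_preferred_node(const unsigned long *faults, int nr_nodes)
  {
          unsigned long best_faults = 0;
          int nid, best = -1;

          for (nid = 0; nid < nr_nodes; nid++) {
                  if (faults[nid] > best_faults) {
                          best_faults = faults[nid];
                          best = nid;
                  }
          }
          return best;
  }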

Patch 21 adds infrastructure to allow separate tracking of shared/private
pages but treats all faults as if they are private accesses. Laying
it out this way reduces churn later in the series when private
fault detection is introduced

Patch 22 avoids some unnecessary allocation

Patches 23-24 kick away some training wheels and scan shared pages and small VMAs.

Patch 25 introduces private fault detection based on the PID of the faulting
process and accounts for shared/private accesses differently.
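
The classification can be sketched as follows (the field widths are invented
for illustration): a handful of bits of the pid of the last task to fault on
the page are stored next to the last nid in the page flags, and a subsequent
fault from the same pid is treated as a private access.

  #define NIDPID_PID_BITS 8
  #define NIDPID_PID_MASK ((1 << NIDPID_PID_BITS) - 1)

  static int nidpid_encode(int nid, int pid)
  {
          return (nid << NIDPID_PID_BITS) | (pid & NIDPID_PID_MASK);
  }

  /* A fault is "private" if the same (truncated) pid faulted on the page last. */
  static int fault_is_private(int last_nidpid, int this_pid)
  {
          return (last_nidpid & NIDPID_PID_MASK) == (this_pid & NIDPID_PID_MASK);
  }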

Patch 26 picks the least loaded CPU on the preferred node, using a scheduling
domain that is common to both the source and destination NUMA nodes.
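
In outline the placement step looks like the sketch below (array-based
stand-ins for the cpumask and runqueue loads; the real code walks a sched
domain spanning both nodes).

  /* Pick the CPU on the preferred node with the lowest load. */
  static int least_loaded_cpu(const int *node_cpus, int nr_cpus,
                              const unsigned long *cpu_load)
  {
          unsigned long best_load = ~0UL;
          int i, best_cpu = -1;

          for (i = 0; i < nr_cpus; i++) {
                  int cpu = node_cpus[i];

                  if (cpu_load[cpu] < best_load) {
                          best_load = cpu_load[cpu];
                          best_cpu = cpu;
                  }
          }
          return best_cpu;
  }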

Patch 27 retries task migration if an earlier attempt failed

Kernel 3.11-rc3 is the testing baseline.

o vanilla           vanilla kernel with automatic numa balancing enabled
o prepare-v6        Patches 1-14
o favorpref-v6      Patches 1-22
o scanshared-v6     Patches 1-24
o splitprivate-v6   Patches 1-25
o accountload-v6    Patches 1-26
o retrymigrate-v6   Patches 1-27

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system. Only a limited number of clients are executed
to save on time.


specjbb
3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3
vanilla prepare-v6r favorpref-v6r scanshared-v6r splitprivate-v6r accountload-v6r retrymigrate-v6r
TPut 1 26752.00 ( 0.00%) 26143.00 ( -2.28%) 27475.00 ( 2.70%) 25905.00 ( -3.17%) 26159.00 ( -2.22%) 26752.00 ( 0.00%) 25766.00 ( -3.69%)
TPut 7 177228.00 ( 0.00%) 180918.00 ( 2.08%) 178629.00 ( 0.79%) 182270.00 ( 2.84%) 178194.00 ( 0.55%) 178862.00 ( 0.92%) 172865.00 ( -2.46%)
TPut 13 315823.00 ( 0.00%) 332697.00 ( 5.34%) 305875.00 ( -3.15%) 316406.00 ( 0.18%) 327239.00 ( 3.61%) 329285.00 ( 4.26%) 298184.00 ( -5.59%)
TPut 19 374121.00 ( 0.00%) 436339.00 ( 16.63%) 334925.00 (-10.48%) 355411.00 ( -5.00%) 439940.00 ( 17.59%) 415161.00 ( 10.97%) 400691.00 ( 7.10%)
TPut 25 414120.00 ( 0.00%) 489032.00 ( 18.09%) 371098.00 (-10.39%) 368906.00 (-10.92%) 525280.00 ( 26.84%) 444735.00 ( 7.39%) 472093.00 ( 14.00%)
TPut 31 402341.00 ( 0.00%) 477315.00 ( 18.63%) 374298.00 ( -6.97%) 375344.00 ( -6.71%) 508385.00 ( 26.36%) 410521.00 ( 2.03%) 464570.00 ( 15.47%)
TPut 37 421873.00 ( 0.00%) 470719.00 ( 11.58%) 362894.00 (-13.98%) 364194.00 (-13.67%) 499375.00 ( 18.37%) 398894.00 ( -5.45%) 454155.00 ( 7.65%)
TPut 43 386643.00 ( 0.00%) 443599.00 ( 14.73%) 344752.00 (-10.83%) 325270.00 (-15.87%) 446157.00 ( 15.39%) 355137.00 ( -8.15%) 401509.00 ( 3.84%)

So this was variable throughout the series. The preparation patches at least
made sense on their own. scanshared looks bad, but that patch adds all of the
cost with none of the benefit until private/shared faults are split. Overall
it's ok, but there is massive room for improvement.

3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3
vanilla prepare-v6r favorpref-v6r scanshared-v6r splitprivate-v6r accountload-v6r retrymigrate-v6r
User 5195.50 5204.42 5214.58 5216.49 5180.43 5197.30 5184.02
System 68.87 61.46 72.01 72.28 85.52 71.44 78.47
Elapsed 252.94 253.70 254.62 252.98 253.06 253.49 253.15

System CPU usage is higher overall, and it was already higher before shared
PTEs were being scanned.

3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3
vanilla prepare-v6r favorpref-v6r scanshared-v6r splitprivate-v6r accountload-v6r retrymigrate-v6r
Page migrate success 1818805 1356595 2061245 1396578 4144428 4013443 4301319
Page migrate failure 0 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0 0
Compaction cost 1887 1408 2139 1449 4301 4165 4464
NUMA PTE updates 17498156 14714823 18135699 13738286 15185177 16538890 17187537
NUMA hint faults 175555 79813 88041 64106 121892 111629 122575
NUMA hint local faults 115592 27968 38771 22257 38245 41230 55953
NUMA hint local percent 65 35 44 34 31 36 45
NUMA pages migrated 1818805 1356595 2061245 1396578 4144428 4013443 4301319
AutoNUMA cost 1034 527 606 443 794 750 814

And the higher CPU usage may be due to a much higher number of pages being
migrated. Looks like tasks are bouncing around quite a bit.

autonumabench
3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3
vanilla prepare-v6 favorpref-v6 scanshared-v6 splitprivate-v6 accountload-v6 retrymigrate-v6
User NUMA01 58160.75 ( 0.00%) 61893.02 ( -6.42%) 58204.95 ( -0.08%) 51066.78 ( 12.20%) 52787.02 ( 9.24%) 52799.39 ( 9.22%) 54846.75 ( 5.70%)
User NUMA01_THEADLOCAL 17419.30 ( 0.00%) 17555.95 ( -0.78%) 17629.84 ( -1.21%) 17725.93 ( -1.76%) 17196.56 ( 1.28%) 17015.65 ( 2.32%) 17314.51 ( 0.60%)
User NUMA02 2083.65 ( 0.00%) 4035.00 (-93.65%) 2259.24 ( -8.43%) 2267.69 ( -8.83%) 2073.19 ( 0.50%) 2072.39 ( 0.54%) 2066.83 ( 0.81%)
User NUMA02_SMT 995.28 ( 0.00%) 1023.44 ( -2.83%) 1085.39 ( -9.05%) 2057.87 (-106.76%) 989.89 ( 0.54%) 1005.46 ( -1.02%) 986.47 ( 0.89%)
System NUMA01 495.05 ( 0.00%) 272.96 ( 44.86%) 563.07 (-13.74%) 347.50 ( 29.81%) 528.57 ( -6.77%) 571.74 (-15.49%) 309.23 ( 37.54%)
System NUMA01_THEADLOCAL 101.82 ( 0.00%) 121.04 (-18.88%) 106.16 ( -4.26%) 108.09 ( -6.16%) 110.88 ( -8.90%) 105.33 ( -3.45%) 112.18 (-10.17%)
System NUMA02 6.32 ( 0.00%) 8.44 (-33.54%) 8.45 (-33.70%) 9.72 (-53.80%) 6.04 ( 4.43%) 6.50 ( -2.85%) 6.05 ( 4.27%)
System NUMA02_SMT 3.34 ( 0.00%) 3.30 ( 1.20%) 3.46 ( -3.59%) 3.53 ( -5.69%) 3.09 ( 7.49%) 3.65 ( -9.28%) 3.42 ( -2.40%)
Elapsed NUMA01 1308.52 ( 0.00%) 1372.86 ( -4.92%) 1297.49 ( 0.84%) 1151.22 ( 12.02%) 1183.57 ( 9.55%) 1185.22 ( 9.42%) 1237.37 ( 5.44%)
Elapsed NUMA01_THEADLOCAL 387.17 ( 0.00%) 386.75 ( 0.11%) 386.78 ( 0.10%) 398.48 ( -2.92%) 377.49 ( 2.50%) 368.04 ( 4.94%) 384.18 ( 0.77%)
Elapsed NUMA02 49.66 ( 0.00%) 94.02 (-89.33%) 53.66 ( -8.05%) 54.11 ( -8.96%) 49.38 ( 0.56%) 49.87 ( -0.42%) 49.66 ( 0.00%)
Elapsed NUMA02_SMT 46.62 ( 0.00%) 47.41 ( -1.69%) 50.73 ( -8.82%) 96.15 (-106.24%) 47.60 ( -2.10%) 53.47 (-14.69%) 49.12 ( -5.36%)
CPU NUMA01 4482.00 ( 0.00%) 4528.00 ( -1.03%) 4529.00 ( -1.05%) 4466.00 ( 0.36%) 4504.00 ( -0.49%) 4503.00 ( -0.47%) 4457.00 ( 0.56%)
CPU NUMA01_THEADLOCAL 4525.00 ( 0.00%) 4570.00 ( -0.99%) 4585.00 ( -1.33%) 4475.00 ( 1.10%) 4584.00 ( -1.30%) 4651.00 ( -2.78%) 4536.00 ( -0.24%)
CPU NUMA02 4208.00 ( 0.00%) 4300.00 ( -2.19%) 4226.00 ( -0.43%) 4208.00 ( 0.00%) 4210.00 ( -0.05%) 4167.00 ( 0.97%) 4174.00 ( 0.81%)
CPU NUMA02_SMT 2141.00 ( 0.00%) 2165.00 ( -1.12%) 2146.00 ( -0.23%) 2143.00 ( -0.09%) 2085.00 ( 2.62%) 1886.00 ( 11.91%) 2015.00 ( 5.89%)


Generally ok for the overall series. It is interesting how numa02_smt is hurt
by scanning shared PTEs but recovers again once only private faults are used
for task placement.

3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3
vanilla prepare-v6 favorpref-v6 scanshared-v6 splitprivate-v6 accountload-v6 retrymigrate-v6
User 78665.30 84513.85 79186.36 73125.19 73053.41 72899.54 75221.12
System 607.14 406.29 681.73 469.46 649.18 687.82 431.48
Elapsed 1800.42 1911.20 1799.31 1710.36 1669.53 1666.22 1729.92

Overall series reduces system CPU usage.

The following is SpecJBB running with THP enabled and one JVM running per
NUMA node in the system.

specjbb
3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3
vanilla prepare-v6 favorpref-v6 scanshared-v6 splitprivate-v6 accountload-v6 retrymigrate-v6
Mean 1 30351.25 ( 0.00%) 30216.75 ( -0.44%) 30537.50 ( 0.61%) 29639.75 ( -2.34%) 31520.75 ( 3.85%) 31330.75 ( 3.23%) 31422.50 ( 3.53%)
Mean 10 114819.50 ( 0.00%) 128247.25 ( 11.69%) 129900.00 ( 13.13%) 126177.75 ( 9.89%) 135129.25 ( 17.69%) 130630.00 ( 13.77%) 126775.75 ( 10.41%)
Mean 19 119875.00 ( 0.00%) 124470.25 ( 3.83%) 124968.50 ( 4.25%) 119504.50 ( -0.31%) 124087.75 ( 3.51%) 121787.00 ( 1.59%) 125005.50 ( 4.28%)
Mean 28 121703.00 ( 0.00%) 120958.00 ( -0.61%) 124887.00 ( 2.62%) 123587.25 ( 1.55%) 123996.25 ( 1.88%) 119939.75 ( -1.45%) 120981.00 ( -0.59%)
Mean 37 121225.00 ( 0.00%) 120962.25 ( -0.22%) 120647.25 ( -0.48%) 121064.75 ( -0.13%) 115485.50 ( -4.73%) 115719.00 ( -4.54%) 123646.75 ( 2.00%)
Mean 46 121941.00 ( 0.00%) 127056.75 ( 4.20%) 115405.25 ( -5.36%) 119984.75 ( -1.60%) 115412.25 ( -5.35%) 111770.00 ( -8.34%) 127094.00 ( 4.23%)
Stddev 1 1711.82 ( 0.00%) 2160.62 (-26.22%) 1437.57 ( 16.02%) 1292.02 ( 24.52%) 1293.25 ( 24.45%) 1486.25 ( 13.18%) 1598.20 ( 6.64%)
Stddev 10 14943.91 ( 0.00%) 6974.79 ( 53.33%) 13344.66 ( 10.70%) 5891.26 ( 60.58%) 8336.20 ( 44.22%) 4203.26 ( 71.87%) 4874.50 ( 67.38%)
Stddev 19 5666.38 ( 0.00%) 4461.32 ( 21.27%) 9846.02 (-73.76%) 7664.08 (-35.26%) 6352.07 (-12.10%) 3119.54 ( 44.95%) 2932.69 ( 48.24%)
Stddev 28 4575.92 ( 0.00%) 3040.77 ( 33.55%) 10082.34 (-120.33%) 5236.45 (-14.43%) 6866.23 (-50.05%) 2378.94 ( 48.01%) 1937.93 ( 57.65%)
Stddev 37 2319.04 ( 0.00%) 7257.80 (-212.96%) 9296.46 (-300.87%) 3775.69 (-62.81%) 3822.41 (-64.83%) 2040.25 ( 12.02%) 1854.86 ( 20.02%)
Stddev 46 1138.20 ( 0.00%) 4288.72 (-276.80%) 9861.65 (-766.43%) 3338.54 (-193.32%) 3761.28 (-230.46%) 2105.55 (-84.99%) 4997.01 (-339.03%)
TPut 1 121405.00 ( 0.00%) 120867.00 ( -0.44%) 122150.00 ( 0.61%) 118559.00 ( -2.34%) 126083.00 ( 3.85%) 125323.00 ( 3.23%) 125690.00 ( 3.53%)
TPut 10 459278.00 ( 0.00%) 512989.00 ( 11.69%) 519600.00 ( 13.13%) 504711.00 ( 9.89%) 540517.00 ( 17.69%) 522520.00 ( 13.77%) 507103.00 ( 10.41%)
TPut 19 479500.00 ( 0.00%) 497881.00 ( 3.83%) 499874.00 ( 4.25%) 478018.00 ( -0.31%) 496351.00 ( 3.51%) 487148.00 ( 1.59%) 500022.00 ( 4.28%)
TPut 28 486812.00 ( 0.00%) 483832.00 ( -0.61%) 499548.00 ( 2.62%) 494349.00 ( 1.55%) 495985.00 ( 1.88%) 479759.00 ( -1.45%) 483924.00 ( -0.59%)
TPut 37 484900.00 ( 0.00%) 483849.00 ( -0.22%) 482589.00 ( -0.48%) 484259.00 ( -0.13%) 461942.00 ( -4.73%) 462876.00 ( -4.54%) 494587.00 ( 2.00%)
TPut 46 487764.00 ( 0.00%) 508227.00 ( 4.20%) 461621.00 ( -5.36%) 479939.00 ( -1.60%) 461649.00 ( -5.35%) 447080.00 ( -8.34%) 508376.00 ( 4.23%)

Performance here is a mixed bag. In terms of absolute performance it is
roughly the same and close to the noise, although peak performance is improved
in all cases. On a more positive note, the variation in performance between
JVMs is much reduced by the overall series.

3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3
vanilla prepare-v6 favorpref-v6 scanshared-v6 splitprivate-v6 accountload-v6 retrymigrate-v6
User 54269.82 53933.58 53502.51 53123.89 54084.82 54073.35 54164.62
System 286.88 237.68 255.10 214.11 246.86 253.07 252.13
Elapsed 1230.49 1223.30 1215.55 1203.50 1228.03 1227.67 1222.97

And system CPU usage is slightly reduced.

3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3 3.11.0-rc3
vanilla prepare-v6 favorpref-v6 scanshared-v6 splitprivate-v6 accountload-v6 retrymigrate-v6
Page migrate success 13046945 9345421 9547680 5999273 10045051 9777173 10238730
Page migrate failure 0 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0 0
Compaction cost 13542 9700 9910 6227 10426 10148 10627
NUMA PTE updates 133187916 88422756 88624399 59087531 62208803 64097981 65891682
NUMA hint faults 2275465 1570765 1550779 1208413 1635952 1674890 1618290
NUMA hint local faults 678290 445712 420694 376197 527317 566505 543508
NUMA hint local percent 29 28 27 31 32 33 33
NUMA pages migrated 13046945 9345421 9547680 5999273 10045051 9777173 10238730
AutoNUMA cost 12557 8650 8555 6569 8806 9008 8747

Fewer pages are migrated but the percentage of local NUMA hint faults is
still depressingly low for what should be an ideal test case for automatic
NUMA placement. This workload is where I expect grouping related tasks
together on the same node to make a big difference.

I think this aspect of the patches is pretty much as far as it can go, and
grouping related tasks together, which Peter and Rik have been working on,
is the next step.

Documentation/sysctl/kernel.txt | 73 +++++++
include/linux/migrate.h | 7 +-
include/linux/mm.h | 89 ++++++--
include/linux/mm_types.h | 14 +-
include/linux/page-flags-layout.h | 28 ++-
include/linux/sched.h | 24 ++-
kernel/fork.c | 3 -
kernel/sched/core.c | 31 ++-
kernel/sched/fair.c | 425 +++++++++++++++++++++++++++++++++-----
kernel/sched/features.h | 19 +-
kernel/sched/sched.h | 13 ++
kernel/sysctl.c | 7 +
mm/huge_memory.c | 62 ++++--
mm/memory.c | 73 +++----
mm/mempolicy.c | 8 +-
mm/migrate.c | 21 +-
mm/mm_init.c | 18 +-
mm/mmzone.c | 14 +-
mm/mprotect.c | 47 +++--
mm/page_alloc.c | 4 +-
20 files changed, 760 insertions(+), 220 deletions(-)

--
1.8.1.4
