[PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2

From: Mel Gorman
Date: Wed Jul 03 2013 - 10:25:17 EST


This builds on the V1 series a bit. The performance still needs to be tied
down but it brings in a few more essential basics. Note that Peter has
posted another patch related to avoiding overloading compute nodes but I
have not had the chance to examine it yet. I'll be doing that after this
is posted; I decided not to postpone releasing this series as I'm already
two days overdue with an update.

Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
preferred node
o Laughably basic accounting of a compute overloaded node when selecting
the preferred node.
o Applied review comments

This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

This is still far from complete and there are known performance gaps
between this series and manual binding (when that is possible). As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up.
Unfortunately, in some cases performance may be worse. When that happens
it will have to be judged whether the system overhead is lower and, if so,
whether it is still an acceptable direction as a stepping stone to something
better.

Patch 1 adds sysctl documentation

Patch 2 tracks NUMA hinting faults per-task and per-node
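
For anyone unfamiliar with the mechanism, the bookkeeping amounts to a
per-task array indexed by node id that is bumped from the hinting fault
path. A rough standalone sketch of the idea (all names here are made up
for illustration, they are not the kernel interfaces):

#include <stdio.h>

#define NR_NODES 4    /* toy value; the kernel sizes this per machine */

/* Stand-in for the per-task state added by the patch */
struct task_numa_stats {
    unsigned long numa_faults[NR_NODES];
};

/* Called from the NUMA hinting fault path with the node that
 * currently backs the faulting page. */
static void record_numa_fault(struct task_numa_stats *ts, int nid, int pages)
{
    ts->numa_faults[nid] += pages;
}

int main(void)
{
    struct task_numa_stats ts = { { 0 } };
    int nid;

    record_numa_fault(&ts, 0, 1);
    record_numa_fault(&ts, 2, 1);
    record_numa_fault(&ts, 2, 1);

    for (nid = 0; nid < NR_NODES; nid++)
        printf("node %d: %lu hinting faults\n", nid, ts.numa_faults[nid]);
    return 0;
}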

Patches 3-5 select a preferred node at the end of a PTE scan based on which
node incurred the highest number of NUMA faults. When the balancer
is comparing two CPUs it will prefer to locate tasks on their
preferred node.
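
The selection itself is nothing more than a max over those per-node
counts at the end of a scan window. Illustrative only, slotting into the
sketch above:

/* Pick the node that incurred the most hinting faults during the
 * last scan window; returns -1 if no faults were recorded. */
static int pick_preferred_node(const unsigned long *faults, int nr_nodes)
{
    unsigned long max_faults = 0;
    int nid, preferred = -1;

    for (nid = 0; nid < nr_nodes; nid++) {
        if (faults[nid] > max_faults) {
            max_faults = faults[nid];
            preferred = nid;
        }
    }
    return preferred;
}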

Patch 6 reschedules a task when a preferred node is selected if it is not
running on that node already. This avoids waiting for the scheduler
to move the task slowly.
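
The check is deliberately simple: if the newly selected preferred node is
not where the task is running, request a move straight away. A toy model
of the shape of it (types and helpers are stand-ins, not the scheduler's):

/* Toy task state; not the scheduler's task_struct. */
struct toy_task {
    int cur_nid;        /* node the task currently runs on */
    int preferred_nid;  /* -1 until a preferred node is chosen */
};

/* Stands in for an active migration request to the target node. */
static void move_task_to_node(struct toy_task *p, int nid)
{
    p->cur_nid = nid;
}

static void check_preferred_placement(struct toy_task *p)
{
    if (p->preferred_nid >= 0 && p->cur_nid != p->preferred_nid)
        move_task_to_node(p, p->preferred_nid);
}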

Patch 7 splits the accounting of faults between those that passed the
two-stage filter and those that did not. Task placement favours
the filtered faults initially although ultimately this will need
more smarts when node-local faults do not dominate.
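
In other words the per-node counters gain a second dimension, one bucket
for faults that passed the two-stage filter and one for those that did
not, and placement currently only consults the first. Roughly (again with
invented names):

#define NR_NODES 4

enum fault_bucket {
    FAULT_FILTERED,     /* passed the two-stage filter */
    FAULT_UNFILTERED,   /* did not pass */
    NR_FAULT_BUCKETS
};

struct split_faults {
    unsigned long faults[NR_FAULT_BUCKETS][NR_NODES];
};

static void record_fault(struct split_faults *sf, int nid, int passed_filter)
{
    enum fault_bucket b = passed_filter ? FAULT_FILTERED : FAULT_UNFILTERED;

    sf->faults[b][nid]++;
}

/* Placement decisions currently consult only the filtered bucket. */
static const unsigned long *placement_faults(const struct split_faults *sf)
{
    return sf->faults[FAULT_FILTERED];
}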

Patch 8 replaces the PTE scanning reset hammer and instead increases the
scanning rate when an otherwise settled task changes its
preferred node.
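
Rather than resetting the scan state outright, the scan period is
shortened when an otherwise settled task changes its preferred node and
is allowed to back off again while the task stays put. A hedged sketch of
that adjustment (the constants are invented; the real tunables are the
numa_balancing scan period sysctls documented by patch 1):

/* Illustrative bounds only. */
#define SCAN_PERIOD_MIN_MS  1000
#define SCAN_PERIOD_MAX_MS  60000

/* Speed scanning up when an apparently settled task changes its
 * preferred node; back off slowly while it stays put. */
static unsigned int update_scan_period(unsigned int period_ms, int node_changed)
{
    if (node_changed)
        period_ms /= 2;              /* scan more often */
    else
        period_ms += period_ms / 4;  /* slowly back off */

    if (period_ms < SCAN_PERIOD_MIN_MS)
        period_ms = SCAN_PERIOD_MIN_MS;
    if (period_ms > SCAN_PERIOD_MAX_MS)
        period_ms = SCAN_PERIOD_MAX_MS;
    return period_ms;
}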

Patch 9 favours moving tasks towards nodes where more faults were incurred
even if it is not the preferred node

Patch 10 sets the scan rate proportional to the size of the task being scanned.
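
The intent is that small and large tasks are sampled at roughly the same
rate relative to their address space, so the scan period is scaled by the
task's virtual size instead of being a fixed interval. One possible way
to express that scaling (the reference size and clamps are arbitrary,
illustrative values):

#define BASE_SCAN_PERIOD_MS 1000
#define REFERENCE_PAGES     (256UL * 1024 * 1024 / 4096) /* 256MB of 4K pages */
#define SCAN_PERIOD_MIN_MS  1000
#define SCAN_PERIOD_MAX_MS  60000

/* Larger address spaces get a proportionally longer period between
 * scan windows so the sampling per unit of memory stays comparable. */
static unsigned long size_scaled_scan_period(unsigned long task_pages)
{
    unsigned long period = BASE_SCAN_PERIOD_MS * task_pages / REFERENCE_PAGES;

    if (period < SCAN_PERIOD_MIN_MS)
        period = SCAN_PERIOD_MIN_MS;
    if (period > SCAN_PERIOD_MAX_MS)
        period = SCAN_PERIOD_MAX_MS;
    return period;
}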

Patch 11 avoids some unnecessary allocation

Patch 12 kicks away some training wheels and scans shared pages

Patch 13 accounts for how many "preferred placed" tasks are running on a node
and attempts to avoid overloading them
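
The accounting really is as basic as it sounds: keep a per-node count of
tasks whose preferred node is that node and do not pick a node that
already has at least as many such tasks as CPUs. A toy sketch under those
assumptions:

#define NR_NODES 4

struct node_compute {
    int nr_cpus;            /* CPUs on this node; filled in at init */
    int nr_preferred_tasks; /* tasks with this as their preferred node */
};

static struct node_compute nodes[NR_NODES];

/* Reject a candidate preferred node that already looks compute
 * overloaded; the caller falls back to the next-best node. */
static int node_has_capacity(int nid)
{
    return nodes[nid].nr_preferred_tasks < nodes[nid].nr_cpus;
}

static void account_preferred(int old_nid, int new_nid)
{
    if (old_nid >= 0)
        nodes[old_nid].nr_preferred_tasks--;
    if (new_nid >= 0)
        nodes[new_nid].nr_preferred_tasks++;
}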

Testing on this is only partial as full tests take a long time to run.

I tested 5 kernels using 3.9.0 as a baseline

o 3.9.0-vanilla vanilla kernel with automatic numa balancing enabled
o 3.9.0-morefaults Patches 1-9
o 3.9.0-scalescan Patches 1-10
o 3.9.0-scanshared Patches 1-12
o 3.9.0-accountpreferred Patches 1-13

autonumabench
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla morefaults scalescan scanshared accountpreferred
User NUMA01 52623.86 ( 0.00%) 53408.85 ( -1.49%) 52042.73 ( 1.10%) 60404.32 (-14.79%) 57403.32 ( -9.08%)
User NUMA01_THEADLOCAL 17595.48 ( 0.00%) 17902.64 ( -1.75%) 18070.07 ( -2.70%) 18937.22 ( -7.63%) 17675.13 ( -0.45%)
User NUMA02 2043.84 ( 0.00%) 2029.40 ( 0.71%) 2183.84 ( -6.85%) 2173.80 ( -6.36%) 2259.45 (-10.55%)
User NUMA02_SMT 1057.11 ( 0.00%) 999.71 ( 5.43%) 1045.10 ( 1.14%) 1046.01 ( 1.05%) 1048.58 ( 0.81%)
System NUMA01 414.17 ( 0.00%) 328.68 ( 20.64%) 326.08 ( 21.27%) 155.69 ( 62.41%) 144.53 ( 65.10%)
System NUMA01_THEADLOCAL 105.17 ( 0.00%) 93.22 ( 11.36%) 97.63 ( 7.17%) 95.46 ( 9.23%) 102.47 ( 2.57%)
System NUMA02 9.36 ( 0.00%) 9.39 ( -0.32%) 9.25 ( 1.18%) 8.42 ( 10.04%) 10.46 (-11.75%)
System NUMA02_SMT 3.54 ( 0.00%) 3.32 ( 6.21%) 4.27 (-20.62%) 3.41 ( 3.67%) 3.72 ( -5.08%)
Elapsed NUMA01 1201.52 ( 0.00%) 1238.04 ( -3.04%) 1220.85 ( -1.61%) 1385.58 (-15.32%) 1335.06 (-11.11%)
Elapsed NUMA01_THEADLOCAL 393.91 ( 0.00%) 410.64 ( -4.25%) 414.33 ( -5.18%) 434.54 (-10.31%) 406.84 ( -3.28%)
Elapsed NUMA02 50.30 ( 0.00%) 50.30 ( 0.00%) 54.49 ( -8.33%) 52.14 ( -3.66%) 56.81 (-12.94%)
Elapsed NUMA02_SMT 58.48 ( 0.00%) 52.91 ( 9.52%) 58.71 ( -0.39%) 53.12 ( 9.17%) 60.82 ( -4.00%)
CPU NUMA01 4414.00 ( 0.00%) 4340.00 ( 1.68%) 4289.00 ( 2.83%) 4370.00 ( 1.00%) 4310.00 ( 2.36%)
CPU NUMA01_THEADLOCAL 4493.00 ( 0.00%) 4382.00 ( 2.47%) 4384.00 ( 2.43%) 4379.00 ( 2.54%) 4369.00 ( 2.76%)
CPU NUMA02 4081.00 ( 0.00%) 4052.00 ( 0.71%) 4024.00 ( 1.40%) 4184.00 ( -2.52%) 3995.00 ( 2.11%)
CPU NUMA02_SMT 1813.00 ( 0.00%) 1895.00 ( -4.52%) 1787.00 ( 1.43%) 1975.00 ( -8.94%) 1730.00 ( 4.58%)

3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla   morefaults   scalescan   scanshared   accountpreferred
User 73328.02 74347.84 73349.21 82568.87 78393.27
System 532.89 435.24 437.90 263.61 261.79
Elapsed 1714.18 1763.03 1759.02 1936.17 1869.51

numa01 suffers a bit here but numa01 is also an adverse workload on this
machine. The result is poor but I'm not concentrating on it right now.

Just patches 1-9 (morefaults) perform ok. numa02 is flat and numa02_smt
sees a small performance gain. I do not have variance data to establish
if this is significant or not. After that, altering the scanning had a
large impact and I'll re-examine if the default scan rate is just too slow.

It's worth noting the impact on system CPU time. Overall it is much reduced
and I think we need to keep pushing to keep the overhead as low as possible,
particularly in the future as memory sizes grow.

3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla   morefaults   scalescan   scanshared   accountpreferred
THP fault alloc 14325 14293 14103 14259 14081
THP collapse alloc 6 3 1 10 5
THP splits 4 5 5 3 2
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 9020528 5227450 5355703 5597558 5637844
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 9363 5426 5559 5810 5852
NUMA PTE updates 119292401 79765854 73441393 76125744 75857594
NUMA hint faults 755901 384660 206195 214063 193969
NUMA hint local faults 595478 292221 120436 113812 109472
NUMA pages migrated 9020528 5227450 5355703 5597558 5637844
AutoNUMA cost 4785 2580 1646 1709 1607

Primary take-away point is to note the reduction in NUMA hinting faults
and PTE updates. Patches 1-9 incur about half the number of faults for
comparable overall performance.

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system. Only a limited number of clients are executed
to save on time. The full set is queued.

specjbb
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla morefaults scalescan scanshared accountpreferred
TPut 1 26099.00 ( 0.00%) 24848.00 ( -4.79%) 23990.00 ( -8.08%) 24350.00 ( -6.70%) 24248.00 ( -7.09%)
TPut 7 187276.00 ( 0.00%) 189731.00 ( 1.31%) 189065.00 ( 0.96%) 188680.00 ( 0.75%) 189774.00 ( 1.33%)
TPut 13 318028.00 ( 0.00%) 337374.00 ( 6.08%) 339016.00 ( 6.60%) 329143.00 ( 3.49%) 338743.00 ( 6.51%)
TPut 19 368547.00 ( 0.00%) 429440.00 ( 16.52%) 423973.00 ( 15.04%) 403563.00 ( 9.50%) 430941.00 ( 16.93%)
TPut 25 377522.00 ( 0.00%) 497621.00 ( 31.81%) 488108.00 ( 29.29%) 437813.00 ( 15.97%) 485013.00 ( 28.47%)
TPut 31 347642.00 ( 0.00%) 487253.00 ( 40.16%) 466366.00 ( 34.15%) 386972.00 ( 11.31%) 437104.00 ( 25.73%)
TPut 37 313439.00 ( 0.00%) 478601.00 ( 52.69%) 443415.00 ( 41.47%) 379081.00 ( 20.94%) 425452.00 ( 35.74%)
TPut 43 291958.00 ( 0.00%) 458614.00 ( 57.08%) 398195.00 ( 36.39%) 349661.00 ( 19.76%) 393102.00 ( 34.64%)

Patches 1-9 again perform extremely well here. Reducing the scan rate
had an impact, as did scanning shared pages, which may indicate that the
shared/private identification is insufficient. Reducing the scan rate might
be the dominant factor as the tests are very short lived -- 30 seconds
each, which is just 10 PTE scan windows. Basic accounting of compute load
helped again and overall the series was competitive.

specjbb Peaks
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla morefaults scalescan scanshared accountpreferred
Expctd Warehouse 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%)
Actual Warehouse 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%)
Actual Peak Bops 377522.00 ( 0.00%) 497621.00 ( 31.81%) 488108.00 ( 29.29%) 437813.00 ( 15.97%) 485013.00 ( 28.47%)

At least peak bops always improved.

3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla   morefaults   scalescan   scanshared   accountpreferred
User 5184.53 5190.44 5195.01 5173.33 5185.88
System 59.61 58.91 61.47 73.84 64.81
Elapsed 254.52 254.17 254.12 254.80 254.55

Interestingly system CPU times were mixed. Scan shared incurred fewer
faults, migrated fewer pages and updated fewer PTEs so the time is being
lost elsewhere.

3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla morefaults scalescan scanshared accountpreferred
THP fault alloc 33297 33251 34306 35144 33898
THP collapse alloc 9 8 14 8 16
THP splits 3 4 5 4 4
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 1773768 1716596 2075251 1815999 1858598
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 1841 1781 2154 1885 1929
NUMA PTE updates 17461135 17268534 18766637 17602518 17406295
NUMA hint faults 85873 170092 86195 80027 84052
NUMA hint local faults 27145 116070 30651 28293 29919
NUMA pages migrated 1773768 1716596 2075251 1815999 1858598
AutoNUMA cost 585 1003 601 557 577

Not much of note there, other than that patches 1-9 incurred a very high
number of hinting faults and it's not immediately obvious why.

I also ran SpecJBB with THP enabled and one JVM running per NUMA node in
the system. As with the other test, only a limited number of clients were
run to save time.

specjbb
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla   morefaults-v2r2   scalescan-v2r12   scanshared-v2r12   accountpreferred-v2r12
Mean 1 30331.00 ( 0.00%) 31076.75 ( 2.46%) 30813.50 ( 1.59%) 30612.50 ( 0.93%) 30411.00 ( 0.26%)
Mean 7 150487.75 ( 0.00%) 153060.00 ( 1.71%) 155117.50 ( 3.08%) 152602.25 ( 1.41%) 151431.25 ( 0.63%)
Mean 13 130513.00 ( 0.00%) 135521.25 ( 3.84%) 136205.50 ( 4.36%) 135635.25 ( 3.92%) 130575.50 ( 0.05%)
Mean 19 123404.75 ( 0.00%) 131505.75 ( 6.56%) 126020.75 ( 2.12%) 127171.25 ( 3.05%) 119632.75 ( -3.06%)
Mean 25 116276.00 ( 0.00%) 120041.75 ( 3.24%) 117053.25 ( 0.67%) 121249.75 ( 4.28%) 112591.75 ( -3.17%)
Mean 31 108080.00 ( 0.00%) 113237.00 ( 4.77%) 113738.00 ( 5.24%) 114078.00 ( 5.55%) 106955.00 ( -1.04%)
Mean 37 102704.00 ( 0.00%) 107246.75 ( 4.42%) 113435.75 ( 10.45%) 111945.50 ( 9.00%) 106184.75 ( 3.39%)
Mean 43 98132.00 ( 0.00%) 105014.75 ( 7.01%) 109398.75 ( 11.48%) 106662.75 ( 8.69%) 103322.75 ( 5.29%)
Stddev 1 792.83 ( 0.00%) 1127.16 (-42.17%) 1321.59 (-66.69%) 1356.36 (-71.08%) 715.51 ( 9.75%)
Stddev 7 4080.34 ( 0.00%) 526.84 ( 87.09%) 3153.16 ( 22.72%) 3781.85 ( 7.32%) 2863.35 ( 29.83%)
Stddev 13 6614.16 ( 0.00%) 2086.04 ( 68.46%) 4139.26 ( 37.42%) 2486.95 ( 62.40%) 4066.48 ( 38.52%)
Stddev 19 2835.73 ( 0.00%) 1928.86 ( 31.98%) 4097.14 (-44.48%) 591.59 ( 79.14%) 3182.51 (-12.23%)
Stddev 25 3608.71 ( 0.00%) 3198.96 ( 11.35%) 5391.60 (-49.41%) 1606.37 ( 55.49%) 3326.21 ( 7.83%)
Stddev 31 2778.25 ( 0.00%) 784.02 ( 71.78%) 6802.53 (-144.85%) 1738.20 ( 37.44%) 1126.27 ( 59.46%)
Stddev 37 4069.13 ( 0.00%) 5009.93 (-23.12%) 5022.13 (-23.42%) 4191.94 ( -3.02%) 1031.05 ( 74.66%)
Stddev 43 9215.73 ( 0.00%) 5589.12 ( 39.35%) 8915.80 ( 3.25%) 8042.72 ( 12.73%) 3113.04 ( 66.22%)
TPut 1 121324.00 ( 0.00%) 124307.00 ( 2.46%) 123254.00 ( 1.59%) 122450.00 ( 0.93%) 121644.00 ( 0.26%)
TPut 7 601951.00 ( 0.00%) 612240.00 ( 1.71%) 620470.00 ( 3.08%) 610409.00 ( 1.41%) 605725.00 ( 0.63%)
TPut 13 522052.00 ( 0.00%) 542085.00 ( 3.84%) 544822.00 ( 4.36%) 542541.00 ( 3.92%) 522302.00 ( 0.05%)
TPut 19 493619.00 ( 0.00%) 526023.00 ( 6.56%) 504083.00 ( 2.12%) 508685.00 ( 3.05%) 478531.00 ( -3.06%)
TPut 25 465104.00 ( 0.00%) 480167.00 ( 3.24%) 468213.00 ( 0.67%) 484999.00 ( 4.28%) 450367.00 ( -3.17%)
TPut 31 432320.00 ( 0.00%) 452948.00 ( 4.77%) 454952.00 ( 5.24%) 456312.00 ( 5.55%) 427820.00 ( -1.04%)
TPut 37 410816.00 ( 0.00%) 428987.00 ( 4.42%) 453743.00 ( 10.45%) 447782.00 ( 9.00%) 424739.00 ( 3.39%)
TPut 43 392528.00 ( 0.00%) 420059.00 ( 7.01%) 437595.00 ( 11.48%) 426651.00 ( 8.69%) 413291.00 ( 5.29%)

These are the mean throughput figures between JVMs and the standard
deviation. Note that with the patches applied there is a lot less
deviation between JVMs in many cases. As the number of clients increases
the performance improves. This is still far short of the theoretical best
performance but it's a step in the right direction.

specjbb Peaks
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla morefaults-v2r2 scalescan-v2r12 scanshared-v2r12 accountpreferred-v2r12
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
Actual Warehouse 8.00 ( 0.00%) 8.00 ( 0.00%) 8.00 ( 0.00%) 8.00 ( 0.00%) 8.00 ( 0.00%)
Actual Peak Bops 601951.00 ( 0.00%) 612240.00 ( 1.71%) 620470.00 ( 3.08%) 610409.00 ( 1.41%) 605725.00 ( 0.63%)

Peaks are only marginally improved even though many of the individual
throughput figures look ok.

3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla   morefaults-v2r2   scalescan-v2r12   scanshared-v2r12   accountpreferred-v2r12
User 78020.94 77250.13 78334.69 78027.78 77752.27
System 305.94 261.17 228.74 234.28 240.04
Elapsed 1744.52 1717.79 1744.31 1742.62 1730.79

And the performance is improved with a healthy reduction in system CPU time.

3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla   morefaults-v2r2   scalescan-v2r12   scanshared-v2r12   accountpreferred-v2r12
THP fault alloc 65433 64779 64234 65547 63519
THP collapse alloc 51 54 58 55 55
THP splits 55 49 46 51 56
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 20348847 15323475 11375529 11777597 12110444
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 21122 15905 11807 12225 12570
NUMA PTE updates 180124094 145320534 109608785 108346894 107390100
NUMA hint faults 2358728 1623277 1489903 1472556 1378097
NUMA hint local faults 835051 603375 585183 516949 425342
NUMA pages migrated 20348847 15323475 11375529 11777597 12110444
AutoNUMA cost 13441 9424 8432 8344 7872

Far fewer PTE updates and faults.

The performance is still a mixed bag. Patches 1-9 are generally
good. Conceptually I think the other patches make sense but need a bit
more love. The last patch in particular will be replaced with more of
Peter's work.

Documentation/sysctl/kernel.txt | 68 +++++++++
include/linux/migrate.h | 7 +-
include/linux/mm_types.h | 3 -
include/linux/sched.h | 21 ++-
include/linux/sched/sysctl.h | 1 -
kernel/sched/core.c | 33 +++-
kernel/sched/fair.c | 330 ++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 16 ++
kernel/sysctl.c | 14 +-
mm/huge_memory.c | 7 +-
mm/memory.c | 13 +-
mm/migrate.c | 17 +--
mm/mprotect.c | 4 +-
13 files changed, 462 insertions(+), 72 deletions(-)

--
1.8.1.4
