[RFC PATCH 0/5] Memory compaction efficiency improvements

From: Vlastimil Babka
Date: Mon Nov 25 2013 - 09:26:53 EST


The broad goal of the series is to improve allocation success rates for huge
pages through memory compaction, while trying not to increase the compaction
overhead. The original objective was to reintroduce capturing of high-order
pages freed by the compaction, before they are split by concurrent activity.
However, several bugs and opportunities for simple improvements were found in
the current implementation, mostly through extra tracepoints (which are however
too ugly for now to be considered for sending).

The patches mostly deal with two mechanisms that reduce compaction overhead:
caching the progress of the migrate and free scanners, and marking pageblocks
where isolation failed so that they are skipped during further scans.
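
To make the two mechanisms concrete, here is a minimal toy model in C. It is
only a sketch and not the kernel code: struct toy_zone, toy_compact() and the
"every third block is unmovable" layout are made up for illustration, and real
compaction works on pfn ranges and pages rather than on whole blocks.

```c
#include <stdbool.h>

#define NR_BLOCKS 16

/* Hypothetical toy zone, NOT the kernel's struct zone. */
struct toy_zone {
    bool movable[NR_BLOCKS];  /* can pages in this block be isolated? */
    bool skip[NR_BLOCKS];     /* set once isolation has failed here */
    int cached_migrate;       /* block where the migrate scanner resumes */
    int cached_free;          /* block where the free scanner resumes */
};

static void toy_reset_positions(struct toy_zone *z)
{
    z->cached_migrate = 0;
    z->cached_free = NR_BLOCKS - 1;
}

static void toy_zone_init(struct toy_zone *z)
{
    for (int i = 0; i < NR_BLOCKS; i++) {
        z->movable[i] = (i % 3 != 0);  /* assumption: every third block is unmovable */
        z->skip[i] = false;
    }
    toy_reset_positions(z);
}

/*
 * One compaction run limited to 'budget' migrate-side blocks.
 * Returns how many blocks were actually examined (the "overhead").
 */
static int toy_compact(struct toy_zone *z, int budget)
{
    int migrate = z->cached_migrate;  /* resume from cached progress */
    int free_pos = z->cached_free;
    int examined = 0;

    /* Stop when the budget runs out or when the two scanners meet. */
    while (migrate < free_pos && budget-- > 0) {
        if (!z->skip[migrate]) {
            examined++;
            if (z->movable[migrate])
                free_pos--;               /* pages moved to the free side */
            else
                z->skip[migrate] = true;  /* do not rescan this block next time */
        }
        migrate++;
    }

    /* Cache progress so that the next run continues from here. */
    z->cached_migrate = migrate;
    z->cached_free = free_pos;
    return examined;
}
```

In this model a second toy_compact() call continues at block 4 instead of
block 0, and after toy_reset_positions() a rescan of blocks 0-3 examines only
the two blocks that were not marked as skipped.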

Patch 1 encapsulates some functionality for handling deferred compactions
for better maintainability, without a functional change
type is not determined without being actually needed.

Patch 2 fixes a bug where cached scanner pfn's are sometimes reset only after
they have been read to initialize a compaction run.

Patch 3 fixes a bug where the scanners meeting is sometimes not properly
detected, which can lead to multiple compaction attempts quitting early
without doing any work.

Patch 4 improves the chances that sync compaction processes pageblocks that
async compaction has skipped due to being !MIGRATE_MOVABLE.

Patch 5 improves the chances that sync direct compaction actually does
anything when called after async compaction fails in the allocation slowpath.
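
The ordering problem described for patch 2 can be sketched as follows. This is
a hedged illustration only: struct toy_cache, the reset_pending flag and both
toy_start_*() helpers are hypothetical names, and the real conditions under
which the kernel resets the cached pfn's are not spelled out in this letter.

```c
#include <stdbool.h>

/* Hypothetical cache of scanner positions, NOT a kernel structure. */
struct toy_cache {
    unsigned long cached_migrate_pfn;
    unsigned long cached_free_pfn;
    bool reset_pending;  /* assumption: cached positions are stale */
};

/* Starting positions used by one compaction run. */
struct toy_run {
    unsigned long migrate_pfn;
    unsigned long free_pfn;
};

static void toy_reset(struct toy_cache *c, unsigned long start, unsigned long end)
{
    c->cached_migrate_pfn = start;
    c->cached_free_pfn = end;
    c->reset_pending = false;
}

/* Problematic ordering: the run is initialized first, the reset comes too late. */
static struct toy_run toy_start_read_then_reset(struct toy_cache *c,
                                                unsigned long start,
                                                unsigned long end)
{
    struct toy_run r = { c->cached_migrate_pfn, c->cached_free_pfn };

    if (c->reset_pending)
        toy_reset(c, start, end);  /* only helps some later run */
    return r;
}

/* Ordering in the spirit of patch 2: reset first, then read. */
static struct toy_run toy_start_reset_then_read(struct toy_cache *c,
                                                unsigned long start,
                                                unsigned long end)
{
    if (c->reset_pending)
        toy_reset(c, start, end);

    struct toy_run r = { c->cached_migrate_pfn, c->cached_free_pfn };
    return r;
}

/* A run has work to do only while the scanners have not met. */
static bool toy_run_has_work(const struct toy_run *r)
{
    return r->migrate_pfn < r->free_pfn;
}
```

With stale cached positions where the scanners have already met, the first
variant starts a run that has nothing to do, while the second starts from the
freshly reset positions.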


Here are some preliminary results from the stress-highalloc benchmark of
mmtests on an x86_64 machine with 4GB of memory. First, the default
GFP_HIGHUSER_MOVABLE allocations, with the patches stacked on top of mainline
master as of Friday (commit a5d6e633 merging fixes from Andrew). Patch 1 is OK
to serve as the baseline since it makes no functional change. Comments below.

stress-highalloc
                    master           master           master           master           master
                   1-nothp          2-nothp          3-nothp          4-nothp          5-nothp
Success 1  34.00 (  0.00%)  20.00 ( 41.18%)  44.00 (-29.41%)  45.00 (-32.35%)  25.00 ( 26.47%)
Success 2  31.00 (  0.00%)  21.00 ( 32.26%)  47.00 (-51.61%)  47.00 (-51.61%)  28.00 (  9.68%)
Success 3  68.00 (  0.00%)  88.00 (-29.41%)  86.00 (-26.47%)  87.00 (-27.94%)  88.00 (-29.41%)

                master      master      master      master      master
               1-nothp     2-nothp     3-nothp     4-nothp     5-nothp
User           6334.04     6343.09     5938.15     5860.00     6674.38
System         1044.15     1035.84     1022.68     1021.11     1055.76
Elapsed        1787.06     1714.76     1829.14     1850.91     1789.83

                                master      master      master      master      master
                               1-nothp     2-nothp     3-nothp     4-nothp     5-nothp
Minor Faults                 248365069   244975796   247192462   243720231   248888409
Major Faults                       427         442         563         504         414
Swap Ins                             7           3           8           7           0
Swap Outs                          345         338         570         235         415
Direct pages scanned            239929      166220      276238      277310      202409
Kswapd pages scanned           1759082     1819998     1880477     1850421     1809928
Kswapd pages reclaimed         1756781     1813653     1877783     1847704     1806347
Direct pages reclaimed          239291      165988      276163      277048      202092
Kswapd efficiency                  99%         99%         99%         99%         99%
Kswapd velocity                984.344    1061.372    1028.066     999.736    1011.229
Direct efficiency                  99%         99%         99%         99%         99%
Direct velocity                134.259      96.935     151.021     149.824     113.088
Percentage direct scans            12%          8%         12%         13%         10%
Zone normal velocity           362.126     440.499     374.597     354.049     360.196
Zone dma32 velocity            756.478     717.808     804.490     795.511     764.122
Zone dma velocity                0.000       0.000       0.000       0.000       0.000
Page writes by reclaim         450.000     476.000     570.000     306.000     639.000
Page writes file                   105         138           0          71         224
Page writes anon                   345         338         570         235         415
Page reclaim immediate             660        4407         167         843        1553
Sector Reads                   2734844     2725576     2951744     2830472     2791216
Sector Writes                 11938520    11729108    11769760    11743120    11805320
Page rescued immediate               0           0           0           0           0
Slabs scanned                  1596544     1520768     1767552     1774720     1555584
Direct inode steals               9764        6640       14010       15320        8315
Kswapd inode steals              47445       42888       49705       51043       43283
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                     78          30          43          34          31
THP collapse alloc                 485         371         570         559         306
THP splits                           6           1           2           4           2
THP fault fallback                   0           0           0           0           0
THP collapse fail                   13          16          11          12          16
Compaction stalls                 1067        1072        1629        1578        1140
Compaction success                 339         275         568         595         329
Compaction failures                728         797        1061         983         811
Page migrate success           1115929     1113188     3966997     4076178     4220010
Page migrate failure                 0           0           0           0           0
Compaction pages isolated      2423867     2425024     8351264     8583856     8789144
Compaction migrate scanned    38956505    62526876   153906340   174085307   114170442
Compaction free scanned       83126040    51071610   396724121   358193857   389459415
Compaction cost                   1477        1639        5353        5612        5346
NUMA PTE updates                     0           0           0           0           0
NUMA hint faults                     0           0           0           0           0
NUMA hint local faults               0           0           0           0           0
NUMA hint local percent            100         100         100         100         100
NUMA pages migrated                  0           0           0           0           0
AutoNUMA cost                        0           0           0           0           0

Observations:
- The "Success 3" line is the allocation success rate with the system idle
  (phases 1 and 2 run with background interference). I used to get values
  around 85% with vanilla 3.11 and observed an occasional drop to around 65%
  in 3.12, with about 50% chance. This was bisected to commit 81c0a2bb ("mm:
  page_alloc: fair zone allocator policy") using 10 repeats of the benchmark
  and marking a commit as 'bad' if the bad result appeared at least once (to
  fight the uncertainty). As explained in the comment for patch 3, I don't
  think the commit is wrong, but that it makes the effect of the bugs worse.
  From patch 3 onwards, the results are OK. Here it might seem that patch 2
  helps, but that's just the uncertainty. I plan to add support for more
  iterations and statistical summarizing of the results to fight this...
- It might seem that patch 5 is regressing phases 1 and 2, but since that was
  not the case when testing against 3.12, I would say it's just another case
  of unstable results. Phases 1 and 2 are more prone to that in general.
  However, I have never seen unpatched 3.11 or 3.12 go above 40% as patch 3
  does.
- Compaction cost and the number of scanned pages are higher, especially due
  to patch 3. However, keep in mind that patches 2 and 3 fix existing bugs in
  the current design of overhead mitigation; they do not change it. If the
  overhead is found unacceptable, then it should be decreased in a different
  way than the current implementation does (and consistently, not due to
  random conditions). In contrast, patches 4 and 5 (which are not strictly
  bug fixes) do not increase the overhead (but do not improve success rates
  either).
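
For readers checking the tables: the parenthesised percentages and the
velocity rows appear to be consistent with the simple formulas below. This is
inferred from the numbers in this mail, not taken from the mmtests sources, and
the function names are made up for illustration.

```c
/*
 * Relative difference against the baseline column, in percent.
 * Note the sign: a lower success rate shows up as a positive value,
 * e.g. baseline 34.00 vs. 20.00 gives about 41.18.
 */
static double rel_diff_pct(double baseline, double value)
{
    return (baseline - value) / baseline * 100.0;
}

/* Pages scanned per second of elapsed benchmark time. */
static double scan_velocity(unsigned long pages_scanned, double elapsed_sec)
{
    return (double)pages_scanned / elapsed_sec;
}

/* Share of all reclaim scanning that was done by direct reclaim, in percent. */
static double direct_scan_pct(unsigned long direct, unsigned long kswapd)
{
    return 100.0 * (double)direct / (double)(direct + kswapd);
}
```

For the baseline column above, scan_velocity(1759082, 1787.06) gives roughly
984.34 (the "Kswapd velocity" row) and direct_scan_pct(239929, 1759082) gives
roughly 12 (the "Percentage direct scans" row).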

Another set of preliminary results comes from configuring stress-highalloc to
allocate with flags similar to those THP uses:
(GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)

stress-highalloc
                    master           master           master           master           master
                     1-thp            2-thp            3-thp            4-thp            5-thp
Success 1  29.00 (  0.00%)   7.00 ( 75.86%)  25.00 ( 13.79%)  32.00 (-10.34%)  32.00 (-10.34%)
Success 2  30.00 (  0.00%)   7.00 ( 76.67%)  29.00 (  3.33%)  34.00 (-13.33%)  37.00 (-23.33%)
Success 3  70.00 (  0.00%)  70.00 (  0.00%)  85.00 (-21.43%)  85.00 (-21.43%)  85.00 (-21.43%)

                master      master      master      master      master
                 1-thp       2-thp       3-thp       4-thp       5-thp
User           5915.36     6769.19     6350.04     6421.90     6571.80
System         1017.80     1053.70     1039.06     1051.84     1061.59
Elapsed        1757.87     1724.31     1744.66     1822.78     1841.42

                                master      master      master      master      master
                                 1-thp       2-thp       3-thp       4-thp       5-thp
Minor Faults                 246004967   248169249   244469991   248893104   245151725
Major Faults                       403         282         354         369         436
Swap Ins                             8           8          10           7           8
Swap Outs                          534         530         325         694         687
Direct pages scanned            106122       76339      168386      202576      170449
Kswapd pages scanned           1924013     1803706     1855293     1872408     1907170
Kswapd pages reclaimed         1920762     1800403     1852989     1869573     1904070
Direct pages reclaimed          105986       76291      168183      202440      170343
Kswapd efficiency                  99%         99%         99%         99%         99%
Kswapd velocity               1094.514    1046.045    1063.412    1027.227    1035.706
Direct efficiency                  99%         99%         99%         99%         99%
Direct velocity                 60.370      44.272      96.515     111.136      92.564
Percentage direct scans             5%          4%          8%          9%          8%
Zone normal velocity           362.047     386.497     361.529     371.628     369.295
Zone dma32 velocity            792.836     703.820     798.398     766.734     758.975
Zone dma velocity                0.000       0.000       0.000       0.000       0.000
Page writes by reclaim         741.000     751.000     325.000     694.000     924.000
Page writes file                   207         221           0           0         237
Page writes anon                   534         530         325         694         687
Page reclaim immediate             895         856         479         396         512
Sector Reads                   2769992     2627604     2735740     2828672     2836412
Sector Writes                 11748724    11660652    11598304    11800576    11753996
Page rescued immediate               0           0           0           0           0
Slabs scanned                  1485952     1233024     1457280     1492096     1544320
Direct inode steals               2565         537        3384        6389        3205
Kswapd inode steals              50112       42207       46892       45371       49542
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                     28           2          23          31          28
THP collapse alloc                 485         276         417         539         514
THP splits                           0           0           0           2           3
THP fault fallback                   0           0           0           0           0
THP collapse fail                   13          19          17          12          12
Compaction stalls                  813         474         964        1052        1050
Compaction success                 332          92         359         434         411
Compaction failures                481         382         605         617         639
Page migrate success            582816      359101      973579      950980     1085585
Page migrate failure                 0           0           0           0           0
Compaction pages isolated      1327894      806679     2256066     2195431     2461078
Compaction migrate scanned    13244945     7977159    21513942    23189436    30051866
Compaction free scanned       35192520    19254827    76152850    71159488    77702117
Compaction cost                    722         443        1204        1191        1383
NUMA PTE updates                     0           0           0           0           0
NUMA hint faults                     0           0           0           0           0
NUMA hint local faults               0           0           0           0           0
NUMA hint local percent            100         100         100         100         100
NUMA pages migrated                  0           0           0           0           0
AutoNUMA cost                        0           0           0           0           0

                      master      master      master      master      master
                       1-thp       2-thp       3-thp       4-thp       5-thp
Mean sda-avgqz         46.01       46.31       46.43       46.87       45.94
Mean sda-await        271.19      273.75      273.84      270.12      269.69
Mean sda-r_await       35.33       35.52       34.26       33.98       33.61
Mean sda-w_await      474.54      497.59      603.64      567.32      488.48
Max sda-avgqz         158.33      168.62      166.68      165.51      165.82
Max sda-await        1461.41     1374.49     1380.31     1427.35     1402.61
Max sda-r_await       197.46      286.67      112.65      112.07      158.24
Max sda-w_await      9986.97    11363.36    16119.59    12365.75    11706.65

There are some differences from the previous results for THP-like allocations:
- Here, the bad result for the unpatched kernel in phase 3 is much more
  consistent, staying between 65% and 70%, and is not due to the "regression"
  in 3.12. Still, there is the improvement from patch 3 onwards, which brings
  it on par with simple GFP_HIGHUSER_MOVABLE allocations.
- The drop with patch 2 is again not a regression but due to result
  variability.
- The compaction overhead in patches 2 and 3 and the arguments about it are
  similar to the above.
- Patch 5 increases the number of migrate-scanned pages significantly. This
  is most likely due to the __GFP_NO_KSWAPD flag, which means the cached
  pfn's are not reset by kswapd, and the patch thus helps the
  sync-after-async compaction. It does not, however, show that the sync
  compaction would help with success rates. One of the further patches I'm
  considering for future versions is to ignore or clear pageblock skip
  information for sync compaction. But in that case, THP clearly should be
  changed so that it does not fall back to the sync compaction.




Vlastimil Babka (5):
  mm: compaction: encapsulate defer reset logic
  mm: compaction: reset cached scanner pfn's before reading them
  mm: compaction: detect when scanners meet in isolate_freepages
  mm: compaction: do not mark unmovable pageblocks as skipped in async
    compaction
  mm: compaction: reset scanner positions immediately when they meet

 include/linux/compaction.h | 12 +++++++++++
 mm/compaction.c            | 53 ++++++++++++++++++++++++++++++----------------
 mm/page_alloc.c            |  5 +----
 3 files changed, 48 insertions(+), 22 deletions(-)

--
1.8.1.4
