[PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2

From: Mel Gorman
Date: Wed Sep 15 2010 - 08:29:54 EST


This is v2 of a series to reduce some of the latencies seen in page reclaim
and to improve the efficiency a bit. There are a number of changes in this
revision. The first is to drop the patches avoiding writeback from direct
reclaim again. Wu asked me to look at a large number of his patches and I felt
it was best to do that independent of this series which should be relatively
uncontroversial. The second big change is to wait_iff_congested(). There
were a few complaints that the avoidance heuristic was way too fuzzy and
so I tried following Andrew's suggestion to take note of the return value
of bdi_write_congested() in may_write_to_queue() to identify when a zone
is congested.

Changelog since V2
o Reshuffle patches to order from least to most controversial
o Drop the patches dealing with writeback avoidance. Wu is working
on some patches that potentially collide with this area so it
will be revisited later
o Use BDI congestion feedback in wait_iff_congested() instead of
making a determination based on number of pages currently being
written back
o Do not use lock_page in pageout path
o Rebase to 2.6.36-rc4

Changelog since V1
o Fix mis-named function in documentation
o Added reviewed and acked bys

There have been numerous reports of stalls that pointed at the problem being
somewhere in the VM. There are multiple roots to the problems which means
dealing with any of the root problems in isolation is tricky to justify on
their own and they would still need integration testing. This patch series
puts together two different patch sets which in combination should tackle
some of the root causes of latency problems being reported.

Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.

Patch 2 accounts for time spent in congestion_wait.

Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive and
trashes the system somewhat. As SLUB uses high-order allocations, a large
cost incurred by lumpy reclaim will be noticeable. It was also reported
during transparent hugepage support testing that lumpy reclaim was trashing
the system and these patches should mitigate that problem without disabling
lumpy reclaim.

Patch 7 adds wait_iff_congested() and replaces some callers of congestion_wait().
wait_iff_congested() only sleeps if there is a BDI that is currently congested.

Patch 8 notes that any BDI being congested is not necessarily a problem
because there could be multiple BDIs of varying speeds and numberous zones. It
attempts to track when a zone being reclaimed contains many pages backed
by a congested BDI and if so, reclaimers wait on the congestion queue.

I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were

X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP

Each used a single disk and the onboard IO controller. Dirty ratio was left
at 20. I'm just going to report for X86-64 and PPC64 in a vague attempt to
keep this report short. Four kernels were tested each based on v2.6.36-rc4

traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested

nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO

The tests run were as follows

kernbench
compile-based benchmark. Smoke test performance

sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write

micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.

stress-highalloc
Tries to allocate huge pages under heavy load.

kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was reclaim
activity but there were no difference of major interest between the kernels.

X86-64 micro-mapped-file-stream

traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)

These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher because
we are entering direct reclaim more often as a result of not sleeping in
congestion. In itself, it's not necessarily a bad thing. It's easier to
get a view of what happened from the vmscan tracepoint report.

FTrace Reclaim Statistics: vmscan

traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19

Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%

What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the same
and the ratio of pages scanned to pages reclaimed is more or less the same. In
other words, while we are sleeping less, reclaim is not doing more work and
as direct reclaim and kswapd is awake for less time, it would appear to be doing less work.

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0

The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76

Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.

PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)

Similar trends to x86-64. allocstalls are up but it's not necessarily bad.

FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90

Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%

Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed remains
roughly similar. The downside is that kswapd and direct reclaim was awake
longer and for a larger percentage of the overall workload. It's possible
there were big differences in the amount of time spent reclaiming slab
pages between the different kernels which is plausible considering that
the micro tests runs after fsmark and sysbench.

Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0

The vanilla kernel spent 20 seconds asleep in direct reclaim and only 312ms
asleep with the patches. The time kswapd spent congest waited was also
reduced by a large factor.

MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88

With all patches applies, the completion times are very similar.


X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)

Success figures across the board are broadly similar.

traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82

Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%

Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be more
efficient and indeed, the time spent in direct reclaim is much reduced by
the full series and kswapd spends a little less time awake.

Overall, indications here are that allocations were
happening much faster and this can be seen with a graph of
the latency figures as the allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9


Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39

With the full series, the time to complete the tests are reduced by 30%

PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)

Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.

FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88

Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%

While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct reclaim
and with kswapd are massively reduced, mostly by the lowlumpy patches.

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0


Not as dramatic a story here but the time spent asleep is reduced and we can still
see what wait_iff_congested is going to sleep when necessary.

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85

The time to complete this test goes way down. With the full series, we are allocating
over twice the number of huge pages in 30% of the time and there is a corresponding
impact on the allocation latency graph available at.

http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps

I think this series is ready for much wider testing. The lowlumpy patches in
particular should be relatively uncontroversial. While their largest impact
can be seen in the high order stress tests, they would also have an impact
if SLUB was configured (these tests are based on slab) and stalls in lumpy
reclaim could be partially responsible for some desktop stalling reports.

The congestion_wait avoidance stuff was controversial in v1 because the
heuristic used to avoid the wait was a bit shaky. I'm expecting that this
version is more predictable.

.../trace/postprocess/trace-vmscan-postprocess.pl | 39 +++-
include/linux/backing-dev.h | 2 +-
include/linux/mmzone.h | 8 +
include/trace/events/vmscan.h | 44 ++++-
include/trace/events/writeback.h | 35 +++
mm/backing-dev.c | 66 ++++++-
mm/page_alloc.c | 4 +-
mm/vmscan.c | 226 ++++++++++++++------
8 files changed, 341 insertions(+), 83 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/