[PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1

From: Mel Gorman
Date: Mon Sep 06 2010 - 06:48:45 EST


There have been numerous reports of stalls that pointed at the problem being
somewhere in the VM. There are multiple root causes, which makes it tricky
to justify dealing with any one of them in isolation, and each fix would
still need integration testing. This patch series gathers together three
different patch sets which in combination should tackle some of the root
causes of the latency problems being reported.

The first patch improves vmscan latency tracking by reporting when pages get
reclaimed by shrink_inactive_list. For this series, the most important result
is being able to calculate the scanning/reclaim ratio as a measure of the
amount of work being done by page reclaim.
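
As an illustration of what that ratio measures, below is a minimal
userspace sketch (not the kernel patch itself) that computes the
scanning/reclaim efficiency from the scanned and reclaimed totals such a
tracepoint exposes. The sample numbers are the X86-64 traceonly totals
reported further down.

#include <stdio.h>

struct reclaim_sample {
	unsigned long nr_scanned;
	unsigned long nr_reclaimed;
};

/* Fraction of scanned pages that were actually reclaimed, as a percentage */
static double reclaim_efficiency(const struct reclaim_sample *s)
{
	if (!s->nr_scanned)
		return 0.0;
	return 100.0 * (double)s->nr_reclaimed / (double)s->nr_scanned;
}

int main(void)
{
	/* Totals for traceonly-v1r5 on X86-64 from the tables below */
	struct reclaim_sample s = { 4717478, 2239041 };

	printf("scanned/reclaimed ratio: %.2f%%\n", reclaim_efficiency(&s));
	return 0;
}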

Patches 2 and 3 account for the time spent in congestion_wait() and avoid
going to sleep on congestion when it is unnecessary. This is expected to
reduce stalls in situations where the system is under memory pressure but
not suffering from congestion.
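
The decision the new wait_iff_congested() makes can be sketched roughly as
follows. This is a simplified model with placeholder callbacks standing in
for the kernel's sleep and yield primitives, not the code as submitted:

/*
 * Sketch only: sleep waiting for congestion to clear if, and only if,
 * a backing device is actually congested. Otherwise yield briefly so
 * that reclaim does not stall for a full timeout for no reason.
 */
static long wait_iff_congested_sketch(int nr_congested_bdis,
				      long (*sleep_fn)(long timeout),
				      void (*yield_fn)(void),
				      long timeout)
{
	if (nr_congested_bdis == 0) {
		/* No congestion: account a "conditional wait" and return */
		yield_fn();
		return 0;
	}

	/* Congestion really is present: behave like congestion_wait() */
	return sleep_fn(timeout);
}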

Patches 4-8 were originally developed by Kosaki Motohiro and were reworked
for this series. It has been noted that lumpy reclaim is far too aggressive
and thrashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also reported
during transparent hugepage support testing that lumpy reclaim was thrashing
the system and these patches should mitigate that problem without disabling
lumpy reclaim.
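
To illustrate the sort of throttling involved, here is a sketch (with
hypothetical names, only approximating what the patches do) of a check
deciding whether lumpy reclaim should fall back to the expensive
synchronous mode that writes pages and waits on them:

#include <stdbool.h>

#define DEF_PRIORITY	12

static bool should_stall_for_lumpy(unsigned long nr_taken,
				   unsigned long nr_freed,
				   int priority,
				   bool lumpy_reclaim)
{
	/* Synchronous stalling only ever applies to lumpy reclaim */
	if (!lumpy_reclaim)
		return false;

	/* The asynchronous pass freed everything it isolated: no need */
	if (nr_freed == nr_taken)
		return false;

	/* Only stall once the scan priority indicates real memory pressure */
	return priority <= DEF_PRIORITY / 3;
}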

Patches 9-10 revisit avoiding filesystem writeback from direct reclaim. Such
writeback has been reported as a potential cause of stack overflow but it can
also result in poor IO patterns that increase reclaim latencies.
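
The gist of the direct reclaim change can be sketched as below, with
hypothetical types and helper names rather than the actual patch: dirty
file pages encountered by a direct reclaimer are left for kswapd and the
flusher threads instead of being written back from a potentially deep
call stack.

#include <stdbool.h>

struct page_info {
	bool is_file_backed;	/* page cache page as opposed to anonymous */
};

/* May this reclaim context call ->writepage() on a dirty page? */
static bool reclaim_may_writepage(const struct page_info *page,
				  bool current_is_kswapd)
{
	/* Anonymous pages still go to swap from either reclaim context */
	if (!page->is_file_backed)
		return true;

	/* File pages: only kswapd issues writeback, never direct reclaim */
	return current_is_kswapd;
}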

There are patches similar to 9-10 already in mmotm but Andrew had concerns
about their impact. Hence, I revisited them as the last part of this series
for re-evaluation.

I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were

X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP

Each machine used a single disk and the onboard IO controller. Dirty ratio
was left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested, each based on
v2.6.36-rc3:

traceonly-v1r5: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
nocongest-v1r5: Patches 1-3 for testing wait_iff_congested()
lowlumpy-v1r5: Patches 1-8 to test less aggressive lumpy reclaim
nodirect-v1r5: Patches 1-10 to also avoid filesystem writeback from direct reclaim

The tests run were as follows

kernbench
compile-based benchmark. Smoke test performance

iozone
Smoke test performance; it does not put the system under major stress

sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write

micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.

stress-highalloc
Tries to allocate huge pages under heavy load.

kernbench, iozone and sysbench did not report any performance regression
on any machine and as they did not put the machine under memory pressure
the main paths this series deals with were not exercised. sysbench will be
re-run in the future with read-write testing as it is sensitive to writeback
performance under memory pressure. It is an oversight that it didn't happen
for this test.

X86-64 micro-mapped-file-stream
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
pgalloc_dma 2631.00 ( 0.00%) 2483.00 ( -5.96%) 2375.00 ( -10.78%) 2467.00 ( -6.65%)
pgalloc_dma32 2840528.00 ( 0.00%) 2841510.00 ( 0.03%) 2841391.00 ( 0.03%) 2842308.00 ( 0.06%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 1383.00 ( 0.00%) 1182.00 ( -17.01%) 1177.00 ( -17.50%) 1181.00 ( -17.10%)
pgsteal_dma32 2237658.00 ( 0.00%) 2236581.00 ( -0.05%) 2219885.00 ( -0.80%) 2234527.00 ( -0.14%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 3006.00 ( 0.00%) 1400.00 (-114.71%) 1547.00 ( -94.31%) 1347.00 (-123.16%)
pgscan_kswapd_dma32 4206487.00 ( 0.00%) 3343082.00 ( -25.83%) 3425728.00 ( -22.79%) 3304369.00 ( -27.30%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629.00 ( 0.00%) 1793.00 ( 64.92%) 1643.00 ( 61.72%) 1868.00 ( 66.33%)
pgscan_direct_dma32 506741.00 ( 0.00%) 1402557.00 ( 63.87%) 1330777.00 ( 61.92%) 1448345.00 ( 65.01%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 15449.00 ( 0.00%) 15555.00 ( 0.68%) 15319.00 ( -0.85%) 15963.00 ( 3.22%)
allocstall 152.00 ( 0.00%) 941.00 ( 83.85%) 967.00 ( 84.28%) 729.00 ( 79.15%)

These are just the raw figures taken from /proc/vmstat and are a rough measure
of reclaim activity. Note that the allocstall counts are higher because we
enter direct reclaim more often as a result of not sleeping on congestion. In
itself, that is not necessarily a bad thing. It's easier to get a view of what
happened from the vmscan tracepoint report.
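
For reference, the raw counters in the table above come straight from
/proc/vmstat and can be collected with something like the snippet below;
the field names are the standard vmstat ones and nothing else is assumed:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char name[64];
	unsigned long long value;
	FILE *fp = fopen("/proc/vmstat", "r");

	if (!fp) {
		perror("/proc/vmstat");
		return 1;
	}

	/* Print only the reclaim-related counters used in the table above */
	while (fscanf(fp, "%63s %llu", name, &value) == 2) {
		if (!strncmp(name, "pgalloc_", 8) ||
		    !strncmp(name, "pgscan_", 7) ||
		    !strncmp(name, "pgsteal_", 8) ||
		    !strcmp(name, "allocstall") ||
		    !strcmp(name, "pageoutrun"))
			printf("%-24s %llu\n", name, value);
	}

	fclose(fp);
	return 0;
}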

FTrace Reclaim Statistics: vmscan
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 152 941 967 729
Direct reclaim pages scanned 507377 1404350 1332420 1450213
Direct reclaim pages reclaimed 10968 72042 77186 41097
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 127195 241025 254825 188846
Kswapd wakeups 6 1 1 1
Kswapd pages scanned 4210101 3345122 3427915 3306356
Kswapd pages reclaimed 2228073 2165721 2143876 2194611
Kswapd reclaim write file async I/O 0 0 0 0
Kswapd reclaim write anon async I/O 0 0 0 0
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 7.60 3.03 3.24 3.43
Time kswapd awake (seconds) 12.46 9.46 9.56 9.40

Total pages scanned 4717478 4749472 4760335 4756569
Total pages reclaimed 2239041 2237763 2221062 2235708
%age total pages scanned/reclaimed 47.46% 47.12% 46.66% 47.00%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 43.80% 21.38% 22.34% 23.46%
Percentage Time kswapd Awake 79.92% 79.56% 79.20% 80.48%

What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains about
the same and the ratio of pages scanned to pages reclaimed is more or less
the same. In other words, while we are sleeping less, reclaim is not doing
more work and, in fact, less time is spent stalled in direct reclaim and with
kswapd awake. Overall, the series reduces reclaim work.

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 148 0 0 0
Direct time congest waited 8376ms 0ms 0ms 0ms
Direct full congest waited 127 0 0 0
Direct number conditional waited 0 711 693 627
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 127 0 0 0
KSwapd number congest waited 38 11 12 14
KSwapd time congest waited 3236ms 548ms 576ms 576ms
KSwapd full congest waited 31 3 3 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 31 3 3 2

The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 9.75 11.14 11.26 11.19
Total Elapsed Time (seconds) 15.59 11.89 12.07 11.68

And overall, the test completes significantly faster. The indications are that
reclaim did less work and there were fewer stalls. Seems good.

PPC64 micro-mapped-file-stream
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
pgalloc_dma 3027144.00 ( 0.00%) 3025080.00 ( -0.07%) 3025463.00 ( -0.06%) 3026037.00 ( -0.04%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2399696.00 ( 0.00%) 2399540.00 ( -0.01%) 2399592.00 ( -0.00%) 2399570.00 ( -0.01%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 3690319.00 ( 0.00%) 2883661.00 ( -27.97%) 2852314.00 ( -29.38%) 3008323.00 ( -22.67%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 1224036.00 ( 0.00%) 1975664.00 ( 38.04%) 2012185.00 ( 39.17%) 1907869.00 ( 35.84%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 15170.00 ( 0.00%) 14636.00 ( -3.65%) 14664.00 ( -3.45%) 16027.00 ( 5.35%)
allocstall 712.00 ( 0.00%) 1906.00 ( 62.64%) 1912.00 ( 62.76%) 2027.00 ( 64.87%)

Similar trends to x86-64. allocstalls are up but it's not necessarily bad.

FTrace Reclaim Statistics: vmscan
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 712 1906 1904 2021
Direct reclaim pages scanned 1224100 1975664 2010015 1906767
Direct reclaim pages reclaimed 79215 218292 202719 209388
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 1154724 805852 767944 848063
Kswapd wakeups 3 2 2 2
Kswapd pages scanned 3690799 2884173 2852026 3008835
Kswapd pages reclaimed 2320481 2181248 2195908 2189076
Kswapd reclaim write file async I/O 0 0 0 0
Kswapd reclaim write anon async I/O 0 0 0 0
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 21.02 7.19 7.72 6.76
Time kswapd awake (seconds) 39.55 25.31 24.88 24.83

Total pages scanned 4914899 4859837 4862041 4915602
Total pages reclaimed 2399696 2399540 2398627 2398464
%age total pages scanned/reclaimed 48.82% 49.37% 49.33% 48.79%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 43.44% 19.64% 20.77% 18.43%
Percentage Time kswapd Awake 87.36% 81.94% 81.84% 81.28%

Again, a similar trend: the congestion_wait changes mean that direct reclaim
scans more pages but the overall number of pages scanned is very similar and
the scanned/reclaimed ratio remains roughly the same. Once again, reclaim is
not doing more work but spends less time in direct reclaim and with kswapd
awake.

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 499 0 0 0
Direct time congest waited 22700ms 0ms 0ms 0ms
Direct full congest waited 421 0 0 0
Direct number conditional waited 0 1214 1242 1290
Direct time conditional waited 0ms 4ms 0ms 0ms
Direct full conditional waited 421 0 0 0
KSwapd number congest waited 257 103 94 104
KSwapd time congest waited 22116ms 7344ms 7476ms 7528ms
KSwapd full congest waited 203 57 59 56
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 203 57 59 56

The vanilla kernel spent 22 seconds asleep in direct reclaim while the patched
kernels spent no time at all asleep there, which is a big improvement. The
time kswapd spent waiting on congestion was also reduced by a large factor.

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 27.37 29.42 29.45 29.91
Total Elapsed Time (seconds) 45.27 30.89 30.40 30.55

And the test again completed far faster.

X86-64 STRESS-HIGHALLOC
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Pass 1 84.00 ( 0.00%) 84.00 ( 0.00%) 80.00 (-4.00%) 72.00 (-12.00%)
Pass 2 94.00 ( 0.00%) 94.00 ( 0.00%) 89.00 (-5.00%) 88.00 (-6.00%)
At Rest 95.00 ( 0.00%) 95.00 ( 0.00%) 95.00 ( 0.00%) 92.00 (-3.00%)

Success figures start dropping off for lowlumpy and nodirect. This ordinarily
would be a concern but the rest of the report paints a better picture.

FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 838 1189 1323 1197
Direct reclaim pages scanned 182207 168696 146310 133117
Direct reclaim pages reclaimed 84208 81706 80442 54879
Direct reclaim write file async I/O 538 619 839 0
Direct reclaim write anon async I/O 36403 32892 44126 22085
Direct reclaim write file sync I/O 88 108 1 0
Direct reclaim write anon sync I/O 19107 15514 871 0
Wake kswapd requests 7761 827 865 6502
Kswapd wakeups 749 733 658 614
Kswapd pages scanned 6400676 6871918 6875056 3126591
Kswapd pages reclaimed 3122126 3376919 3001799 1669300
Kswapd reclaim write file async I/O 58199 67175 28483 925
Kswapd reclaim write anon async I/O 1740452 1851455 1680964 186578
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 3864.84 4426.77 3108.85 254.08
Time kswapd awake (seconds) 1792.00 2130.10 1890.76 343.37

Total pages scanned 6582883 7040614 7021366 3259708
Total pages reclaimed 3206334 3458625 3082241 1724179
%age total pages scanned/reclaimed 48.71% 49.12% 43.90% 52.89%
%age total pages scanned/written 28.18% 27.95% 25.00% 6.43%
%age file pages scanned/written 0.89% 0.96% 0.42% 0.03%
Percentage Time Spent Direct Reclaim 53.38% 56.75% 47.80% 8.44%
Percentage Time kswapd Awake 35.35% 37.88% 43.97% 23.01%

Scanned/reclaimed ratios again look good. The Scanned/written ratios look
very good for the nodirect patches showing that the writeback is happening
more in the flusher threads and less from direct reclaim. The expectation
is that the IO should be more efficient and indeed the time spent in direct
reclaim is massively reduced by the full series and kswapd spends a little
less time awake.

Overall, indications here are that things are moving much faster.

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1060 1 0 0
Direct time congest waited 63664ms 100ms 0ms 0ms
Direct full congest waited 617 1 0 0
Direct number conditional waited 0 1650 866 838
Direct time conditional waited 0ms 20296ms 1916ms 17652ms
Direct full conditional waited 617 1 0 0
KSwapd number congest waited 399 0 466 12
KSwapd time congest waited 33376ms 0ms 33048ms 968ms
KSwapd full congest waited 318 0 312 9
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 318 0 312 9

The sleep times for congest wait get interesting here. congestion_wait()
times are dropped to almost zero but wait_iff_congested() is detecting
when there is in fact congestion or too much writeback and still going to
sleep. Overall the times are reduced though - from 63ish seconds to about 20.
We are still backing off but less aggressively.


MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3375.95 3374.04 3395.56 2756.97
Total Elapsed Time (seconds) 5068.80 5623.06 4300.45 1492.09

Oddly, the nocongest patches took longer to complete the test, but the full
series reduces the test time by almost an hour, completing in under a third
of the time. I also looked at the latency figures when allocating huge pages
and got this

http://www.csn.ul.ie/~mel/postings/vmscanreduce-20100609/highalloc-interlatency-hydra-mean.ps

So it looks like the latencies in general are reduced. The full series reduces
latency by massive amounts but there is also a hint as to why nocongest was
slower overall. Its latencies were lower up to the point where 72% of memory
was allocated as huge pages. After that, the latencies were higher, but this
problem is resolved later in the series.

PPC64 STRESS-HIGHALLOC
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Pass 1 27.00 ( 0.00%) 38.00 (11.00%) 31.00 ( 4.00%) 43.00 (16.00%)
Pass 2 41.00 ( 0.00%) 43.00 ( 2.00%) 33.00 (-8.00%) 55.00 (14.00%)
At Rest 84.00 ( 0.00%) 83.00 (-1.00%) 84.00 ( 0.00%) 85.00 ( 1.00%)

Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.

FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 461 426 547 915
Direct reclaim pages scanned 193118 171811 143647 138334
Direct reclaim pages reclaimed 130100 108863 65954 63043
Direct reclaim write file async I/O 442 293 748 0
Direct reclaim write anon async I/O 52948 45149 29910 9949
Direct reclaim write file sync I/O 34 154 0 0
Direct reclaim write anon sync I/O 33128 27267 119 0
Wake kswapd requests 302 282 306 233
Kswapd wakeups 154 146 123 132
Kswapd pages scanned 13019861 12506267 3409775 3072689
Kswapd pages reclaimed 4839299 4782393 1908499 1723469
Kswapd reclaim write file async I/O 77348 77785 14580 214
Kswapd reclaim write anon async I/O 2878272 2840643 428083 142755
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 7692.01 7473.31 1044.76 217.31
Time kswapd awake (seconds) 7332.64 7171.23 1059.70 357.02

Total pages scanned 13212979 12678078 3553422 3211023
Total pages reclaimed 4969399 4891256 1974453 1786512
%age total pages scanned/reclaimed 37.61% 38.58% 55.56% 55.64%
%age total pages scanned/written 23.02% 23.59% 13.32% 4.76%
%age file pages scanned/written 0.59% 0.62% 0.43% 0.01%
Percentage Time Spent Direct Reclaim 42.66% 43.22% 26.30% 6.59%
Percentage Time kswapd Awake 82.06% 82.08% 45.82% 21.87%

Initially, it looks like the scanned/reclaimed ratios are much higher
and that's a bad thing. However, the number of pages scanned is reduced
by around 75% and the times spent in direct reclaim and with kswapd are
*massively* reduced. Overall the VM seems to be doing a lot less work.

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 811 23 38 0
Direct time congest waited 40272ms 512ms 1496ms 0ms
Direct full congest waited 484 4 14 0
Direct number conditional waited 0 703 345 1281
Direct time conditional waited 0ms 22776ms 1312ms 10428ms
Direct full conditional waited 484 4 14 0
KSwapd number congest waited 1 0 6 6
KSwapd time congest waited 100ms 0ms 124ms 404ms
KSwapd full congest waited 1 0 1 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 1 0 1 2

Not as dramatic a story here but the time spent asleep is reduced and we can
still see that wait_iff_congested() goes to sleep when necessary.

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10340.18 9818.41 2927.13 3078.91
Total Elapsed Time (seconds) 8936.19 8736.59 2312.71 1632.74

The time to complete this test goes way down. Take the allocation success
rates - we are allocating 16% more memory as huge pages in less than a
fifth of the time and this is reflected in the allocation latency data

http://www.csn.ul.ie/~mel/postings/vmscanreduce-20100609/highalloc-interlatency-powyah-mean.ps

I recognise that this is a weighty series but the desktop latency and other
stall issues are a tricky topic. There are multiple possible root causes and
I believe this series tackles a number of them. I think the congestion_wait
changes will also impact Dave Chinner's fs-mark test that showed up in the
minute-long livelock report, but I'm hoping the filesystem people who were
complaining about latencies in the VM will test this series with their
respective workloads.

.../trace/postprocess/trace-vmscan-postprocess.pl | 39 +++-
include/linux/backing-dev.h | 2 +-
include/trace/events/vmscan.h | 44 ++++-
include/trace/events/writeback.h | 35 +++
mm/backing-dev.c | 71 ++++++-
mm/page_alloc.c | 4 +-
mm/vmscan.c | 253 +++++++++++++++-----
7 files changed, 368 insertions(+), 80 deletions(-)
