[PATCH 0/12] Avoid overflowing of stack during page reclaim V2

From: Mel Gorman
Date: Mon Jun 14 2010 - 07:18:52 EST


This is a merging of two series - the first of which reduces stack usage
in page reclaim and the second which writes contiguous pages during reclaim
and avoids writeback in direct reclaimers.

Changelog since V1
o Merge with series that reduces stack usage in page reclaim in general
o Allow memcg to writeback pages as they are not expected to overflow stack
o Drop the contiguous-write patch for the moment

There is a problem in the stack depth usage of page reclaim. Particularly
during direct reclaim, it is possible to overflow the stack if it calls
into the filesystems writepage function. This patch series aims to trace
writebacks so it can be evaulated how many dirty pages are being written,
reduce stack usage of page reclaim in general and avoid direct reclaim
writing back pages and overflowing the stack.

The first 4 patches are a forward-port of trace points that are partly based
on trace points defined by Larry Woodman but never merged. They trace parts of
kswapd, direct reclaim, LRU page isolation and page writeback. The tracepoints
can be used to evaluate what is happening within reclaim and whether things
are getting better or worse. They do not have to be part of the final series
but might be useful during discussion and for later regression testing -
particularly around the percentage of time spent in reclaim.

The 6 patches after that reduce the stack footprint of page reclaim by moving
large allocations out of the main call path. Functionally they should be
similar although there is a timing change on when pages get freed exactly.
This is aimed at giving filesystems as much stack as possible if kswapd is
to writeback pages directly.

Patch 11 puts dirty pages as it finds them onto a temporary list and then
writes them all out with a helper function. This simplifies patch 12 and
also increases the chances that IO requests can be optimally merged.

Patch 12 prevents direct reclaim writing out pages at all and instead dirty
pages are put back on the LRU. For lumpy reclaim, the caller will briefly
wait on dirty pages to be written out before trying to reclaim the dirty
pages a second time. This increases the responsibility of kswapd somewhat
because it's now cleaning pages on behalf of direct reclaimers but kswapd
seemed a better fit than background flushers to clean pages as it knows
where the pages needing cleaning are. As it's async IO, it should not cause
kswapd to stall (at least until the queue is congested) but the order that
pages are reclaimed on the LRU is altered. Dirty pages that would have been
reclaimed by direct reclaimers are getting another lap on the LRU. The
dirty pages could have been put on a dedicated list but this increased
counter overhead and the number of lists and it is unclear if it is necessary.

Apologies for the length of the rest of the mail. Measuring the impact of
this is not exactly straight-forward.

I ran a number of tests with monitoring on X86, X86-64 and PPC64 and I'll
cover what the X86-64 results were here. It's an AMD Phenom 4-core machine
with 2G of RAM with a single disk and the onboard IO controller. Dirty
ratio was left at 20 (tests with 40 are in progress). The filesystem all
the tests were run on was XFS.

Three kernels are compared.

traceonly-v2r5 is the first 4 patches of this series
stackreduce-v2r5 is the first 10 patches of this series
nodirect-v2r5 is all patches in the series

The results on each test is broken up into three parts. The first part
compares the results of the test itself. The second part is a report based
on the ftrace postprocessing script in patch 4 and reports on direct reclaim
and kswapd activity. The third part reports what percentage of time was
spent in direct reclaim and kswapd being awake.

To work out the percentage of time spent in direct reclaim, I used
/usr/bin/time to get the User + Sys CPU time. The stalled time was taken
from the post-processing script. The total time is (User + Sys + Stall)
and obviously the percentage is of stalled over total time.

kernbench
=========

traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Elapsed min 98.16 ( 0.00%) 97.95 ( 0.21%) 98.25 (-0.09%)
Elapsed mean 98.29 ( 0.00%) 98.26 ( 0.03%) 98.40 (-0.11%)
Elapsed stddev 0.08 ( 0.00%) 0.20 (-165.90%) 0.12 (-59.87%)
Elapsed max 98.34 ( 0.00%) 98.51 (-0.17%) 98.58 (-0.24%)
User min 311.03 ( 0.00%) 311.74 (-0.23%) 311.13 (-0.03%)
User mean 311.28 ( 0.00%) 312.45 (-0.38%) 311.42 (-0.05%)
User stddev 0.24 ( 0.00%) 0.51 (-114.08%) 0.38 (-59.06%)
User max 311.58 ( 0.00%) 312.94 (-0.44%) 312.06 (-0.15%)
System min 40.54 ( 0.00%) 39.65 ( 2.20%) 40.34 ( 0.49%)
System mean 40.80 ( 0.00%) 40.01 ( 1.93%) 40.81 (-0.03%)
System stddev 0.23 ( 0.00%) 0.34 (-47.57%) 0.29 (-25.47%)
System max 41.04 ( 0.00%) 40.51 ( 1.29%) 41.11 (-0.17%)
CPU min 357.00 ( 0.00%) 357.00 ( 0.00%) 357.00 ( 0.00%)
CPU mean 357.75 ( 0.00%) 358.00 (-0.07%) 357.75 ( 0.00%)
CPU stddev 0.43 ( 0.00%) 0.71 (-63.30%) 0.43 ( 0.00%)
CPU max 358.00 ( 0.00%) 359.00 (-0.28%) 358.00 ( 0.00%)

FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 0 0 0
Direct reclaim pages scanned 0 0 0
Direct reclaim write async I/O 0 0 0
Direct reclaim write sync I/O 0 0 0
Wake kswapd requests 0 0 0
Kswapd wakeups 0 0 0
Kswapd pages scanned 0 0 0
Kswapd reclaim write async I/O 0 0 0
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 0.00 0.00 0.00
Time kswapd awake (ms) 0.00 0.00 0.00

User/Sys Time Running Test (seconds) 2144.58 2146.22 2144.8
Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
Total Elapsed Time (seconds) 788.42 794.85 793.66
Percentage Time kswapd Awake 0.00% 0.00% 0.00%

kernbench is a straight-forward kernel compile. Kernel is built 5 times and and an
average taken. There was no interesting difference in terms of performance. As the
workload fit easily in memory, there was no page reclaim activity.

IOZone
======
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
write-64 395452 ( 0.00%) 397208 ( 0.44%) 397796 ( 0.59%)
write-128 463696 ( 0.00%) 460514 (-0.69%) 458940 (-1.04%)
write-256 504861 ( 0.00%) 506050 ( 0.23%) 502969 (-0.38%)
write-512 490875 ( 0.00%) 485767 (-1.05%) 494264 ( 0.69%)
write-1024 497574 ( 0.00%) 489689 (-1.61%) 505956 ( 1.66%)
write-2048 500993 ( 0.00%) 503076 ( 0.41%) 510097 ( 1.78%)
write-4096 504491 ( 0.00%) 502073 (-0.48%) 506993 ( 0.49%)
write-8192 488398 ( 0.00%) 228857 (-113.41%) 313871 (-55.60%)
write-16384 409006 ( 0.00%) 433783 ( 5.71%) 365696 (-11.84%)
write-32768 473136 ( 0.00%) 481153 ( 1.67%) 486373 ( 2.72%)
write-65536 474833 ( 0.00%) 477970 ( 0.66%) 481192 ( 1.32%)
write-131072 429557 ( 0.00%) 452604 ( 5.09%) 459840 ( 6.59%)
write-262144 397934 ( 0.00%) 401955 ( 1.00%) 397479 (-0.11%)
write-524288 222849 ( 0.00%) 230297 ( 3.23%) 209999 (-6.12%)
rewrite-64 1452528 ( 0.00%) 1492919 ( 2.71%) 1452528 ( 0.00%)
rewrite-128 1622919 ( 0.00%) 1618028 (-0.30%) 1663139 ( 2.42%)
rewrite-256 1694118 ( 0.00%) 1704877 ( 0.63%) 1639786 (-3.31%)
rewrite-512 1753325 ( 0.00%) 1740536 (-0.73%) 1730717 (-1.31%)
rewrite-1024 1741104 ( 0.00%) 1787480 ( 2.59%) 1759651 ( 1.05%)
rewrite-2048 1710867 ( 0.00%) 1747411 ( 2.09%) 1747411 ( 2.09%)
rewrite-4096 1583280 ( 0.00%) 1621536 ( 2.36%) 1599942 ( 1.04%)
rewrite-8192 1308005 ( 0.00%) 1338579 ( 2.28%) 1307358 (-0.05%)
rewrite-16384 1293742 ( 0.00%) 1314178 ( 1.56%) 1291602 (-0.17%)
rewrite-32768 1298360 ( 0.00%) 1314503 ( 1.23%) 1276758 (-1.69%)
rewrite-65536 1289212 ( 0.00%) 1316088 ( 2.04%) 1281351 (-0.61%)
rewrite-131072 1286117 ( 0.00%) 1309070 ( 1.75%) 1283007 (-0.24%)
rewrite-262144 1285902 ( 0.00%) 1305816 ( 1.53%) 1274121 (-0.92%)
rewrite-524288 220417 ( 0.00%) 223971 ( 1.59%) 226133 ( 2.53%)
read-64 3203069 ( 0.00%) 2467108 (-29.83%) 3541098 ( 9.55%)
read-128 3759450 ( 0.00%) 4267461 (11.90%) 4233807 (11.20%)
read-256 4264168 ( 0.00%) 4350555 ( 1.99%) 3935921 (-8.34%)
read-512 3437042 ( 0.00%) 3366987 (-2.08%) 3437042 ( 0.00%)
read-1024 3738636 ( 0.00%) 3821805 ( 2.18%) 3735385 (-0.09%)
read-2048 3938881 ( 0.00%) 3984558 ( 1.15%) 3967993 ( 0.73%)
read-4096 3631489 ( 0.00%) 3828122 ( 5.14%) 3781775 ( 3.97%)
read-8192 3175046 ( 0.00%) 3230268 ( 1.71%) 3247058 ( 2.22%)
read-16384 2923635 ( 0.00%) 2869911 (-1.87%) 2954684 ( 1.05%)
read-32768 2819040 ( 0.00%) 2839776 ( 0.73%) 2852152 ( 1.16%)
read-65536 2659324 ( 0.00%) 2827502 ( 5.95%) 2816464 ( 5.58%)
read-131072 2707652 ( 0.00%) 2727534 ( 0.73%) 2746406 ( 1.41%)
read-262144 2765929 ( 0.00%) 2782166 ( 0.58%) 2776125 ( 0.37%)
read-524288 2810803 ( 0.00%) 2822894 ( 0.43%) 2822394 ( 0.41%)
reread-64 5389653 ( 0.00%) 5860307 ( 8.03%) 5735102 ( 6.02%)
reread-128 5122535 ( 0.00%) 5325799 ( 3.82%) 5325799 ( 3.82%)
reread-256 3245838 ( 0.00%) 3285566 ( 1.21%) 3236056 (-0.30%)
reread-512 4340054 ( 0.00%) 4571003 ( 5.05%) 4742616 ( 8.49%)
reread-1024 4265934 ( 0.00%) 4356809 ( 2.09%) 4374559 ( 2.48%)
reread-2048 3915540 ( 0.00%) 4301837 ( 8.98%) 4338776 ( 9.75%)
reread-4096 3846119 ( 0.00%) 3984379 ( 3.47%) 3979764 ( 3.36%)
reread-8192 3257215 ( 0.00%) 3304518 ( 1.43%) 3325949 ( 2.07%)
reread-16384 2959519 ( 0.00%) 2892622 (-2.31%) 2995773 ( 1.21%)
reread-32768 2570835 ( 0.00%) 2607266 ( 1.40%) 2610783 ( 1.53%)
reread-65536 2731466 ( 0.00%) 2683809 (-1.78%) 2691957 (-1.47%)
reread-131072 2738144 ( 0.00%) 2763886 ( 0.93%) 2776056 ( 1.37%)
reread-262144 2781012 ( 0.00%) 2781786 ( 0.03%) 2784322 ( 0.12%)
reread-524288 2787049 ( 0.00%) 2779851 (-0.26%) 2787681 ( 0.02%)
randread-64 1204796 ( 0.00%) 1204796 ( 0.00%) 1143223 (-5.39%)
randread-128 4135958 ( 0.00%) 4012317 (-3.08%) 4135958 ( 0.00%)
randread-256 3454704 ( 0.00%) 3511189 ( 1.61%) 3511189 ( 1.61%)
randread-512 3437042 ( 0.00%) 3366987 (-2.08%) 3437042 ( 0.00%)
randread-1024 3301774 ( 0.00%) 3401130 ( 2.92%) 3403826 ( 3.00%)
randread-2048 3549844 ( 0.00%) 3391470 (-4.67%) 3477979 (-2.07%)
randread-4096 3214912 ( 0.00%) 3261295 ( 1.42%) 3258820 ( 1.35%)
randread-8192 2818958 ( 0.00%) 2836645 ( 0.62%) 2861450 ( 1.48%)
randread-16384 2571662 ( 0.00%) 2515924 (-2.22%) 2564465 (-0.28%)
randread-32768 2319848 ( 0.00%) 2331892 ( 0.52%) 2333594 ( 0.59%)
randread-65536 2288193 ( 0.00%) 2297103 ( 0.39%) 2301123 ( 0.56%)
randread-131072 2270669 ( 0.00%) 2275707 ( 0.22%) 2279150 ( 0.37%)
randread-262144 2258949 ( 0.00%) 2264700 ( 0.25%) 2259975 ( 0.05%)
randread-524288 2250529 ( 0.00%) 2240365 (-0.45%) 2242837 (-0.34%)
randwrite-64 942521 ( 0.00%) 939223 (-0.35%) 939223 (-0.35%)
randwrite-128 962469 ( 0.00%) 971174 ( 0.90%) 969421 ( 0.72%)
randwrite-256 980760 ( 0.00%) 980760 ( 0.00%) 966633 (-1.46%)
randwrite-512 1190529 ( 0.00%) 1138158 (-4.60%) 1040545 (-14.41%)
randwrite-1024 1361836 ( 0.00%) 1361836 ( 0.00%) 1367037 ( 0.38%)
randwrite-2048 1325646 ( 0.00%) 1364390 ( 2.84%) 1361794 ( 2.65%)
randwrite-4096 1360371 ( 0.00%) 1372653 ( 0.89%) 1363935 ( 0.26%)
randwrite-8192 1291680 ( 0.00%) 1305272 ( 1.04%) 1285206 (-0.50%)
randwrite-16384 1253666 ( 0.00%) 1255865 ( 0.18%) 1231889 (-1.77%)
randwrite-32768 1239139 ( 0.00%) 1250641 ( 0.92%) 1224239 (-1.22%)
randwrite-65536 1223186 ( 0.00%) 1228115 ( 0.40%) 1203094 (-1.67%)
randwrite-131072 1207002 ( 0.00%) 1215733 ( 0.72%) 1198229 (-0.73%)
randwrite-262144 1184954 ( 0.00%) 1201542 ( 1.38%) 1179145 (-0.49%)
randwrite-524288 96156 ( 0.00%) 96502 ( 0.36%) 95942 (-0.22%)
bkwdread-64 2923952 ( 0.00%) 3022727 ( 3.27%) 2772930 (-5.45%)
bkwdread-128 3785961 ( 0.00%) 3657016 (-3.53%) 3657016 (-3.53%)
bkwdread-256 3017775 ( 0.00%) 3052087 ( 1.12%) 3159869 ( 4.50%)
bkwdread-512 2875558 ( 0.00%) 2845081 (-1.07%) 2841317 (-1.21%)
bkwdread-1024 3083680 ( 0.00%) 3181915 ( 3.09%) 3140041 ( 1.79%)
bkwdread-2048 3266376 ( 0.00%) 3282603 ( 0.49%) 3281349 ( 0.46%)
bkwdread-4096 3207709 ( 0.00%) 3287506 ( 2.43%) 3248345 ( 1.25%)
bkwdread-8192 2777710 ( 0.00%) 2792156 ( 0.52%) 2777036 (-0.02%)
bkwdread-16384 2565614 ( 0.00%) 2570412 ( 0.19%) 2541795 (-0.94%)
bkwdread-32768 2472332 ( 0.00%) 2495631 ( 0.93%) 2284260 (-8.23%)
bkwdread-65536 2435202 ( 0.00%) 2391036 (-1.85%) 2361477 (-3.12%)
bkwdread-131072 2417850 ( 0.00%) 2453436 ( 1.45%) 2417903 ( 0.00%)
bkwdread-262144 2468467 ( 0.00%) 2491433 ( 0.92%) 2446649 (-0.89%)
bkwdread-524288 2513411 ( 0.00%) 2534789 ( 0.84%) 2486486 (-1.08%)

FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 0 0 0
Direct reclaim pages scanned 0 0 0
Direct reclaim write async I/O 0 0 0
Direct reclaim write sync I/O 0 0 0
Wake kswapd requests 0 0 0
Kswapd wakeups 0 0 0
Kswapd pages scanned 0 0 0
Kswapd reclaim write async I/O 0 0 0
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 0.00 0.00 0.00
Time kswapd awake (ms) 0.00 0.00 0.00

User/Sys Time Running Test (seconds) 14.54 14.43 14.51
Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
Total Elapsed Time (seconds) 106.39 104.49 107.15
Percentage Time kswapd Awake 0.00% 0.00% 0.00%

No big surprises in terms of performance. I know there are gains and losses
but I've always had trouble getting very stable figures out of IOZone so
I find it hard to draw conclusions from them. I should revisit what I'm
doing there to see what's wrong.

In terms of reclaim, nothing interesting happened.

SysBench
========
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
1 11025.01 ( 0.00%) 10249.52 (-7.57%) 10430.57 (-5.70%)
2 3844.63 ( 0.00%) 4988.95 (22.94%) 4038.95 ( 4.81%)
3 3210.23 ( 0.00%) 2918.52 (-9.99%) 3113.38 (-3.11%)
4 1958.91 ( 0.00%) 1987.69 ( 1.45%) 1808.37 (-8.32%)
5 2864.92 ( 0.00%) 3126.13 ( 8.36%) 2355.70 (-21.62%)
6 4831.63 ( 0.00%) 3815.67 (-26.63%) 4164.09 (-16.03%)
7 3788.37 ( 0.00%) 3140.39 (-20.63%) 3471.36 (-9.13%)
8 2293.61 ( 0.00%) 1636.87 (-40.12%) 1754.25 (-30.75%)
FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 9843 13398 51651
Direct reclaim pages scanned 871367 1008709 3080593
Direct reclaim write async I/O 24883 30699 0
Direct reclaim write sync I/O 0 0 0
Wake kswapd requests 7070819 6961672 11268341
Kswapd wakeups 1578 1500 943
Kswapd pages scanned 22016558 21779455 17393431
Kswapd reclaim write async I/O 1161346 1101641 1717759
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 26.11 45.04 2.97
Time kswapd awake (ms) 5105.06 5135.93 6086.32

User/Sys Time Running Test (seconds) 734.52 712.39 703.9
Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
Total Elapsed Time (seconds) 9710.02 9589.20 9334.45
Percentage Time kswapd Awake 0.06% 0.00% 0.00%

Unlike other sysbench results I post, this is the result of a read/write
test. As the machine is under-provisioned for the type of tests, figures
are very unstable. For example, for each of the thread counts from 1-8, the
test is run a minimum of 5 times. If the estimated mean is not within 1%,
it's run up to a maximum of 10 times trying to get a stable average. None
of these averages are stable with variances up to 15%. Part of the problem
is that larger thread counts push the test into swap as the memory is
insufficient. I could tune for this, but it was reclaim that was important.

To illustrate though, here is a graph of total io in comparison to swap io
as reported by vmstat. The update frequency was 2 seconds so the IO shown in
the graph is about the maximum capacity of the disk for the entire duration
of the test with swap kicking in every so often so this was heavily IO bound.

http://www.csn.ul.ie/~mel/postings/nodirect-20100614/totalio-swapio-comparison.ps

The writing back of dirty pages was a factor. It did happen, but it was
a negligible portion of the overall IO. For example, with just tracing a
total of 97MB or 2% of the pages scanned by direct reclaim was written back
and it was all async IO. The time stalled in direct reclaim was negligible
although as you'd expect, reduced by not writing back at all.

What is interesting is that kswapd wake requests were raised by direc reclaim
not writing back pages. My theory is that it's because the processes are
making forward progress meaning they need more memory faster and are not
calling congestion_wait as they would have before. kswapd is awake longer
as a result of direct reclaim not writing back pages.

Between 5-10% of the pages scanned by kswapd need to be written back based
on these three kernels. As the disk was maxed out all of the time, I'm having
trouble deciding whether this is "too much IO" or not but I'm leaning towards
"no". I'd expect that the flusher was also having time getting IO bandwidth.

Simple Writeback Test
=====================
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 1923 2394 2683
Direct reclaim pages scanned 246400 252256 282896
Direct reclaim write async I/O 0 0 0
Direct reclaim write sync I/O 0 0 0
Wake kswapd requests 1496401 1648245 1709870
Kswapd wakeups 1140 1118 1113
Kswapd pages scanned 10999585 10982748 10851473
Kswapd reclaim write async I/O 0 0 1398
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 0.01 0.01 0.01
Time kswapd awake (ms) 293.54 293.68 285.51

User/Sys Time Running Test (seconds) 105.17 102.81 104.34
Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
Total Elapsed Time (seconds) 638.75 640.63 639.42
Percentage Time kswapd Awake 0.04% 0.00% 0.00%

This test starting with 4 threads, doubling the number of threads on each
iteration up to 64. Each iteration writes 4*RAM amount of files to disk
using dd to dirty memory and conv=fsync to have some sort of stability in
the results.

Direc reclaim writeback was not a problem for this test even though a number
of pages were scanned so there is no reason not to disable it. kswapd
encountered some pages in the "nodirect" kernel but it's likely a timing
issue.

Stress HighAlloc
================
stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Pass 1 76.00 ( 0.00%) 77.00 ( 1.00%) 70.00 (-6.00%)
Pass 2 78.00 ( 0.00%) 78.00 ( 0.00%) 73.00 (-5.00%)
At Rest 80.00 ( 0.00%) 79.00 (-1.00%) 78.00 (-2.00%)

FTrace Reclaim Statistics
stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 1245 1247 1369
Direct reclaim pages scanned 180262 177032 164337
Direct reclaim write async I/O 35211 30075 0
Direct reclaim write sync I/O 17994 13127 0
Wake kswapd requests 4211 4868 4386
Kswapd wakeups 842 808 739
Kswapd pages scanned 41757111 33360435 9946823
Kswapd reclaim write async I/O 4872154 3154195 791840
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 9064.33 7249.29 2868.22
Time kswapd awake (ms) 6250.77 4612.99 1937.64

User/Sys Time Running Test (seconds) 2822.01 2812.45 2629.6
Percentage Time Spent Direct Reclaim 0.10% 0.00% 0.00%
Total Elapsed Time (seconds) 11365.05 9682.00 5210.19
Percentage Time kswapd Awake 0.02% 0.00% 0.00%

This test builds a large number of kernels simultaneously so that the total
workload is 1.5 times the size of RAM. It then attempts to allocate all of
RAM as huge pages. The metric is the percentage of memory allocated using
load (Pass 1), a second attempt under load (Pass 2) and when the kernel
compiles are finishes and the system is quiet (At Rest).

The success figures were comparaible with or without direct reclaim. I know
PPC64's success rates are hit a lot worse than this but I think it could be
improved in other means than what we have today so I'm less worried about it.

Unlike the other tests, synchronous direct writeback is a factor in this
test because of lumpy reclaim. This increases the stall time of a lumpy
reclaimer by quite a margin. Compare the "Time stalled direct reclaim"
between the vanilla and nodirect kernel - the nodirect kernel is stalled
less than a third of the time. Interestingly, the time kswapd is stalled
is significantly reduced as well and overall, the test completes a lot faster.

Whether this is because of improved IO patterns or just because lumpy
reclaim stalling on synchronous IO is a bad idea, I'm not sure but either
way, the patch looks like a good idea from the perspective of this test.

Based on this series of tests at least, there appears to be good reasons
for preventing direct reclaim writing back pages and it does not appear
we are currently spending a lot of our time in writeback. It remains to
be seen if it's still true with dirty ratio is higher (e.g. 40) or the
amount of available memory differs (e.g. 256MB) but the trace points and
post-processing script can be used to help figure it out.

Comments?

KOSAKI Motohiro (2):
vmscan: kill prev_priority completely
vmscan: simplify shrink_inactive_list()

Mel Gorman (11):
tracing, vmscan: Add trace events for kswapd wakeup, sleeping and
direct reclaim
tracing, vmscan: Add trace events for LRU page isolation
tracing, vmscan: Add trace event when a page is written
tracing, vmscan: Add a postprocessing script for reclaim-related
ftrace events
vmscan: Remove unnecessary temporary vars in do_try_to_free_pages
vmscan: Setup pagevec as late as possible in shrink_inactive_list()
vmscan: Setup pagevec as late as possible in shrink_page_list()
vmscan: Update isolated page counters outside of main path in
shrink_inactive_list()
vmscan: Write out dirty pages in batch
vmscan: Do not writeback pages in direct reclaim
fix: script formatting

.../trace/postprocess/trace-vmscan-postprocess.pl | 623 ++++++++++++++++++++
include/linux/mmzone.h | 15 -
include/trace/events/gfpflags.h | 37 ++
include/trace/events/kmem.h | 38 +--
include/trace/events/vmscan.h | 184 ++++++
mm/page_alloc.c | 2 -
mm/vmscan.c | 622 ++++++++++++--------
mm/vmstat.c | 2 -
8 files changed, 1223 insertions(+), 300 deletions(-)
create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
create mode 100644 include/trace/events/gfpflags.h
create mode 100644 include/trace/events/vmscan.h

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/