[RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)

From: Mel Gorman
Date: Wed Jul 13 2011 - 10:31:58 EST


(Revisting this from a year ago and following on from the thread
"Re: [PATCH 03/27] xfs: use write_cache_pages for writeback
clustering". Posting an prototype to see if anything obvious is
being missed)

Testing from the XFS folk revealed that there is still too much
I/O from the end of the LRU in kswapd. Previously it was considered
acceptable by VM people for a small number of pages to be written
back from reclaim with testing generally showing about 0.3% of pages
reclaimed were written back (higher if memory was really low). That
writing back a small number of pages is ok has been heavily disputed
for quite some time and Dave Chinner explained it well;

It doesn't have to be a very high number to be a problem. IO
is orders of magnitude slower than the CPU time it takes to
flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost
always a bad flush decision.

To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig

xfs tries to write it back if the requester is kswapd
ext4 ignores the request if it's a delayed allocation
btrfs ignores the request entirely

I think ext3 just writes back the page but I didn't double check.
Either way, each filesystem will have different performance
characteristics when under memory pressure and there are a lot of
dirty pages.

The objective of this series to for memory reclaim to play nicely
with writeback that is already in progress and throttle reclaimers
appropriately when dirty pages are encountered. The assumption is that
the flushers will always write pages faster than if reclaim issues
the IO. The problem is that reclaim has very little control over how
long before a page in a particular zone or container is cleaned.
This is a serious problem but as the behaviour of ->writepage is
filesystem-dependant, we are already faced with a situation where
reclaim has poor control over page cleaning.

A secondary goal is to avoid the problem whereby direct reclaim
splices two potentially deep call stacks together.

Patch 1 disables writeback of filesystem pages from direct reclaim
entirely. Anonymous pages are still written

Patch 2 disables writeback of filesystem pages from kswapd unless
the priority is raised to the point where kswapd is considered
to be in trouble.

Patch 3 throttles reclaimers if too many dirty pages are being
encountered and the zones or backing devices are congested.

Patch 4 invalidates dirty pages found at the end of the LRU so they
are reclaimed quickly after being written back rather than
waiting for a reclaimer to find them

Patch 5 tries to prioritise inodes backing dirty pages found at the end
of the LRU.

This is a prototype only and it's probable that I forgot or omitted
some issue brought up over the last year and a bit. I have not thought
about how this affects memcg and I have some concerns about patches
4 and 5. Patch 4 may reclaim too many pages as a reclaimer will skip
the dirty page, reclaim a clean page and later the dirty page gets
reclaimed anyway when writeback completes. I don't think it matters
but it's worth thinking about. Patch 5 is potentially a problem
because move_expired_inodes() is now walking the full delayed_queue
list. Is that a problem? I also have no double checked it's safe
to add I_DIRTY_RECLAIM or that the locking is correct. Basically,
patch 5 is a quick hack to see if it's worthwhile and may be rendered
unnecessary by Wu Fengguang or Jan Kara.

I consider this series to be orthogonal to the writeback work going
on at the moment so shout if that assumption is in error.

I tested this on ext3, ext4, btrfs and xfs using fs_mark and a micro
benchmark that does a streaming write to a large mapping (exercises
use-once LRU logic). The command line for fs_mark looked something like

./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760

The machine was booted with "nr_cpus=1 mem=512M" as according to Dave
this triggers the worst behaviour.

6 kernels are tested.

vanilla 3.0-rc6
nodirectwb-v1r3 patch 1
lesskswapdwb-v1r3p patches 1-2
throttle-v1r10 patches 1-3
immediate-v1r10 patches 1-4
prioinode-v1r10 patches 1-5

During testing, a number of monitors were running to gather information
from ftrace in particular. This disrupts the results of course because
recording the information generates IO in itself but I'm ignoring
that for the moment so the effect of the patches can be seen.

I've posted the raw reports for each filesystem at

http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext3/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext4/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-btrfs/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/comparison.html

As it was Dave and Christoph that brought this back up, here is the
XFS report in a bit more detail;

FS-Mark
fsmark-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
Files/s min 5.30 ( 0.00%) 5.10 (-3.92%) 5.40 ( 1.85%) 5.70 ( 7.02%) 5.80 ( 8.62%) 5.70 ( 7.02%)
Files/s mean 6.93 ( 0.00%) 6.96 ( 0.40%) 7.11 ( 2.53%) 7.52 ( 7.82%) 7.44 ( 6.83%) 7.48 ( 7.38%)
Files/s stddev 0.89 ( 0.00%) 0.99 (10.62%) 0.85 (-4.18%) 1.02 (13.23%) 1.08 (18.06%) 1.00 (10.72%)
Files/s max 8.10 ( 0.00%) 8.60 ( 5.81%) 8.20 ( 1.22%) 9.50 (14.74%) 9.00 (10.00%) 9.10 (10.99%)
Overhead min 6623.00 ( 0.00%) 6417.00 ( 3.21%) 6035.00 ( 9.74%) 6354.00 ( 4.23%) 6213.00 ( 6.60%) 6491.00 ( 2.03%)
Overhead mean 29678.24 ( 0.00%) 40053.96 (-25.90%) 18278.56 (62.37%) 16365.20 (81.35%) 11987.40 (147.58%) 15606.36 (90.17%)
Overhead stddev 68727.49 ( 0.00%) 116258.18 (-40.88%) 34121.42 (101.42%) 28963.27 (137.29%) 17221.33 (299.08%) 26231.50 (162.00%)
Overhead max 339993.00 ( 0.00%) 588147.00 (-42.19%) 148281.00 (129.29%) 140568.00 (141.87%) 77836.00 (336.81%) 124728.00 (172.59%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 34.97 35.31 31.16 30.47 29.85 29.66
Total Elapsed Time (seconds) 567.08 566.84 551.75 525.81 534.91 526.32

Average files per second is increased by a nice percentage albeit
just within the standard deviation. Consider the type of test this is,
variability was inevitable but will double check without monitoring.

The overhead (time spent in non-filesystem-related activities) is
reduced a *lot* and is a lot less variable. Time to completion is
improved across the board which is always good because it implies
that IO was consistently higher which is sortof visible 4 minutes into the test at

http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/blockio-comparison-sandy.png
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/blockio-comparison-smooth-sandy.png

kswapd CPU usage is also interesting

http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/kswapdcpu-comparison-smooth-sandy.png

Note how preventing kswapd reclaiming dirty pages pushes up its CPU
usage as it scans more pages but the throttle brings it back down
and reduced further by patches 4 and 5.

MMTests Statistics: vmstat
Page Ins 189840 196608 189864 128120 126148 151888
Page Outs 38439897 38420872 38422937 38395008 38367766 38396612
Swap Ins 19468 20555 20024 4933 3799 4588
Swap Outs 10019 10388 10353 4737 3617 4084
Direct pages scanned 4865170 4903030 1359813 408460 101716 199483
Kswapd pages scanned 8202014 8146467 16980235 19428420 14269907 14103872
Kswapd pages reclaimed 4700400 4665093 8205753 9143997 9449722 9358347
Direct pages reclaimed 4864514 4901411 1359368 407711 100520 198323
Kswapd efficiency 57% 57% 48% 47% 66% 66%
Kswapd velocity 14463.592 14371.722 30775.233 36949.506 26677.211 26797.142
Direct efficiency 99% 99% 99% 99% 98% 99%
Direct velocity 8579.336 8649.760 2464.546 776.821 190.155 379.015
Percentage direct scans 37% 37% 7% 2% 0% 1%
Page writes by reclaim 14511 14721 10387 4819 3617 4084
Page writes skipped 0 30 2300502 2774735 0 0
Page reclaim invalidate 0 0 0 0 5155 3509
Page reclaim throttled 0 0 0 65112 190 190
Slabs scanned 16512 17920 18048 17536 16640 17408
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 5180 5318 5177 5178 5179 5193
Kswapd skipped wait 131 0 4 44 0 0
Compaction stalls 2 2 0 0 5 1
Compaction success 2 2 0 0 2 1
Compaction failures 0 0 0 0 3 0
Compaction pages moved 0 0 0 0 1049 0
Compaction move failure 0 0 0 0 96 0

These stats are based on information from /proc/vmstat

"Kswapd efficiency" is the percentage of pages reclaimed to pages
scanned. The higher the percentage is the better because a low
percentage implies that kswapd is scanning uselessly. As the workload
dirties memory heavily and is a small machine, the efficiency starts
low at 57% but increases to 66% with all the patches applied.

"Kswapd velocity" is the average number of pages scanned per
second. The patches increase this as it's no longer getting blocked
on page writes so it's expected.

Direct reclaim work is significantly reduced going from 37% of all
pages scanned to 1% with all patches applied. This implies that
processes are getting stalled less.

Page writes by reclaim is what is motivating this series. It goes
from 14511 pages to 4084 which is a big improvement. We'll see later
if these were anonymous or file-backed pages.

"Page writes skipped" are dirty pages encountered at the end of the
LRU and only exists for patches 2, 3 and 4. It shows that kswapd is
encountering very large numbers of dirty pages (debugging showed they
weren't under writeback). The number of pages that get invalidated and
freed later is a more reasonable number and "page reclaim throttled"
shows that throttling is not a major problem.

FTrace Reclaim Statistics: vmscan
fsmark-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
Direct reclaims 89145 89785 24921 7546 1954 3747
Direct reclaim pages scanned 4865170 4903030 1359813 408460 101716 199483
Direct reclaim pages reclaimed 4864514 4901411 1359368 407711 100520 198323
Direct reclaim write file async I/O 0 0 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 3 1 0
Direct reclaim write file sync I/O 0 0 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0 0 0
Wake kswapd requests 11152 11021 21223 24029 26797 26672
Kswapd wakeups 421 397 761 778 776 742
Kswapd pages scanned 8202014 8146467 16980235 19428420 14269907 14103872
Kswapd pages reclaimed 4700400 4665093 8205753 9143997 9449722 9358347
Kswapd reclaim write file async I/O 4483 4286 0 1 0 0
Kswapd reclaim write anon async I/O 10027 10435 10387 4815 3616 4084
Kswapd reclaim write file sync I/O 0 0 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0 0 0
Time stalled direct reclaim (seconds) 0.26 0.25 0.08 0.05 0.04 0.08
Time kswapd awake (seconds) 493.26 494.05 430.09 420.52 428.55 428.81

Total pages scanned 13067184 13049497 18340048 19836880 14371623 14303355
Total pages reclaimed 9564914 9566504 9565121 9551708 9550242 9556670
%age total pages scanned/reclaimed 73.20% 73.31% 52.15% 48.15% 66.45% 66.81%
%age total pages scanned/written 0.11% 0.11% 0.06% 0.02% 0.03% 0.03%
%age file pages scanned/written 0.03% 0.03% 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 0.74% 0.70% 0.26% 0.16% 0.13% 0.27%
Percentage Time kswapd Awake 86.98% 87.16% 77.95% 79.98% 80.12% 81.47%

This is based on information from the vmscan tracepoints introduced
the last time this issue came up.

Direct reclaim writes were never a problem according to this.

kswapd writes of file-backed pages on the other hand went from 4483 to
0 which is nice and part of the objective after all. The page writes of
4084 recorded from /proc/vmstat with all patches applied iwas clearly
due to writing anonymous pages as there is a direct correlation there.

Time spent in direct reclaim is reduced quite a bit as well as the
time kswapd spent awake.

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 0 0 0 0 0 0
Direct time congest waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full congest waited 0 0 0 0 0 0
Direct number conditional waited 0 1 0 56 8 0
Direct time conditional waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0 0 0
KSwapd number congest waited 4 0 1 0 6 0
KSwapd time congest waited 400ms 0ms 100ms 0ms 501ms 0ms
KSwapd full congest waited 4 0 1 0 5 0
KSwapd number conditional waited 0 0 0 65056 189 190
KSwapd time conditional waited 0ms 0ms 0ms 1ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0 0 0

This is based on some of the writeback tracepoints. It's interesting
to note that while kswapd got throttled 190 times with all patches
applied, it spent negligible time asleep so probably just called
cond_resched(). This implies that neither the zone or the backing
device was congested. As there is only once source of IO, this is
expected. With multiple processes, this picture might change.


MICRO
micro-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 6.95 7.2 6.84 6.33 5.97 6.13
Total Elapsed Time (seconds) 56.34 65.04 66.53 63.24 52.48 63.00

This is a test that just writes a mapping. Unfortunately, the time to
completion is increased by the series. Again I'll have to run without
any monitoring to confirm it's a problem.

MMTests Statistics: vmstat
Page Ins 46928 50660 48504 42888 42648 43036
Page Outs 4990816 4994987 4987572 4999242 4981324 4990627
Swap Ins 2573 3234 2470 1396 1352 1297
Swap Outs 2316 2578 2360 937 912 873
Direct pages scanned 1834430 2016994 1623675 1843754 1922668 1941916
Kswapd pages scanned 1399007 1272637 1842874 1810867 1425366 1426536
Kswapd pages reclaimed 637708 657418 860512 884531 906608 927206
Direct pages reclaimed 536567 517876 314115 289472 272265 252361
Kswapd efficiency 45% 51% 46% 48% 63% 64%
Kswapd velocity 24831.505 19566.990 27699.895 28634.836 27160.175 22643.429
Direct efficiency 29% 25% 19% 15% 14% 12%
Direct velocity 32559.993 31011.593 24405.156 29154.870 36636.204 30824.063
Percentage direct scans 56% 61% 46% 50% 57% 57%
Page writes by reclaim 2706 2910 2416 969 912 873
Page writes skipped 0 12640 148339 166844 0 0
Page reclaim invalidate 0 0 0 0 12 58
Page reclaim throttled 0 0 0 4788 7 9
Slabs scanned 4096 5248 5120 6656 4480 16768
Direct inode steals 531 1189 348 1166 700 3783
Kswapd inode steals 164 0 349 0 0 9
Kswapd skipped wait 78 35 74 51 14 10
Compaction stalls 0 0 1 0 0 0
Compaction success 0 0 1 0 0 0
Compaction failures 0 0 0 0 0 0
Compaction pages moved 0 0 0 0 0 0
Compaction move failure 0 0 0 0 0 0

Kswapd efficiency up but kswapd was doing less work according to kswapd velocity.

Direct reclaim efficiency is worse as well.

It's writing fewer pages at least.

FTrace Reclaim Statistics: vmscan
micro-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
Direct reclaims 9823 9477 5737 5347 5078 4720
Direct reclaim pages scanned 1834430 2016994 1623675 1843754 1922668 1941916
Direct reclaim pages reclaimed 536567 517876 314115 289472 272265 252361
Direct reclaim write file async I/O 0 0 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0 16 0
Direct reclaim write file sync I/O 0 0 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0 0 0
Wake kswapd requests 1636 1692 2177 2403 2707 2757
Kswapd wakeups 28 29 30 34 15 23
Kswapd pages scanned 1399007 1272637 1842874 1810867 1425366 1426536
Kswapd pages reclaimed 637708 657418 860512 884531 906608 927206
Kswapd reclaim write file async I/O 380 332 56 32 0 0
Kswapd reclaim write anon async I/O 2326 2578 2360 937 896 873
Kswapd reclaim write file sync I/O 0 0 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0 0 0
Time stalled direct reclaim (seconds) 2.06 2.10 1.62 2.65 2.25 1.86
Time kswapd awake (seconds) 49.44 56.39 54.31 55.45 47.00 56.74

Total pages scanned 3233437 3289631 3466549 3654621 3348034 3368452
Total pages reclaimed 1174275 1175294 1174627 1174003 1178873 1179567
%age total pages scanned/reclaimed 36.32% 35.73% 33.88% 32.12% 35.21% 35.02%
%age total pages scanned/written 0.08% 0.09% 0.07% 0.03% 0.03% 0.03%
%age file pages scanned/written 0.01% 0.01% 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 22.86% 22.58% 19.15% 29.51% 27.37% 23.28%
Percentage Time kswapd Awake 87.75% 86.70% 81.63% 87.68% 89.56% 90.06%

Again, writes of file pages are reduced but kswapd is clearly awake
for longer.

What is interesting is that the number of pages written without the
patches was already quite low. This means there is relatively little room
for improvement in this benchmark.

FTrace Reclaim Statistics: congestion_wait
Generating ftrace report ftrace-3.0.0-rc6-prioinode-v1r10-micro-congestion.report
Direct number congest waited 0 0 0 0 0 0
Direct time congest waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full congest waited 0 0 0 0 0 0
Direct number conditional waited 768 793 704 1359 608 674
Direct time conditional waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0 0 0
KSwapd number congest waited 41 22 58 43 78 92
KSwapd time congest waited 2937ms 2200ms 4543ms 4300ms 7800ms 9200ms
KSwapd full congest waited 29 22 45 43 78 92
KSwapd number conditional waited 0 0 0 4284 4 9
KSwapd time conditional waited 0ms 0ms 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0 0 0

Some throttling but little time sleep.

The objective of the series - reducing writes from reclaim - is
met with filesystem writes from reclaim reduced to 0 with reclaim
in general doing less work. ext3, ext4 and xfs all showed marked
improvements for fs_mark in this configuration. btrfs looked worse
but it's within the noise and I'd expect the patches to have little
or no impact there due it ignoring ->writepage from reclaim.

I'm rerunning the tests without monitors at the moment to verify the
performance improvements which will take about 6 hours to complete
but so far it looks promising.

Comments?

fs/fs-writeback.c | 56 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/fs.h | 5 ++-
include/linux/mmzone.h | 2 +
include/linux/writeback.h | 1 +
mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++++--
mm/vmstat.c | 2 +
6 files changed, 115 insertions(+), 6 deletions(-)

--
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/