[patch 00/10] mm: thrash detection-based file cache sizing

From: Johannes Weiner
Date: Thu May 30 2013 - 14:05:23 EST


The VM maintains cached filesystem pages on two types of lists. One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
proven to benefit from caching in the past. We call the recently used
list "inactive list" and the frequently used list "active list".

The tricky part of this model is finding the right balance between
the two lists. A big inactive list may not leave enough room for the
active list to protect all the frequently used pages. A big active
list may not leave enough room on the inactive list for a new set of
frequently used pages, the "working set", to establish itself,
because the young pages get pushed out of memory before they have a
chance to get promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list. This model gave established
working sets more grace time in the face of temporary use-once
streams, but was not satisfactory when use-once streaming persisted
over longer periods of time and the established working set was
temporarily suspended, like a nightly backup evicting all the
interactive user program data.

Subsequently, the rules were changed to only age active pages when
they exceed the number of inactive pages, i.e. to leave the working
set alone as long as the other half of memory consists of
easy-to-reclaim use-once pages. This works well until working set
transitions exceed half of memory in size and the average access
distance between the pages of the new working set is bigger than the
inactive list. The VM then mistakes the thrashing new working set for
use-once streaming, while the unused old working set pages are stuck
on the active list.
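
The aging rule boils down to a simple size comparison, roughly as in
inactive_file_is_low() in mm/vmscan.c:

	static bool inactive_file_is_low(struct lruvec *lruvec)
	{
		unsigned long inactive, active;

		inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
		active = get_lru_size(lruvec, LRU_ACTIVE_FILE);

		/* age active pages only once they outnumber inactive */
		return active > inactive;
	}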

This happens on file servers and media streaming servers, where the
popular set of files changes over time. Even though the individual
files might be smaller than half of memory, concurrent access to many
of them may still result in their inter-reference distance being
greater than half of memory. It's also been reported on database
workloads that switch back and forth between tables that are bigger
than half of memory.

This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to tell actual
use-once streaming from inactive list thrashing, and to adjust the
balance between the lists accordingly.
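
Greatly simplified, the mechanism works like this (the helper names
below are illustrative, the real code is in patch #9): evictions and
activations advance a clock, an evicted page leaves behind a shadow
entry stamped with the current clock reading, and on refault the
distance between then and now tells whether the page would have
stayed resident had the lists been balanced differently:

	static atomic_long_t ws_clock; /* evictions + activations */

	void *ws_remember_eviction(void)
	{
		/* pack the clock reading into a shadow entry */
		return pack_shadow(atomic_long_read(&ws_clock));
	}

	bool ws_refault_was_thrashing(void *shadow)
	{
		unsigned long refault_distance;

		refault_distance = atomic_long_read(&ws_clock) -
				   unpack_shadow(shadow);
		/*
		 * The page could have stayed in memory had the
		 * active list been that much smaller, so challenge
		 * the active list.
		 */
		return refault_distance <= nr_active_file_pages();
	}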

Version 2 of this series includes many updates to the comments,
documentation, code structure, and eviction history tracking in
response to Peter Zijlstra, Rik van Riel, Minchan Kim, Andrea
Arcangeli, Andrew Morton, and Mel Gorman. Thanks a lot!!

wschange - test adaptiveness to new workingsets
-----------------------------------------------

On a 16G machine, a sequence of 12G files is read into the cache.
Every file is read repeatedly until fully cached in memory, then the
test moves on to the next file to show how quickly the VM adapts to a
new working set.
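
One pass of the test boils down to timing a sequential read of a
file; a minimal userspace sketch of such a pass (hypothetical, the
actual test harness is not part of this series):

	#include <fcntl.h>
	#include <time.h>
	#include <unistd.h>

	static double time_read(const char *path)
	{
		static char buf[1 << 20];
		struct timespec t0, t1;
		int fd = open(path, O_RDONLY);

		clock_gettime(CLOCK_MONOTONIC, &t0);
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		clock_gettime(CLOCK_MONOTONIC, &t1);
		close(fd);

		return (t1.tv_sec - t0.tv_sec) +
		       (t1.tv_nsec - t0.tv_nsec) / 1e9;
	}

A file counts as fully cached once this time drops to the in-memory
level, around 4.2 seconds in the runs below.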

--- vanilla:
Dropping caches...
Reading files until fully cached (+2 reads for activation):
data-1 (1): 9.32 4.48 4.14 4.10
data-2 (1): 9.73 9.95 10.00 10.00 9.99 9.75 9.99 9.56 10.04 9.56 10.02
9.62 10.02 9.56 10.06 9.57 10.04 9.58 9.74 10.06 10.04 10.04 10.06
9.60 10.07 10.07 9.70 10.03 10.07 9.65
ERROR: data-2 not fully cached after reading it 30x

The vanilla kernel never adapts to new working sets with
inter-reference distances bigger than half of memory. The active list
is simply not challenged as long as it is bigger than the inactive
list, i.e. effectively half of memory in size, which does not give
the new pages enough time for activation. As a result, they thrash on
the inactive list, which the VM mistakes for "plenty of use-once
cache", and the stale cache is protected indefinitely.

--- patched:
Dropping caches...
Reading files until fully cached (+2 reads for activation):
data-1 (1): 9.41 4.58 4.21 4.16
data-2 (1): 9.58 10.00 9.72 10.22 5.77 4.29 4.22 4.20
data-3 (1): 9.71 9.71 10.13 10.25 6.08 4.42 4.19 4.17
data-1 (2): 10.00 9.79 10.32 7.53 4.49 4.21 4.18
data-2 (2): 10.02 10.27 9.10 4.64 4.25 4.19
data-3 (2): 10.02 10.33 9.14 4.66 4.25 4.21
data-1 (3): 10.04 10.35 9.18 4.67 4.27 4.22
data-2 (3): 10.08 10.36 9.33 4.72 4.26 4.23
data-3 (3): 10.09 10.41 9.31 4.72 4.29 4.24
...

The patched kernel detects the thrashing on the inactive list and
challenges the stale cache on the active list, which is eventually
evicted to make room for the new working set.

wsprotect - test protection of workingset in presence of streaming
------------------------------------------------------------------

Streaming data does not benefit from caching, and repeatedly accessed
data that is bigger than memory cannot reasonably be cached at this
point. That's why the VM needs to protect an existing working set in
the presence of such streaming / uncachable competitor sets.

On a 16G machine, a 4G file is read into cache. When a 17G file is
then read repeatedly, the 4G working set data should remain cached as
much as possible.
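
The "mincore:" lines in the output below reflect a residency check
along these lines (hypothetical sketch of the measurement, not part
of this series):

	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* count how many pages of an open file are resident */
	static long resident_pages(int fd, size_t size)
	{
		long psize = sysconf(_SC_PAGESIZE);
		size_t i, pages = (size + psize - 1) / psize;
		unsigned char *vec = malloc(pages);
		void *map = mmap(NULL, size, PROT_READ, MAP_SHARED,
				 fd, 0);
		long resident = 0;

		mincore(map, size, vec);
		for (i = 0; i < pages; i++)
			resident += vec[i] & 1;

		munmap(map, size);
		free(vec);
		return resident;
	}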

--- vanilla:
Dropping caches...
Caching workingset file 'ws':
3.13
1.49
1.37
1.36
1.37
mincore: ws: 1048576/1048576 (100%)
Repeatedly streaming bigger than memory file 'stream':
13.46
14.09
14.09
14.14
14.09
14.00
13.45
13.43
13.47
14.04
mincore: ws: 1048576/1048576 (100%)

--- patched:
Dropping caches...
Caching workingset file 'ws':
3.18
1.56
1.43
1.41
1.41
mincore: ws: 1048576/1048576 (100%)
Repeatedly streaming bigger than memory file 'stream':
13.45
13.66
13.69
13.75
13.85
13.83
13.95
14.36
14.38
14.40
mincore: ws: 1048576/1048576 (100%)

The patched kernel observes refaulting streaming pages, but recognizes
that the set is bigger than memory and could never be fully cached.
As a result, it continues to protect the existing cache.

pft - page fault overhead
-------------------------

The zone round-robin allocator (RRALLOC) adds some overhead that
shows up in this microbenchmark, which serves tmpfs faults purely out
of memory. There is no significant impact from the remaining
workingset patches:

pft
BASE RRALLOC WORKINGSET
User 1 0.0235 ( 0.00%) 0.0275 (-17.02%) 0.0270 (-14.89%)
User 2 0.0275 ( 0.00%) 0.0275 ( -0.00%) 0.0285 ( -3.64%)
User 3 0.0330 ( 0.00%) 0.0365 (-10.61%) 0.0335 ( -1.52%)
User 4 0.0390 ( 0.00%) 0.0390 ( 0.00%) 0.0380 ( 2.56%)
System 1 0.2645 ( 0.00%) 0.2620 ( 0.95%) 0.2625 ( 0.76%)
System 2 0.3215 ( 0.00%) 0.3310 ( -2.95%) 0.3285 ( -2.18%)
System 3 0.3935 ( 0.00%) 0.4080 ( -3.68%) 0.4130 ( -4.96%)
System 4 0.4920 ( 0.00%) 0.5030 ( -2.24%) 0.5045 ( -2.54%)
Elapsed 1 0.2905 ( 0.00%) 0.2905 ( 0.00%) 0.2905 ( 0.00%)
Elapsed 2 0.1800 ( 0.00%) 0.1800 ( 0.00%) 0.1800 ( 0.00%)
Elapsed 3 0.1500 ( 0.00%) 0.1600 ( -6.67%) 0.1600 ( -6.67%)
Elapsed 4 0.1305 ( 0.00%) 0.1420 ( -8.81%) 0.1415 ( -8.43%)
Faults/cpu 1 667251.7997 ( 0.00%) 666296.4749 ( -0.14%) 667880.8099 ( 0.09%)
Faults/cpu 2 551464.0345 ( 0.00%) 536113.4630 ( -2.78%) 538286.2087 ( -2.39%)
Faults/cpu 3 452403.4425 ( 0.00%) 433856.5320 ( -4.10%) 432193.9888 ( -4.47%)
Faults/cpu 4 362691.4491 ( 0.00%) 356514.8821 ( -1.70%) 356436.5711 ( -1.72%)
Faults/sec 1 663612.5980 ( 0.00%) 662501.4959 ( -0.17%) 664037.3123 ( 0.06%)
Faults/sec 2 1096166.5317 ( 0.00%) 1064679.7154 ( -2.87%) 1068906.1040 ( -2.49%)
Faults/sec 3 1272925.4995 ( 0.00%) 1209241.9167 ( -5.00%) 1202868.9190 ( -5.50%)
Faults/sec 4 1437691.1054 ( 0.00%) 1362549.9877 ( -5.23%) 1381633.9889 ( -3.90%)

BASE RRALLOC WORKINGSET
User 2.53 2.63 2.59
System 34.01 34.94 35.08
Elapsed 18.93 19.49 19.52

kernbench - impact on kernel hacker workloads
---------------------------------------------

In a workload that is not purely allocator bound and also does some
computation and IO, the added allocator overhead is in the noise:

BASE RRALLOC WORKINGSET
User min 1163.95 ( 0.00%) 1131.79 ( 2.76%) 1123.41 ( 3.48%)
User mean 1170.76 ( 0.00%) 1139.68 ( 2.65%) 1125.63 ( 3.85%)
User stddev 6.38 ( 0.00%) 7.91 (-24.00%) 1.37 ( 78.60%)
User max 1182.17 ( 0.00%) 1149.63 ( 2.75%) 1127.55 ( 4.62%)
User range 18.22 ( 0.00%) 17.84 ( 2.09%) 4.14 ( 77.28%)
System min 79.97 ( 0.00%) 80.13 ( -0.20%) 78.21 ( 2.20%)
System mean 80.55 ( 0.00%) 80.68 ( -0.16%) 78.93 ( 2.01%)
System stddev 0.80 ( 0.00%) 0.55 ( 31.73%) 0.44 ( 44.91%)
System max 82.11 ( 0.00%) 81.38 ( 0.89%) 79.33 ( 3.39%)
System range 2.14 ( 0.00%) 1.25 ( 41.59%) 1.12 ( 47.66%)
Elapsed min 319.04 ( 0.00%) 310.75 ( 2.60%) 307.69 ( 3.56%)
Elapsed mean 320.98 ( 0.00%) 313.65 ( 2.28%) 309.33 ( 3.63%)
Elapsed stddev 2.37 ( 0.00%) 2.27 ( 4.37%) 1.40 ( 40.92%)
Elapsed max 325.52 ( 0.00%) 316.83 ( 2.67%) 311.69 ( 4.25%)
Elapsed range 6.48 ( 0.00%) 6.08 ( 6.17%) 4.00 ( 38.27%)
CPU min 388.00 ( 0.00%) 386.00 ( 0.52%) 386.00 ( 0.52%)
CPU mean 389.40 ( 0.00%) 388.60 ( 0.21%) 389.00 ( 0.10%)
CPU stddev 0.80 ( 0.00%) 1.50 (-87.08%) 1.55 (-93.65%)
CPU max 390.00 ( 0.00%) 390.00 ( 0.00%) 390.00 ( 0.00%)
CPU range 2.00 ( 0.00%) 4.00 (-100.00%) 4.00 (-100.00%)

BASE RRALLOC WORKINGSET
User 7009.94 6821.10 6755.85
System 489.88 490.82 481.82
Elapsed 1974.68 1930.58 1909.76

micro - reclaim micro benchmark
-------------------------------

This multi-threaded micro benchmark creates memory pressure with a
mix of anonymous and mapped file memory. By spreading memory more
evenly among the available nodes, the round-robin allocator greatly
improves reclaim behavior in terms of overall IO, swapping,
efficiency, direct reclaim invocations, and reclaim writeback:

BASE RRALLOC WORKINGSET
User 558.11 566.39 564.37
System 28.36 25.60 24.29
Elapsed 394.70 387.38 386.07

BASE RRALLOC WORKINGSET
Page Ins 6853744 5764336 5672052
Page Outs 12136640 10673568 10617640
Swap Ins 0 0 0
Swap Outs 6702 0 0
Direct pages scanned 1751264 176965 238264
Kswapd pages scanned 4832689 3751475 3595031
Kswapd pages reclaimed 2347185 2325232 2239671
Direct pages reclaimed 419104 176226 236990
Kswapd efficiency 48% 61% 62%
Kswapd velocity 12243.955 9684.225 9311.863
Direct efficiency 23% 99% 99%
Direct velocity 4436.950 456.825 617.152
Percentage direct scans 26% 4% 6%
Page writes by reclaim 661863 10182 11310
Page writes file 655161 10182 11310
Page writes anon 6702 0 0
Page reclaim immediate 1083840 15373 24797
Page rescued immediate 0 0 0
Slabs scanned 10240 13312 11776
Direct inode steals 0 0 0
Kswapd inode steals 0 0 0
Kswapd skipped wait 0 0 0
THP fault alloc 2057 2249 3163
THP collapse alloc 0 0 0
THP splits 6 0 0
THP fault fallback 5824 5630 4719
THP collapse fail 0 0 0
Compaction stalls 551 484 610
Compaction success 105 47 91
Compaction failures 446 437 484
Page migrate success 176065 103764 135377
Page migrate failure 0 0 0
Compaction pages isolated 443314 263699 346198
Compaction migrate scanned 687684 598263 640277
Compaction free scanned 14437356 5061851 4744974
Compaction cost 195 116 151
NUMA PTE updates 0 0 0
NUMA hint faults 0 0 0
NUMA hint local faults 0 0 0
NUMA pages migrated 0 0 0
AutoNUMA cost 0 0 0

memcachetest - streaming IO impact on anonymous workingset
----------------------------------------------------------

This test runs a latency-sensitive in-core workload that is
accompanied by use once page cache streams of increasing size in the
background.

It too shows great improvements in allocation/reclaim behavior. The
in-core workload is much less affected by the background IO, even
though IO throughput itself increased. The same reclaim improvements
show as before: reduced swapping, fewer page faults, increased
reclaim efficiency, and less writeback from reclaim:

BASE RRALLOC WORKINGSET
Ops memcachetest-0M 15294.00 ( 0.00%) 15492.00 ( 1.29%) 16420.00 ( 7.36%)
Ops memcachetest-375M 15574.00 ( 0.00%) 15510.00 ( -0.41%) 16602.00 ( 6.60%)
Ops memcachetest-1252M 8908.00 ( 0.00%) 15733.00 ( 76.62%) 16640.00 ( 86.80%)
Ops memcachetest-2130M 2652.00 ( 0.00%) 16089.00 (506.67%) 16764.00 (532.13%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-375M 6.00 ( 0.00%) 5.00 ( 16.67%) 6.00 ( 0.00%)
Ops io-duration-1252M 52.00 ( 0.00%) 17.00 ( 67.31%) 17.00 ( 67.31%)
Ops io-duration-2130M 124.00 ( 0.00%) 30.00 ( 75.81%) 30.00 ( 75.81%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-375M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-1252M 169167.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-2130M 278835.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-375M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-1252M 78117.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-2130M 135073.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops minorfaults-0M 776489.00 ( 0.00%) 779312.00 ( -0.36%) 783329.00 ( -0.88%)
Ops minorfaults-375M 778665.00 ( 0.00%) 780201.00 ( -0.20%) 784954.00 ( -0.81%)
Ops minorfaults-1252M 898776.00 ( 0.00%) 781391.00 ( 13.06%) 785025.00 ( 12.66%)
Ops minorfaults-2130M 838654.00 ( 0.00%) 782741.00 ( 6.67%) 785580.00 ( 6.33%)
Ops majorfaults-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops majorfaults-375M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops majorfaults-1252M 10916.00 ( 0.00%) 38.00 ( 99.65%) 38.00 ( 99.65%)
Ops majorfaults-2130M 19278.00 ( 0.00%) 38.00 ( 99.80%) 38.00 ( 99.80%)

BASE RRALLOC WORKINGSET
User 521.34 654.91 671.03
System 1694.60 2181.44 2157.61
Elapsed 4781.91 4701.73 4700.31

BASE RRALLOC WORKINGSET
Page Ins 3609444 18304 18296
Page Outs 23111464 19283920 19285644
Swap Ins 831734 0 0
Swap Outs 950459 0 0
Direct pages scanned 354478 0 1061
Kswapd pages scanned 6490315 2808074 2875760
Kswapd pages reclaimed 3116126 2808050 2875738
Direct pages reclaimed 324821 0 1061
Kswapd efficiency 48% 99% 99%
Kswapd velocity 1357.264 597.243 611.823
Direct efficiency 91% 100% 100%
Direct velocity 74.129 0.000 0.226
Percentage direct scans 5% 0% 0%
Page writes by reclaim 2088376 0 0
Page writes file 1137917 0 0
Page writes anon 950459 0 0
Page reclaim immediate 195121 0 0
Page rescued immediate 0 0 0
Slabs scanned 35328 0 0
Direct inode steals 0 0 0
Kswapd inode steals 19613 0 0
Kswapd skipped wait 0 0 0
THP fault alloc 8 3 0
THP collapse alloc 2495 871 1025
THP splits 18 10 7
THP fault fallback 0 0 0
THP collapse fail 24 65 59
Compaction stalls 66 2 2
Compaction success 45 0 0
Compaction failures 21 2 2
Page migrate success 39331 0 0
Page migrate failure 0 0 0
Compaction pages isolated 84996 0 0
Compaction migrate scanned 59149 0 0
Compaction free scanned 916327 0 0
Compaction cost 42 0 0
NUMA PTE updates 0 0 0
NUMA hint faults 0 0 0
NUMA hint local faults 0 0 0
NUMA pages migrated 0 0 0
AutoNUMA cost 0 0 0

---

Patch #1 solves a fairness problem we have with the per-zone LRU
lists, where the time a file cache page stays in memory depends on
the zone it is allocated from. The proposed solution is a very simple
(and maybe too crude) round-robin allocator. The problem exists
without this patch series, but the thrash detection fundamentally
relies on fair aging, so the patch is included here.
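
The crude idea, as an illustrative sketch (pseudo-C, see patch #1 for
the real thing): every zone gets an allocation batch, and allocations
cycle through the zones so that cache pages spread evenly and age at
the same speed regardless of placement:

	for_each_zone_in_zonelist(zone, zonelist) {
		if (atomic_read(&zone->alloc_batch) <= 0)
			continue;	/* zone had its turn */
		page = try_to_allocate_from(zone);
		if (page) {
			atomic_dec(&zone->alloc_batch);
			return page;
		}
	}
	/* all batches exhausted: reset them and retry */
	reset_alloc_batches(zonelist);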

Patches #2-#6 prepare the page cache radix tree for non-page entries
that represent evicted pages.
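
Shadow entries share radix tree slots with the struct page pointers
they replace and are told apart by the exceptional entry bit, as in
include/linux/radix-tree.h:

	static inline int radix_tree_exceptional_entry(void *arg)
	{
		/* bit 1 marks exceptional, i.e. non-page, entries */
		return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
	}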

Patch #7 prepares the page cache allocation path for passing refault
information from the fault handler down to the page allocator, which
will later use it to prime the reclaim scanner for list rebalancing.

Patch #9 is the thrash detection code.

Patch #10 keeps the eviction history in check, both by throttling the
number of non-page entries remembered in the radix trees when the
per-file refault ratio is very small, and by adding a shrinker that
trims those entries when they grow excessive nonetheless.
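
The shrinker hookup could look along these lines (a sketch assuming
the count/scan shrinker interface; the counter and prune helper are
illustrative, see patch #10 for the real code):

	static unsigned long count_shadow_entries(struct shrinker *s,
						  struct shrink_control *sc)
	{
		return nr_shadow_entries;	/* illustrative counter */
	}

	static unsigned long scan_shadow_entries(struct shrinker *s,
						 struct shrink_control *sc)
	{
		return prune_shadow_entries(sc->nr_to_scan);
	}

	static struct shrinker shadow_shrinker = {
		.count_objects	= count_shadow_entries,
		.scan_objects	= scan_shadow_entries,
		.seeks		= DEFAULT_SEEKS,
	};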

fs/btrfs/compression.c | 9 +-
fs/cachefiles/rdwr.c | 25 ++-
fs/ceph/xattr.c | 2 +-
fs/inode.c | 8 +-
fs/logfs/readwrite.c | 9 +-
fs/nfs/blocklayout/blocklayout.c | 2 +-
fs/nilfs2/inode.c | 4 +-
fs/ntfs/file.c | 10 +-
fs/splice.c | 9 +-
include/linux/fs.h | 3 +
include/linux/gfp.h | 18 +-
include/linux/mm.h | 8 +
include/linux/mmzone.h | 9 +
include/linux/pagemap.h | 59 ++++--
include/linux/pagevec.h | 3 +
include/linux/radix-tree.h | 5 +-
include/linux/shmem_fs.h | 1 +
include/linux/swap.h | 9 +
include/linux/vm_event_item.h | 1 +
include/linux/writeback.h | 1 +
lib/radix-tree.c | 105 +++------
mm/Makefile | 2 +-
mm/filemap.c | 289 ++++++++++++++++++++-----
mm/memcontrol.c | 3 +
mm/mempolicy.c | 17 +-
mm/mincore.c | 20 +-
mm/mmzone.c | 1 +
mm/page-writeback.c | 2 +-
mm/page_alloc.c | 90 +++++---
mm/readahead.c | 12 +-
mm/shmem.c | 122 +++--------
mm/swap.c | 22 ++
mm/truncate.c | 78 +++++--
mm/vmscan.c | 45 +++-
mm/vmstat.c | 4 +
mm/workingset.c | 423 +++++++++++++++++++++++++++++++++++++
net/ceph/pagelist.c | 4 +-
net/ceph/pagevec.c | 2 +-
38 files changed, 1083 insertions(+), 353 deletions(-)
