[RFC PATCH 0/1] mm: Support multiple kswapd threads per node

From: Buddy Lumpkin
Date: Mon Apr 02 2018 - 05:25:13 EST


I created this patch to address performance problems we are seeing in
Oracle Cloud Infrastructure. We run the Oracle Linux UEK4 kernel
internally, which is based on upstream 4.1. I developed and tested the
patch against both UEK4 and the latest upstream kernel, and was able to
show substantial benefits in both, using workloads that provide a mix of
anonymous memory allocations and filesystem writes.

As I went through the process of getting this patch approved internally, I
learned that it is hard to come up with a concise set of test results that
clearly demonstrate that devoting more threads to proactive page
replacement is actually necessary. At the time I was more focused on the
impact that direct reclaims have on latency, so I came up with a systemtap
script that measures the latency of each direct reclaim. On systems doing
large volumes of filesystem IO, I saw order-0 allocations regularly taking
over 10ms, and occasionally over 100ms. Since large volumes of direct
reclaims were being triggered as a side effect of the filesystem IO, I
figured this had to have a substantial impact on throughput.
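
The measurement itself does not depend on systemtap. For anyone who wants
to reproduce it without systemtap, a roughly equivalent approach is a small
kretprobe module that timestamps entry and exit of try_to_free_pages().
The sketch below is illustrative only (the 10ms reporting threshold and all
names are arbitrary), not the script we actually used:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ktime.h>
#include <linux/sched.h>

struct tfp_data {
	u64 entry_ns;
};

/* called on entry to try_to_free_pages(): record a timestamp */
static int tfp_entry(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	struct tfp_data *d = (struct tfp_data *)ri->data;

	d->entry_ns = ktime_get_ns();
	return 0;
}

/* called on return: report direct reclaims that took longer than 10ms */
static int tfp_ret(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	struct tfp_data *d = (struct tfp_data *)ri->data;
	u64 delta_us = (ktime_get_ns() - d->entry_ns) / NSEC_PER_USEC;

	if (delta_us > 10000)
		pr_info("direct reclaim took %llu us (comm %s)\n",
			delta_us, current->comm);
	return 0;
}

static struct kretprobe tfp_kretprobe = {
	.kp.symbol_name	= "try_to_free_pages",
	.entry_handler	= tfp_entry,
	.handler	= tfp_ret,
	.data_size	= sizeof(struct tfp_data),
	.maxactive	= 64,
};

static int __init tfp_init(void)
{
	return register_kretprobe(&tfp_kretprobe);
}

static void __exit tfp_exit(void)
{
	unregister_kretprobe(&tfp_kretprobe);
}

module_init(tfp_init);
module_exit(tfp_exit);
MODULE_LICENSE("GPL");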

I compared the maximum read throughput that could be obtained using direct
IO streams against standard filesystem IO through the page cache on one of
the dense storage systems that we vend. Direct IO throughput was 55%
higher than standard filesystem IO. I can't remember the last time I
measured this, but I know it was over 15 years ago, and I am quite sure
the gap was no more than 10% back then. I was pretty sure that direct
reclaims were to blame for most of this, and it would only take a few more
tests to prove it. At 23GB/s it takes only 32.6 seconds to fill the page
cache on one of these systems (roughly 750GB), but that is enough time to
measure throughput before any page replacement occurs. Measured over that
window, direct IO throughput was only 13.5% higher. It was pretty clear
that direct reclaims were causing a substantial reduction in throughput,
and I decided this would be the ideal way to show the benefits of
threading kswapd.

On the UEK4 kernel, six kswapd threads provided a 48% increase in
throughput over one. When I ran the same tests on upstream kernel
4.16.0-rc7, I only saw a 20% increase with six threads, the numbers
fluctuated quite a bit when watched with iostat at a 2 second sample
interval, and the output stalled periodically. When I profiled the system
with perf, I saw that about 70% of the CPU time was being spent in a
single function, native_queued_spin_lock_slowpath(): 38% of total CPU time
was spent there under shrink_inactive_list() and another 34% under
__lru_cache_add().

I eventually determined that my tests present a difficult pattern for the
logic that uses shadow entries to periodically resize the LRU lists. This
was not a problem on the UEK4 kernel, which also has shadow entries, so
something has changed in that regard. I have not had time to really dig
into this particular problem; however, I assume that those who are more
familiar with the code might look at the test results below and have an
idea about what is going on.

I have appended a small patch to the end of this cover letter that
effectively disables most of the routines in mm/workingset.c, so that
filesystem IO can be used to demonstrate the benefits of a threaded
kswapd. With workingset_eviction() returning NULL early, no shadow entry
is stored at eviction time, so refault detection never triggers and the
shadow node bookkeeping has nothing to track. I am not suggesting that
this is the correct solution to the problem.

The test results below are from the same runs that were used to
demonstrate threaded kswapd performance. For more context, read the patch
commit log before continuing; the results below will make more sense.

Direct IO results are roughly the same as expected ...

Test #1: Direct IO - shadow entries enabled
dd sy dd_cpu throughput
6 0 2.33 14726026.40
10 1 2.95 19954974.80
16 1 2.63 24419689.30
22 1 2.63 25430303.20
28 1 2.91 26026513.20
34 1 2.53 26178618.00
40 1 2.18 26239229.20
46 1 1.91 26250550.40
52 1 1.69 26251845.60
58 1 1.54 26253205.60
64 1 1.43 26253780.80
70 1 1.31 26254154.80
76 1 1.21 26253660.80
82 1 1.12 26254214.80
88 1 1.07 26253770.00
90 1 1.04 26252406.40

Going through the page cache is a different story entirely. Let's look at
throughput with a single kswapd thread per node, with shadow entries
enabled vs. disabled:

shadow entries ENABLED, 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.96 35.52 34.94 7964174.80 0 460161197 0
16 8 40.75 84.86 81.92 11143540.00 0 907793664 0
22 12 45.01 99.96 99.98 12790778.40 6751 884827215 162344947
28 18 49.10 99.97 99.97 14410621.02 17989 719328362 536886953
34 22 52.87 99.80 99.98 14331978.80 25180 609680315 661201785
40 26 55.66 99.90 99.96 14612901.20 26843 449047388 810399311
46 28 56.37 99.74 99.96 15831410.40 33854 518952367 807944791
52 37 59.78 99.80 99.97 15264190.80 37042 372258422 881626890
58 50 71.90 99.44 99.53 14979692.40 45761 190511392 1114810023
64 53 72.14 99.84 99.95 14747164.80 83665 168461850 1013498958
70 50 68.09 99.80 99.90 15176129.60 113546 203506041 1008655113
76 59 73.77 99.73 99.96 14947922.40 98798 137174015 1057487320
82 66 79.25 99.66 99.98 14624100.40 100242 101830859 1074332196
88 73 81.26 98.85 99.98 14827533.60 101262 90402914 1086186724
90 78 85.48 99.55 99.98 14469963.20 101063 75722196 1083603245

shadow entries DISABLED, 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901

When shadow entries are disabled, kernel mode CPU consumption drops and
peak throughput increases by 13.7%.

Here is the same test with 4 kswapd threads:

shadow entries ENABLED, 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 6 30.09 17.36 16.82 7692440.40 0 460386412 0
16 11 42.86 34.35 33.86 10836456.80 23 885908695 550482
22 14 46.00 55.30 50.53 13125285.20 0 1075382922 0
28 17 43.74 87.18 44.18 15298355.20 0 1254927179 0
34 26 53.78 99.88 89.93 16203179.20 3443 1247514636 80817567
40 35 62.99 99.88 97.58 16653526.80 15376 960519369 369681969
46 36 51.66 99.85 90.87 18668439.60 10907 1239045416 259575692
52 46 66.96 99.61 99.96 16970211.60 24264 751180033 577278765
58 52 76.53 99.91 99.97 15336601.60 30676 513418729 725394427
64 58 78.20 99.79 99.96 15266654.40 33466 450869495 791218349
70 65 82.98 99.93 99.98 15285421.60 35647 370270673 843608871
76 69 81.52 99.87 99.87 15681812.00 37625 358457523 889023203
82 78 85.68 99.97 99.98 15370775.60 39010 302132025 921379809
88 85 88.52 99.88 99.56 15410439.20 40100 267031806 947441566
90 88 90.11 99.67 99.41 15400593.20 40443 249090848 953893493

shadow entries DISABLED, 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.09 16.65 14.17 7842605.60 0 459105291 0
16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515
22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0
28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0
34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0
40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0
46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0
52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0
58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0
64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821
70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159
76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763
82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704
88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202
90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615

With four kswapd threads, the effects are more pronounced. Kernel mode CPU
consumption is substantially higher with shadow entries enabled while
throughput is substantially lower.

When shadow entries are disabled, the additional kswapd tasks raise peak
throughput by roughly 40% over a single thread, while kernel mode CPU
consumption stays roughly the same.
---
mm/workingset.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/workingset.c b/mm/workingset.c
index b7d616a3bbbe..656451ce2d5e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -213,6 +213,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	unsigned long eviction;
 	struct lruvec *lruvec;
 
+	return NULL;
 	/* Page is fully exclusive and pins page->mem_cgroup */
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
--

Buddy Lumpkin (1):
vmscan: Support multiple kswapd threads per node

Documentation/sysctl/vm.txt | 21 ++++++++
include/linux/mm.h | 2 +
include/linux/mmzone.h | 10 +++-
kernel/sysctl.c | 10 ++++
mm/page_alloc.c | 15 ++++++
mm/vmscan.c | 116 +++++++++++++++++++++++++++++++++++++-------
6 files changed, 155 insertions(+), 19 deletions(-)
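
For readers who skim only the cover letter, the general shape of the change
is roughly the following. This is an illustrative sketch, not the actual
patch: the sysctl name, the thread limit, and the per-node array layout are
assumptions here, so please refer to the patch itself for the real
implementation.

/*
 * Illustrative sketch only -- not the patch. Assumes a sysctl (called
 * kswapd_threads here) bounded by MAX_KSWAPD_THREADS, and a per-node
 * array of kswapd task pointers added to struct pglist_data:
 *
 *	struct task_struct *kswapd[MAX_KSWAPD_THREADS];
 */
#define MAX_KSWAPD_THREADS	16

int kswapd_threads = 1;			/* hypothetical vm.kswapd_threads */

int kswapd_run(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	int i, ret = 0;

	for (i = 0; i < kswapd_threads; i++) {
		if (pgdat->kswapd[i])
			continue;
		pgdat->kswapd[i] = kthread_run(kswapd, pgdat, "kswapd%d:%d",
					       nid, i);
		if (IS_ERR(pgdat->kswapd[i])) {
			/* not fatal as long as one worker is running */
			pr_err("Failed to start kswapd%d:%d\n", nid, i);
			ret = PTR_ERR(pgdat->kswapd[i]);
			pgdat->kswapd[i] = NULL;
			break;
		}
	}
	return ret;
}

kswapd_stop() and the hotplug paths would presumably need matching loops;
the patch itself has the details.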

--
1.8.3.1