Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to free pages

From: Michal Hocko
Date: Tue Feb 25 2020 - 04:09:52 EST


On Fri 21-02-20 13:08:24, Sultan Alsawaf wrote:
[...]
> Both of these logs are attached in a tarball.

Thanks! First of all
$ grep pswp vmstat.1582318979
pswpin 0
pswpout 0

suggests that you do not have any swap storage, right? I will get back
to this later. Now, let's have a look at the snapshots. We have regular
1s snapshots initially but then we have:
vmstat.1582318734
vmstat.1582318736
vmstat.1582318758
vmstat.1582318763
vmstat.1582318768
[...]
vmstat.1582318965
vmstat.1582318975
vmstat.1582318976

That is a 242s time period during which even a simple bash script was
struggling to write a snapshot of /proc/vmstat, which by itself
shouldn't really depend on the system activity much. Let's have a look
at two randomly chosen consecutive snapshots from this time period:

                        vmstat.1582318736       vmstat.1582318758
                        base                    diff
allocstall_dma          0                       0
allocstall_dma32        0                       0
allocstall_movable      5773                    0
allocstall_normal       906                     0

To my surprise there was no invocation of the direct reclaim in this
time period. I would have expected to see some considering the long
stall, but the source of the stall might be something other than the
direct reclaim.
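
For anybody who wants to reproduce the base/diff columns used here, a
minimal sketch follows. It assumes each snapshot is a plain copy of
/proc/vmstat ("name value" per line); the function names are mine, and
the inline sample values are reconstructed from the base and diff
numbers quoted in this mail rather than read from the real files.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly "<name> <integer>".
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def vmstat_diff(base_text, next_text):
    """Return {name: (base, diff)} for counters present in both snapshots."""
    base = parse_vmstat(base_text)
    nxt = parse_vmstat(next_text)
    return {name: (base[name], nxt[name] - base[name])
            for name in base if name in nxt}


if __name__ == "__main__":
    # Stand-ins for two snapshot files (values rebuilt from base + diff).
    base_snapshot = "allocstall_movable 5773\nallocstall_normal 906\ncompact_stall 13\n"
    next_snapshot = "allocstall_movable 5773\nallocstall_normal 906\ncompact_stall 14\n"
    result = vmstat_diff(base_snapshot, next_snapshot)
    for name, (base, diff) in sorted(result.items()):
        print(f"{name:24}{base:>12}{diff:>12}")
```

Counters that appear in only one of the two snapshots are dropped
rather than reported with a made-up baseline.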

compact_stall           13                      1

Direct compaction has been invoked but this shouldn't cause a major
stall for all processes.

nr_active_anon          133932                  236
nr_inactive_anon        9350                    -1179
nr_active_file          318                     190
nr_inactive_file        673                     56
nr_unevictable          51984                   0

The amount of anonymous memory is not really high (~560MB) but the file
LRU is _really_ low (~3MB), and the unevictable list is at ~200MB. That
gets us to ~760M of memory, which is 74% of the memory. Please note that
your mem=2G setup gives you only 1G of memory in fact (based on the
zone_info you have posted). That is not unusual in itself, but the
amount of page cache is worrying because I would expect heavy thrashing,
as most of the executables are going to require major faults. Anonymous
memory obviously cannot be swapped out, so there is no other option than
to refault constantly.
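
The sizes above are just page counts converted to bytes. A small sketch
of that arithmetic, assuming 4 KiB pages (the page size is my
assumption, it is not stated in the logs) and using the base column
values; the helper names are made up for illustration. It should land
close to the rounded figures quoted above.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, not stated in the logs
MIB = 1024 * 1024

# Base column of the LRU counters quoted in this mail (in pages).
BASE = {
    "nr_active_anon": 133932,
    "nr_inactive_anon": 9350,
    "nr_active_file": 318,
    "nr_inactive_file": 673,
    "nr_unevictable": 51984,
}


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count to MiB."""
    return pages * page_size / MIB


def lru_summary(counters, usable_mib):
    """Summarize LRU sizes in MiB and their share of usable memory."""
    anon = pages_to_mib(counters["nr_active_anon"] + counters["nr_inactive_anon"])
    file_lru = pages_to_mib(counters["nr_active_file"] + counters["nr_inactive_file"])
    unevictable = pages_to_mib(counters["nr_unevictable"])
    total = anon + file_lru + unevictable
    return {
        "anon_mib": anon,
        "file_mib": file_lru,
        "unevictable_mib": unevictable,
        "total_mib": total,
        "share_of_usable": total / usable_mib,
    }


if __name__ == "__main__":
    # usable_mib=1024 reflects the ~1G actually available with mem=2G.
    for key, value in lru_summary(BASE, usable_mib=1024).items():
        print(f"{key:18}{value:10.2f}")
```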

pgscan_kswapd           64788716                14157035
pgsteal_kswapd          29378868                4393216
pswpin                  0                       0
pswpout                 0                       0
workingset_activate     3840226                 169674
workingset_refault      29396942                4393013
workingset_restore      2883042                 106358

And here we can see it clearly happening. Note how the working set
refaults match the amount of memory reclaimed by kswapd.
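
To put a number on that, here is a rough sketch comparing the diff
column values. The two ratio names are my own labels, not kernel
terminology: one is how much of what kswapd scanned it actually
reclaimed, the other is how many refaults happened per reclaimed page.
A refault ratio close to 1 means nearly everything kswapd freed was
faulted straight back in.

```python
# Diff column values for the interval, as quoted in this mail.
INTERVAL = {
    "pgscan_kswapd": 14157035,
    "pgsteal_kswapd": 4393216,
    "workingset_refault": 4393013,
}


def reclaim_ratios(diff):
    """Compute reclaim efficiency and refaults per reclaimed page."""
    scanned = diff["pgscan_kswapd"]
    stolen = diff["pgsteal_kswapd"]
    refaulted = diff["workingset_refault"]
    return {
        # Fraction of scanned pages that kswapd managed to reclaim.
        "reclaim_efficiency": stolen / scanned,
        # Refaults per page reclaimed by kswapd in the same interval.
        "refault_per_reclaimed": refaulted / stolen,
    }


if __name__ == "__main__":
    for name, value in reclaim_ratios(INTERVAL).items():
        print(f"{name:24}{value:8.4f}")
```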

I would be really curious whether adding swap space would help some.

Now to your patch and why it helps here. It seems quite obvious that the
only effectively reclaimable memory (page cache) is not going to satisfy
the high watermark target:
Node 0, zone    DMA32
  pages free     87925
        min      11090
        low      13862
        high     16634

kswapd has some feedback mechanism to back off when the zone is hopeless
from the reclaim point of view AFAIR, but it seems it has failed in this
particular situation. It should have relied on the direct reclaim and
eventually triggered the OOM killer. Your patch has worked around this
by bailing out from the kswapd reclaim too early, so that the part of
the page cache required for the code to make progress stays resident.

The proper fix should, however, check the amount of reclaimable pages
and back off if they cannot meet the target IMO. We cannot rely on the
general reclaimability here because that could really be thrashing.
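
To make the idea concrete, here is a toy model of the check I have in
mind. This is not kernel code and not a proposed patch; the function
names are hypothetical, and the free page count in the example is made
up for illustration. Only the file LRU size and the high watermark are
taken from the numbers in this mail.

```python
def reclaimable_pages(file_pages, anon_pages, has_swap):
    """Pages that reclaim could realistically free.

    Anonymous pages only count when there is swap to write them to.
    """
    return file_pages + (anon_pages if has_swap else 0)


def kswapd_should_back_off(free_pages, reclaimable, high_wmark):
    """Hypothetical check: True when freeing every reclaimable page
    would still leave the zone below its high watermark."""
    return free_pages + reclaimable < high_wmark


if __name__ == "__main__":
    # File LRU (318 + 673 pages) and anon LRU from the base snapshot.
    reclaimable = reclaimable_pages(file_pages=318 + 673,
                                    anon_pages=133932 + 9350,
                                    has_swap=False)
    # free_pages is a made-up value; high_wmark is from the zone info.
    print(kswapd_should_back_off(free_pages=12000,
                                 reclaimable=reclaimable,
                                 high_wmark=16634))
```

With swap available the anonymous pages would count as reclaimable and
the same check would let kswapd continue, which is one more reason I am
curious about the swap experiment.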

Thoughts?
--
Michal Hocko
SUSE Labs