Re: [PATCH] mm/vmscan: add sysctl knobs for protecting the working set

From: Alexey Avramov
Date: Fri Dec 03 2021 - 08:27:37 EST


>I'd also like to know where that malfunction happens in this case.

User-space processes constantly need access to their shared libraries
to work. These can take tens or hundreds of megabytes, depending on the
type of workload. This is hot cache: when it is pushed out and then
immediately read back in, the result is thrashing. The kernel has no
way to forbid evicting a minimum amount of file cache.
This is the problem that the patch solves. And the malfunction is exactly
that - the inability of the kernel to hold the minimum amount of the
hottest cache in memory.

Another explanation:

> in normal operation you will have nearly all of your executables and
> libraries sitting in good ol' physical RAM. But when RAM runs low, but
> not low enough for the out-of-memory killer to be run, these pages are
> evicted from RAM. So you end up with a situation where pages are
> evicted -- at first, no problem, because they are evicted
> least-recently-used first and it kicks out pages you aren't using
> anyway. But then, it kicks out the ones you are using, just to have
> to page them right back in moments later. Thrash city.
-- [0]

Just look at prelockd [1]. This is a process that mlocks the mmapped
libraries/binaries of existing processes. The result of its work:
it's impossible to provoke thrashing under memory pressure, at least
without swap. And the OOM killer comes *instantly* when it runs.
Please see the demo [2]. We can get the same effect by setting
vm.clean_min_kbytes=250000, for example.
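
To make the "mmapped libraries/binaries of existing processes" part
concrete, here is a minimal sketch of how such a file list can be
collected from the text of /proc/<pid>/maps. This is not prelockd's
actual code; the function name file_backed_mappings is made up for
illustration, and only the documented maps line format is assumed.

```python
def file_backed_mappings(maps_text):
    """Return the set of file paths backing mappings in maps_text.

    maps_text is the content of a /proc/<pid>/maps file.
    Hypothetical helper for illustration, not taken from prelockd.
    """
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping: no pathname column
        inode, path = fields[4], fields[5].strip()
        # Skip pseudo-entries such as [heap], [stack], [vdso]
        if inode == "0" or not path.startswith("/"):
            continue
        # Skip files that were unlinked after being mapped
        if path.endswith(" (deleted)"):
            continue
        paths.add(path)
    return paths
```

A tool in this spirit would then map each of these files and pin the
pages with mlock(2), so that reclaim cannot evict them.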

>something PSI should be able to help with

PSI acts post-factum: on the basis of PSI we react when memory
pressure is already high. PSI cannot help *prevent* thrashing.
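
A minimal sketch of what a PSI-based monitor has to do illustrates
this. It assumes the /proc/pressure/memory text format ("some" and
"full" lines with avg10/avg60/avg300/total fields); the names
parse_psi and should_react and the threshold of 10.0 are hypothetical,
chosen only for this example.

```python
def parse_psi(text):
    """Parse the content of a /proc/pressure/<resource> file.

    Returns e.g. {"some": {"avg10": 1.5, ...}, "full": {...}}.
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind = parts[0]  # "some" or "full"
        values = {}
        for item in parts[1:]:
            key, _, val = item.partition("=")
            # "total" is a counter in microseconds, the rest are percentages
            values[key] = int(val) if key == "total" else float(val)
        result[kind] = values
    return result


def should_react(psi, threshold=10.0):
    """True only once the 'full' avg10 value has reached the threshold.

    The threshold is an arbitrary illustrative value.
    """
    return psi["full"]["avg10"] >= threshold
```

avg10 is an average over the last 10 seconds, so should_react() can
only become true after tasks have already been stalling on memory.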

Using the vm.clean_min_kbytes knob allows you to get OOM *before*
memory/io pressure gets high and keeps the system manageable instead
of livelocking indefinitely.
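
For a rough feeling of how far a system is from such a reserve, the
amount of clean file cache can be estimated from user space. The
sketch below is only an approximation built on /proc/meminfo fields
(Active(file) + Inactive(file) - Dirty); it is an assumption made for
illustration, not the accounting done inside the kernel by the patch.
The names clean_file_kb and below_reserve are hypothetical.

```python
def clean_file_kb(meminfo_text):
    """Estimate clean file cache in kB from /proc/meminfo content.

    Approximation: Active(file) + Inactive(file) - Dirty.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    file_kb = fields.get("Active(file)", 0) + fields.get("Inactive(file)", 0)
    return max(file_kb - fields.get("Dirty", 0), 0)


def below_reserve(meminfo_text, reserve_kb=250000):
    """True if the estimated clean file cache is under reserve_kb."""
    return clean_file_kb(meminfo_text) < reserve_kb
```

With a reserve of 250000 kB, below_reserve() turning true roughly
marks the point where further eviction would start to eat into the
cache that the knob is meant to keep in memory.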

Demo [3]: playing supertux under stress, fs on HDD,
vm.clean_low_kbytes=250000, no thrashing, no freeze,
io pressure close to 0.

Yet another demo [4]: no stalls in the case that was reported [5] by
Artem S. Tashkinov in 2019. Interestingly, in that thread ndrw
suggested [6] the right solution:

> Would it be possible to reserve a fixed (configurable) amount of RAM
> for caches, and trigger OOM killer earlier, before most UI code is
> evicted from memory? In my use case, I am happy sacrificing e.g. 0.5GB
> and kill runaway tasks _before_ the system freezes. Potentially OOM
> killer would also work better in such conditions. I almost never work
> at close to full memory capacity, it's always a single task that goes
> wrong and brings the system down.

> The problem with PSI sensing is that it works after the fact (after
> the freeze has already occurred). It is not very different from issuing
> SysRq-f manually on a frozen system, although it would still be a
> handy feature for batched tasks and remote access.

but Michal Hocko immediately criticized [7] the proposal unfairly.
This patch just implements ndrw's suggestion.

[0] https://serverfault.com/a/319818
[1] https://github.com/hakavlad/prelockd

[2] https://www.youtube.com/watch?v=vykUrP1UvcI
In this video: running a fast memory hog in a loop on Debian 10 GNOME,
4 GiB MemTotal without swap space. FS is ext4 on *HDD*.
- 1. prelockd enabled: about 500 MiB mlocked. Starting
`while true; do tail /dev/zero; done`: no freezes.
The OOM killer comes quickly, the system recovers quickly.
- 2. prelockd disabled: system hangs.

[3] https://www.youtube.com/watch?v=g9GCmp-7WXw
[4] https://www.youtube.com/watch?v=iU3ikgNgp3M
[5] Let's talk about the elephant in the room - the Linux kernel's
inability to gracefully handle low memory pressure
https://lore.kernel.org/all/d9802b6a-949b-b327-c4a6-3dbca485ec20@xxxxxxx/
[6] https://lore.kernel.org/all/806F5696-A8D6-481D-A82F-49DEC1F2B035@xxxxxxxxxxxxxx/
[7] https://lore.kernel.org/all/20190808163228.GE18351@xxxxxxxxxxxxxx/