Re: Regression of madvise(MADV_COLD) on shmem?

From: Yu Zhao
Date: Sat Mar 05 2022 - 04:18:00 EST


On Fri, Mar 4, 2022 at 5:18 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> On Fri, Mar 04, 2022 at 05:55:58PM +0000, Ivan Teterevkov wrote:
> > Hi folks,
> >
> > I want to check if there's a regression in the madvise(MADV_COLD) behaviour
> > with shared memory or my understanding of how it works is inaccurate.
> >
> > The MADV_COLD advice was introduced in Linux 5.4 and allowed the users to
> > mark selected memory ranges as more "inactive" than others, overruling the
> > default LRU accounting. It helped to preserve the working set of an
> > application. With more recent kernels, e.g. at least 5.17.0-rc6 and 5.10.42,
> > MADV_COLD has stopped working as expected. Please take a look at a short
> > program that demonstrates it:
> >
> > /*
> > * madvise(MADV_COLD) demo.
> > */
> > #include <assert.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <sys/mman.h>
> > #include <unistd.h>
> >
> > /* Requires the kernel 5.4 or newer. */
> > #ifndef MADV_COLD
> > #define MADV_COLD 20
> > #endif
> >
> > #define GIB(x) ((size_t)(x) << 30)
> >
> > int main(void)
> > {
> >     char *shmem, *zeroes;
> >     int page_size = getpagesize();
> >     size_t i;
> >
> >     /* Allocate 8 GiB of shared memory. */
> >     shmem = mmap(/* addr */ NULL,
> >                  /* length */ GIB(8),
> >                  /* prot */ PROT_READ | PROT_WRITE,
> >                  /* flags */ MAP_SHARED | MAP_ANONYMOUS,
> >                  /* fd */ -1,
> >                  /* offset */ 0);
> >     assert(shmem != MAP_FAILED);
> >
> >     /* Allocate a zero page for future use. */
> >     zeroes = calloc(1, page_size);
> >     assert(zeroes != NULL);
> >
> >     /* Put 1 GiB blob at the beginning of the shared memory range. */
> >     memset(shmem, 0xaa, GIB(1));
> >
> >     /* Read memory adjacent to the blob. */
> >     for (i = GIB(1); i < GIB(8); i = i + page_size) {
> >         int res = memcmp(shmem + i, zeroes, page_size);
> >         assert(res == 0);
> >
> >         /* Cooldown a zero page and make it "less active" than the blob.
> >          * Under memory pressure, it'll likely become a reclaim target
> >          * and thus will help to preserve the blob in memory.
> >          */
> >         res = madvise(shmem + i, page_size, MADV_COLD);
> >         assert(res == 0);
> >     }
> >
> >     /* Let the user check smaps. */
> >     printf("done\n");
> >     pause();
> >
> >     free(zeroes);
> >     munmap(shmem, GIB(8));
> >
> >     return 0;
> > }
> >
> > How to run this program:
> >
> > 1. Create a "test" cgroup with a memory limit of 3 GiB.
> >
> > 1.1. cgroup v1:
> >
> > # mkdir /sys/fs/cgroup/memory/test
> > # echo 3G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> >
> > 1.2. cgroup v2:
> >
> > # mkdir /sys/fs/cgroup/test
> > # echo 3G > /sys/fs/cgroup/test/memory.max
> >
> > 2. Enable at least a 1 GiB swap device.
> >
> > 3. Run the program in the "test" cgroup:
> >
> > # cgexec -g memory:test ./a.out
> >
> > 4. Wait until it has finished, i.e. has printed "done".
> >
> > 5. Check the shared memory VMA stats.
> >
> > 5.1. In 5.17.0-rc6 and 5.10.42:
> >
> > # cat /proc/$(pidof a.out)/smaps | grep -A 21 -B 1 8388608
> > 7f8ed4648000-7f90d4648000 rw-s 00000000 00:01 2055 /dev/zero (deleted)
> > Size: 8388608 kB
> > KernelPageSize: 4 kB
> > MMUPageSize: 4 kB
> > Rss: 3119556 kB
> > Pss: 3119556 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 3119556 kB
> > Private_Dirty: 0 kB
> > Referenced: 0 kB
> > Anonymous: 0 kB
> > LazyFree: 0 kB
> > AnonHugePages: 0 kB
> > ShmemPmdMapped: 0 kB
> > FilePmdMapped: 0 kB
> > Shared_Hugetlb: 0 kB
> > Private_Hugetlb: 0 kB
> > Swap: 1048576 kB
> > SwapPss: 0 kB
> > Locked: 0 kB
> > THPeligible: 0
> > VmFlags: rd wr sh mr mw me ms sd
> >
> > 5.2. In 5.4.109:
> >
> > # cat /proc/$(pidof a.out)/smaps | grep -A 21 -B 1 8388608
> > 7fca5f78b000-7fcc5f78b000 rw-s 00000000 00:01 173051 /dev/zero (deleted)
> > Size: 8388608 kB
> > KernelPageSize: 4 kB
> > MMUPageSize: 4 kB
> > Rss: 3121504 kB
> > Pss: 3121504 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 2072928 kB
> > Private_Dirty: 1048576 kB
> > Referenced: 0 kB
> > Anonymous: 0 kB
> > LazyFree: 0 kB
> > AnonHugePages: 0 kB
> > ShmemPmdMapped: 0 kB
> > FilePmdMapped: 0 kB
> > Shared_Hugetlb: 0 kB
> > Private_Hugetlb: 0 kB
> > Swap: 0 kB
> > SwapPss: 0 kB
> > Locked: 0 kB
> > THPeligible: 0
> > VmFlags: rd wr sh mr mw me ms
> >
> > There's a noticeable difference in the "Swap" reports so that the older
> > kernel doesn't swap the blob, but the newer ones do.
> >
> > According to ftrace, the newer kernels still call deactivate_page() in
> > madvise_cold():
> >
> > # trace-cmd record -p function_graph -g madvise_cold
> > # trace-cmd report | less
> > a.out-4877 [000] 1485.266106: funcgraph_entry: | madvise_cold() {
> > a.out-4877 [000] 1485.266115: funcgraph_entry: | walk_page_range() {
> > a.out-4877 [000] 1485.266116: funcgraph_entry: | __walk_page_range() {
> > a.out-4877 [000] 1485.266117: funcgraph_entry: | madvise_cold_or_pageout_pte_range() {
> > a.out-4877 [000] 1485.266118: funcgraph_entry: 0.179 us | deactivate_page();
> >
> > (The irrelevant bits are removed for brevity.)
> >
> > It makes me think there may be a regression in MADV_COLD. Please let me know
> > what you reckon.
>
> Since deactivate_page() is called, I guess that's not a regression(?) from [1].
>
> My guess is that the "Swap" difference you describe is related to
> "workingset detection for anon page", which was merged into v5.8 and changed
> the balancing policy between the file and anonymous LRUs.
> It would be helpful if you could try it on v5.7 and v5.8.
>
> [1] 12e967fd8e4e6, mm: do not allow MADV_PAGEOUT for CoW page

Yes, I noticed this a while ago. With commit b518154e59a ("mm/vmscan:
protect the workingset on anonymous LRU"), anon/shmem pages start on
the inactive lru, and in this case, deactivate_page() is a NOP. Before
5.9, anon/shmem pages started on the active lru, so deactivate_page()
moved the zero pages in the test to the inactive lru and therefore
protected the "blob".

This should fix the problem for your test case:

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..7fd99f037ca7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -563,7 +559,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)

 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageActive(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page)) {
 		int nr_pages = thp_nr_pages(page);

 		del_page_from_lru_list(page, lruvec);
@@ -677,7 +673,7 @@ void deactivate_file_page(struct page *page)
  */
 void deactivate_page(struct page *page)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageLRU(page) && !PageUnevictable(page)) {
 		struct pagevec *pvec;

 		local_lock(&lru_pvecs.lock);

I'll leave it to Minchan to decide whether this is worth fixing,
together with this one:

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..2f142f09c8e1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -529,10 +529,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 	if (PageUnevictable(page))
 		return;

-	/* Some processes are using the page */
-	if (page_mapped(page))
-		return;
-
 	del_page_from_lru_list(page, lruvec);
 	ClearPageActive(page);
 	ClearPageReferenced(page);