[PATCH] mm: madvise: MADV_DONTNEED_LOCKED

From: Johannes Weiner
Date: Thu Mar 03 2022 - 16:30:03 EST


MADV_DONTNEED historically rejects mlocked ranges, but with
MLOCK_ONFAULT and MCL_ONFAULT allowing memory to be mlocked without
populating it, there are valid use cases for depopulating locked
ranges as well.

Users mlock memory to protect secrets. There are allocators for secure
buffers that want in-use memory to be mlocked in general, but want
cleared and invalidated memory to give up its physical pages. This
could of course be done with explicit munlock -> mlock calls on
free -> alloc, but that adds two unnecessary syscalls, heavy mmap_lock
write locks, and vma splits and re-merges - only to get rid of the
backing pages.

Users also mlockall(MCL_ONFAULT) to suppress sustained paging, but are
okay with on-demand initial population. It seems valid to selectively
free some memory during the lifetime of such a process, without having
to mess with its overall policy.

Why add a separate flag? Isn't this a pretty niche use case?

- MADV_DONTNEED has been bailing on locked vmas forever. It's at least
  conceivable that someone, somewhere is relying on mlock to protect
  data from perhaps broader invalidation calls. Changing this behavior
  now could lead to quiet data corruption.

- It also clarifies expectations around MADV_FREE and maybe
  MADV_REMOVE. It avoids the situation where one quietly behaves
  differently from the others. MADV_FREE_LOCKED can be added later.

- The combination of mlock() and madvise() in the first place is
  probably niche. But where it happens, I'd say that dropping pages
  from a locked region once they no longer contain secrets or won't
  be paged anymore is much saner than relying on mlock to protect
  memory from speculative or errant invalidation calls. It's just that
  we can't change the default behavior because of the two previous
  points.

Given that, an explicit new flag seems to make the most sense.

Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
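For illustration only, here is a minimal userspace sketch of how a
secure-buffer allocator might use the new flag. The helper name
demo_dontneed_locked() is hypothetical, the fallback #defines assume the
values from this patch and the existing uapi headers, and mlock2() is
assumed to be provided by the libc. The sketch locks a page on fault,
touches it, then asks the kernel to drop it while the range stays
locked.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older libc headers may not carry these yet (assumed values). */
#ifndef MADV_DONTNEED_LOCKED
#define MADV_DONTNEED_LOCKED 24	/* value introduced by this patch */
#endif
#ifndef MLOCK_ONFAULT
#define MLOCK_ONFAULT 0x01
#endif

/*
 * Hypothetical demo, not an interface added by this patch.
 * Returns  1 if the page was dropped and reads back as zero,
 *          0 if the running kernel rejects the advice with EINVAL,
 *         -1 on any other failure.
 */
static int demo_dontneed_locked(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	int ret;

	if (page <= 0)
		return -1;

	buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/*
	 * Lock without populating. Best effort only: RLIMIT_MEMLOCK may
	 * refuse, and going by the check in this patch the advice is
	 * also accepted on ranges that are not locked.
	 */
	(void)mlock2(buf, (size_t)page, MLOCK_ONFAULT);

	buf[0] = 0x5a;	/* fault the page in; stands in for a secret */

	if (madvise(buf, (size_t)page, MADV_DONTNEED_LOCKED) == 0)
		ret = (buf[0] == 0) ? 1 : -1;	/* refault yields a zero page */
	else
		ret = (errno == EINVAL) ? 0 : -1;

	munmap(buf, (size_t)page);
	return ret;
}
```

A real allocator would wipe the buffer before dropping it. On kernels
without the flag it would presumably have to fall back to the
munlock -> MADV_DONTNEED -> mlock sequence described above.
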
 include/uapi/asm-generic/mman-common.h |  2 ++
 mm/madvise.c                           | 16 +++++++++++++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1567a3294c3d..6c1aa92a92e4 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -75,6 +75,8 @@
 #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */
 #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */

+#define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
+
 /* compatibility flags */
 #define MAP_FILE 0

diff --git a/mm/madvise.c b/mm/madvise.c
index 5604064df464..12dfa14bc985 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -800,6 +800,13 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 	return 0;
 }

+static bool can_madv_dontneed_free(struct vm_area_struct *vma, int behavior)
+{
+	if (behavior == MADV_DONTNEED_LOCKED)
+		return !(vma->vm_flags & (VM_HUGETLB|VM_PFNMAP));
+	return can_madv_lru_vma(vma);
+}
+
 static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  struct vm_area_struct **prev,
 				  unsigned long start, unsigned long end,
@@ -808,7 +815,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;

 	*prev = vma;
-	if (!can_madv_lru_vma(vma))
+
+	if (!can_madv_dontneed_free(vma, behavior))
 		return -EINVAL;

 	if (!userfaultfd_remove(vma, start, end)) {
@@ -830,7 +838,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 */
 			return -ENOMEM;
 		}
-		if (!can_madv_lru_vma(vma))
+		if (!can_madv_dontneed_free(vma, behavior))
 			return -EINVAL;
 		if (end > vma->vm_end) {
 			/*
@@ -850,7 +858,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 		VM_WARN_ON(start >= end);
 	}

-	if (behavior == MADV_DONTNEED)
+	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
 		return madvise_dontneed_single_vma(vma, start, end);
 	else if (behavior == MADV_FREE)
 		return madvise_free_single_vma(vma, start, end);
@@ -988,6 +996,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		return madvise_pageout(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
+	case MADV_DONTNEED_LOCKED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
@@ -1113,6 +1122,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_DONTNEED_LOCKED:
 	case MADV_FREE:
 	case MADV_COLD:
 	case MADV_PAGEOUT:
--
2.35.1