Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED

From: Qi Zheng
Date: Tue Jun 03 2025 - 03:24:43 EST


Hi Jann,

On 5/30/25 10:06 PM, Jann Horn wrote:
On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.

Currently, the mmap_lock is always held during these operations, even when
unnecessary. This causes lock contention and can lead to severe priority
inversion, where low-priority threads, such as Android's HeapTaskDaemon,
hold the lock and block higher-priority threads.

This patch enables the use of per-VMA locks when the advised range lies
entirely within a single VMA, avoiding the need for full VMA traversal. In
practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
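
Roughly, the fast path amounts to the sketch below (illustrative only, not
the patch's literal diff; lock_vma_under_rcu() and vma_end_read() are the
existing per-VMA lock primitives, and the madvise_do_behavior() call is
simplified):

    vma = lock_vma_under_rcu(mm, start);    /* no mmap_lock taken */
    if (vma && end <= vma->vm_end) {
            /* whole range lies inside this one VMA */
            error = madvise_do_behavior(mm, start, end, behavior);
            vma_end_read(vma);
            return error;
    }
    if (vma)
            vma_end_read(vma);
    /* otherwise fall back to the usual mmap_lock path */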

Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
benefits from this per-VMA lock optimization. After extended runtime,
217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
only 1,231 fell back to mmap_lock.

To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
userfaultfd_remove().
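
In code terms this is just one more bail-out condition on the fast path,
along the lines of the sketch below (userfaultfd_armed() is the existing
predicate; the fallback label is hypothetical):

    if (userfaultfd_armed(vma)) {
            /*
             * A userfaultfd-armed VMA would require the
             * userfaultfd_remove() notification dance, which can
             * drop and retake locks; keep it on the mmap_lock path.
             */
            vma_end_read(vma);
            goto fallback;          /* retry under mmap_lock */
    }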

One important quirk of this is that it can, from what I can see, cause
freeing of page tables (through pt_reclaim) without holding the mmap
lock at all:

do_madvise [behavior=MADV_DONTNEED]
  madvise_lock
  lock_vma_under_rcu
  madvise_do_behavior
    madvise_single_locked_vma
      madvise_vma_behavior
        madvise_dontneed_free
          madvise_dontneed_single_vma
            zap_page_range_single_batched [.reclaim_pt = true]
              unmap_single_vma
                unmap_page_range
                  zap_p4d_range
                    zap_pud_range
                      zap_pmd_range
                        zap_pte_range
                          try_get_and_clear_pmd
                          free_pte
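
Concretely, the tail of that chain does roughly the following (a sketch
based on the function names above; whole_pte_table_empty is a stand-in for
the actual emptiness check, and the argument lists are approximate):

    /* end of zap_pte_range(), with .reclaim_pt set: */
    if (whole_pte_table_empty &&
        try_get_and_clear_pmd(mm, pmd, &pmdval))
            /* PTE table page is queued for an RCU-deferred free */
            free_pte(mm, addr, tlb, pmdval);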

This clashes with the assumption in walk_page_range_novma() that
holding the mmap lock in write mode is sufficient to prevent
concurrent page table freeing, so it can probably lead to page table
UAF through the ptdump interface (see ptdump_walk_pgd()).
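
For reference, ptdump_walk_pgd() relies on exactly that assumption;
abridged from mm/ptdump.c (details may differ across kernel versions):

    void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm,
                         pgd_t *pgd)
    {
            const struct ptdump_range *range = st->range;

            mmap_write_lock(mm);    /* assumed to pin all page tables */
            while (range->start != range->end) {
                    walk_page_range_novma(mm, range->start, range->end,
                                          &ptdump_ops, pgd, st);
                    range++;
            }
            mmap_write_unlock(mm);
    }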

Maybe not? The PTE page is freed via RCU in zap_pte_range(), so consider
the following case:

cpu 0                                           cpu 1

ptdump_walk_pgd
--> walk_pte_range
    --> pte_offset_map (hold RCU read lock)
                                                zap_pte_range
                                                --> free_pte (via RCU)
    walk_pte_range_inner
    --> ptdump_pte_entry (the PTE page is not freed at this time)
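
In other words, the reader and the freer already synchronize through RCU,
along these lines (a sketch of the pattern, not literal kernel code):

    /* reader side, inside walk_pte_range(): */
    pte = pte_offset_map(pmd, addr);    /* enters an RCU read-side section */
    if (!pte)
            return 0;                   /* PTE table already gone, skip it */
    /* ... ptdump_pte_entry() reads the PTE entries here ... */
    pte_unmap(pte);                     /* leaves the RCU read-side section */

    /* freeing side, via free_pte(): the PTE table page is handed to
     * call_rcu(), so it is actually freed only after every reader that
     * could still see it has left its RCU read-side section. */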

IIUC, there is no UAF issue here?

If I missed anything please let me know.

Thanks,
Qi