Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED

From: Qi Zheng
Date: Wed Jun 04 2025 - 02:02:31 EST


Hi Lorenzo,

On 6/3/25 5:54 PM, Lorenzo Stoakes wrote:
On Tue, Jun 03, 2025 at 03:24:28PM +0800, Qi Zheng wrote:
Hi Jann,

On 5/30/25 10:06 PM, Jann Horn wrote:
On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.

Currently, the mmap_lock is always held during these operations, even when
unnecessary. This causes lock contention and can lead to severe priority
inversion, where low-priority threads—such as Android's HeapTaskDaemon—
hold the lock and block higher-priority threads.

This patch enables the use of per-VMA locks when the advised range lies
entirely within a single VMA, avoiding the need for full VMA traversal. In
practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.

Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
benefits from this per-VMA lock optimization. After extended runtime,
217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
only 1,231 fell back to mmap_lock.

To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
userfaultfd_remove().
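
For illustration, a rough sketch of the fast path described above (a hypothetical
simplification, not the actual patch; the branch structure and the exact helper
called on the per-VMA path are assumptions):

```c
/*
 * Hypothetical simplification of the fast path described above, not the
 * actual patch: take the per-VMA lock when the whole range fits in one
 * VMA and userfaultfd is not armed, otherwise fall back to mmap_lock.
 */
vma = lock_vma_under_rcu(mm, start);
if (vma && end <= vma->vm_end && !userfaultfd_armed(vma)) {
	/* per-VMA locked path: no mmap_lock taken at all */
	madvise_dontneed_single_vma(vma, start, end);
	vma_end_read(vma);
} else {
	if (vma)
		vma_end_read(vma);
	/* slow path: fall back to the existing mmap_lock-based walk */
	mmap_read_lock(mm);
	/* ... existing do_madvise() VMA traversal over [start, end) ... */
	mmap_read_unlock(mm);
}
```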

One important quirk of this is that it can, from what I can see, cause
freeing of page tables (through pt_reclaim) without holding the mmap
lock at all:

do_madvise [behavior=MADV_DONTNEED]
madvise_lock
lock_vma_under_rcu
madvise_do_behavior
madvise_single_locked_vma
madvise_vma_behavior
madvise_dontneed_free
madvise_dontneed_single_vma
zap_page_range_single_batched [.reclaim_pt = true]
unmap_single_vma
unmap_page_range
zap_p4d_range
zap_pud_range
zap_pmd_range
zap_pte_range
try_get_and_clear_pmd
free_pte

This clashes with the assumption in walk_page_range_novma() that
holding the mmap lock in write mode is sufficient to prevent
concurrent page table freeing, so it can probably lead to page table
UAF through the ptdump interface (see ptdump_walk_pgd()).
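
Roughly, the ptdump pattern the concern is about looks like this (paraphrased
from memory, not verbatim kernel code):

```c
/*
 * Paraphrase of the ptdump pattern in question (not verbatim kernel
 * code): the walker takes the mmap lock in write mode and relies on
 * that to keep the page tables from being freed underneath it.
 */
mmap_write_lock(mm);
walk_page_range_novma(mm, start, end, &ptdump_ops, pgd, st);
mmap_write_unlock(mm);
```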

Maybe not? The PTE page is freed via RCU in zap_pte_range(), so in the
following case:

cpu 0                                       cpu 1

ptdump_walk_pgd
--> walk_pte_range
    --> pte_offset_map (hold RCU read lock)
                                            zap_pte_range
                                            --> free_pte (via RCU)
    walk_pte_range_inner
    --> ptdump_pte_entry (the PTE page is not freed at this time)

IIUC, there is no UAF issue here?
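
For reference, a minimal sketch of the reader side I have in mind (simplified,
not verbatim kernel code):

```c
/*
 * Minimal sketch of the reader side (simplified, not verbatim kernel
 * code): pte_offset_map() enters an RCU read-side critical section, so a
 * PTE table queued for RCU freeing by free_pte() cannot actually be
 * freed until pte_unmap() is called, even if it is already detached.
 */
pte_t *start_pte, *pte;

start_pte = pte_offset_map(pmd, addr);
if (!start_pte)
	return;			/* PTE table already gone, nothing to dump */
for (pte = start_pte; addr < end; pte++, addr += PAGE_SIZE) {
	pte_t entry = ptep_get(pte);
	/* entry may be stale or empty, but the table itself is still valid */
}
pte_unmap(start_pte);		/* leaves the RCU read-side critical section */
```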

If I missed anything please let me know.

Thanks,
Qi



I forgot about that interesting placement of RCU lock acquisition :) I will
obviously let Jann come back to you on this, but I wonder if I need to
update the doc to reflect this actually.

I saw that there is already a relevant description in process_addrs.rst:


```
So accessing PTE-level page tables requires at least holding an RCU read lock;
but that only suffices for readers that can tolerate racing with concurrent
page table updates such that an empty PTE is observed (in a page table that
has actually already been detached and marked for RCU freeing) while another
new page table has been installed in the same location and filled with
entries. Writers normally need to take the PTE lock and revalidate that the
PMD entry still refers to the same PTE-level page table.
If the writer does not care whether it is the same PTE-level page table, it
can take the PMD lock and revalidate that the contents of pmd entry still meet
the requirements. In particular, this also happens in :c:func:`!retract_page_tables`
when handling :c:macro:`!MADV_COLLAPSE`.
```
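
As a concrete illustration of the writer rule quoted above, a minimal sketch
(assuming the usual pte_offset_map_lock() behaviour of rechecking the pmd entry
before returning; not taken from any specific kernel function):

```c
/*
 * Minimal sketch of the writer rule quoted above (assumption: relies on
 * pte_offset_map_lock() revalidating the pmd entry before returning; not
 * taken from any specific kernel function).
 */
spinlock_t *ptl;
pte_t *pte;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
	goto retry;	/* PTE table was freed or replaced; restart the walk */

/*
 * The table we locked is the one currently installed under this pmd,
 * so it is safe to modify its entries now.
 */
set_pte_at(mm, addr, pte, newpte);

pte_unmap_unlock(pte, ptl);
```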

Thanks!