On 18.06.25 11:52, Barry Song wrote:
On Wed, Jun 18, 2025 at 10:25 AM Lance Yang <lance.yang@xxxxxxxxx> wrote:
Hi all,
Crazy, the per-VMA lock for madvise is an absolute game-changer ;)
On 2025/6/17 21:38, Lorenzo Stoakes wrote:
[...]
On Sun, Jun 08, 2025 at 10:01:50AM +1200, Barry Song wrote:
From: Barry Song <v-songbaohua@xxxxxxxx>
Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.
Currently, the mmap_lock is always held during these operations, even when
unnecessary. This causes lock contention and can lead to severe priority
inversion, where low-priority threads—such as Android's HeapTaskDaemon—
hold the lock and block higher-priority threads.
This patch enables the use of per-VMA locks when the advised range lies
entirely within a single VMA, avoiding the need for full VMA traversal. In
practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
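
For illustration, the common case described above looks roughly like this from userspace: a heap-like anonymous mapping where MADV_DONTNEED is issued on a sub-range that lies inside one mapping. This is a minimal sketch, not code from the patch. The helper name dontneed_single_vma() is hypothetical, and whether the kernel takes the per-VMA lock or falls back to mmap_lock cannot be observed from this program.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS and madvise() */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map npages anonymous pages, dirty them, then drop the
 * interior pages with MADV_DONTNEED. The advised range sits inside a single
 * mapping. Returns 1 if the dropped range reads back zero-filled while the
 * first and last pages keep their data, 0 if not, -1 on error.
 */
int dontneed_single_vma(size_t npages)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	unsigned char *heap;
	int ok;

	assert(npages >= 3);	/* need at least one interior page */
	if (page <= 0)
		return -1;
	len = npages * (size_t)page;

	heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		return -1;

	memset(heap, 0xaa, len);	/* fault in and dirty every page */

	/* Advise only [page, len - page): entirely within this mapping. */
	if (madvise(heap + page, len - 2 * (size_t)page, MADV_DONTNEED) != 0) {
		munmap(heap, len);
		return -1;
	}

	/* Private anonymous pages are zero-filled on the next touch. */
	ok = heap[0] == 0xaa && heap[len - 1] == 0xaa &&
	     heap[page] == 0 && heap[len - (size_t)page - 1] == 0;

	munmap(heap, len);
	return ok;
}
```

Calling dontneed_single_vma(4) exercises a four-page mapping with two interior pages dropped.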
Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
benefits from this per-VMA lock optimization. After extended runtime,
217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
only 1,231 fell back to mmap_lock.
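
As a quick cross-check of the call counts above, the share of calls that took the per-VMA path can be computed directly. This is only a sketch with a hypothetical helper name. Note that the 99.5% figure quoted above refers to memory reclaimed, not to the number of calls, so the two percentages need not match exactly.

```c
#include <assert.h>

/* Percentage of calls that took the fast path, given both call counts. */
double fast_path_share(unsigned long fast, unsigned long fallback)
{
	assert(fast + fallback > 0);
	return 100.0 * (double)fast / (double)(fast + fallback);
}
```

With the counts reported above, fast_path_share(217735, 1231) comes out at roughly 99.4.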
To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
userfaultfd_remove().
Many thanks to Lorenzo's work[1] on:
"Refactor the madvise() code to retain state about the locking mode
utilised for traversing VMAs.
Then use this mechanism to permit VMA locking to be done later in the
madvise() logic and also to allow altering of the locking mode to permit
falling back to an mmap read lock if required."
One important caveat, raised by Jann[2], is that
untagged_addr_remote() requires holding mmap_lock. This is because
address tagging on x86 and RISC-V is quite complex.
Until untagged_addr_remote() becomes atomic—which seems unlikely in
the near future—we cannot support per-VMA locks for remote processes.
So for now, only local processes are supported.
Just to put some numbers on it, I ran a micro-benchmark with 100
parallel threads, where each thread calls madvise() on its own 1GiB
chunk of 64KiB mTHP-backed memory. The performance gain is huge:
1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s (~47%
less time)
2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
less time)
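
The percentages quoted above are reductions in average time. Expressed as a speedup factor they look somewhat different, which the following sketch makes explicit. The helper names are hypothetical and the inputs are just the four timings reported in this mail.

```c
#include <assert.h>

/* Relative reduction in time, in percent: (old - new) / old * 100. */
double time_reduction_pct(double old_s, double new_s)
{
	assert(old_s > 0.0);
	return (old_s - new_s) / old_s * 100.0;
}

/* Speedup factor: how many times faster the new run is (old / new). */
double speedup_factor(double old_s, double new_s)
{
	assert(new_s > 0.0);
	return old_s / new_s;
}
```

For MADV_DONTNEED (0.0508s to 0.0270s) this gives a reduction of about 46.9% and a speedup of about 1.88x. For MADV_FREE (0.3078s to 0.1095s) it gives about 64.4% and about 2.81x.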
Thanks for the report, Lance. I assume your micro-benchmark includes some
explicit or implicit operations that may require mmap_write_lock(), since
mmap_read_lock() only waits for writers and does not block other
mmap_read_lock() calls.
The numbers rather suggest that one test was run with (m)THPs enabled and the other without? Just a thought. In my experience, the locking overhead is not that significant.