[PATCH] mm/khugepaged: Fix ->anon_vma race

From: Jann Horn
Date: Wed Jan 11 2023 - 08:36:45 EST


If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
it to be locked. retract_page_tables() bails out if an ->anon_vma is
attached, but does this check before holding the mmap lock (as the comment
above the check explains).

If we racily merge an existing ->anon_vma (shared with a child process)
from a neighboring VMA, subsequent rmap traversals on pages belonging to
the child will be able to see the page tables that we are concurrently
removing while assuming that nothing else can access them.

Repeat the ->anon_vma check once we hold the mmap lock to ensure that there
really is no concurrent page table access.

Reported-by: Zach O'Keefe <zokeefe@xxxxxxxxxx>
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>
---
zokeefe@ pointed out to me that the current code (after my last round of patches)
can hit a lockdep assert by racing, and after staring at it a bit I've
convinced myself that this is a real, preexisting bug.
(I haven't written a reproducer for it though. One way to hit it might be
something along the lines of:

- set up a process A with a private-file-mapping VMA V1
- let A fork() to create process B, thereby copying V1 in A to V1' in B
- let B extend the end of V1'
- let B put some anon pages into the extended part of V1'
- let A map a new private-file-mapping VMA V2 directly behind V1, without
an anon_vma
[race begins here]
- in A's thread 1: begin retract_page_tables() on V2, run through first
->anon_vma check
- in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it
merges the anon_vma of V1 (which implies V1 and V2 must be mapping the
same file at compatible offsets)
- in B: trigger rmap traversal on anon page in V1'

mm/khugepaged.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5cb401aa2b9d..0bfed37f3a3b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
* has higher cost too. It would also probably require locking
* the anon_vma.
*/
- if (vma->anon_vma) {
+ if (READ_ONCE(vma->anon_vma)) {
result = SCAN_PAGE_ANON;
goto next;
}
@@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
result = SCAN_PTE_MAPPED_HUGEPAGE;
if ((cc->is_khugepaged || is_target) &&
mmap_write_trylock(mm)) {
+ /*
+ * Re-check whether we have an ->anon_vma, because
+ * collapse_and_free_pmd() requires that either no
+ * ->anon_vma exists or the anon_vma is locked.
+ * We already checked ->anon_vma above, but that check
+ * is racy because ->anon_vma can be populated under the
+ * mmap lock in read mode.
+ */
+ if (vma->anon_vma) {
+ result = SCAN_PAGE_ANON;
+ goto unlock_next;
+ }
/*
* When a vma is registered with uffd-wp, we can't
* recycle the pmd pgtable because there can be pte

base-commit: 7dd4b804e08041ff56c88bdd8da742d14b17ed25
--
2.39.0.314.g84b9a713c41-goog