On Fri, May 30, 2025 at 6:42 PM Anthony Yznaga
<anthony.yznaga@xxxxxxxxxx> wrote:
On 5/30/25 7:56 AM, Jann Horn wrote:
On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@xxxxxxxxxx> wrote:
In preparation for enabling the handling of page faults in an mshare[...]
region provide a way to link an mshare shared page table to a process
page table and otherwise find the actual vma in order to handle a page
fault. Modify the unmap path to ensure that page tables in mshare regions
are unlinked and kept intact when a process exits or an mshare region
is explicitly unmapped.
Signed-off-by: Khalid Aziz <khalid@xxxxxxxxxx>
Signed-off-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
Signed-off-by: Anthony Yznaga <anthony.yznaga@xxxxxxxxxx>
diff --git a/mm/memory.c b/mm/memory.c[...]
index db558fe43088..68422b606819 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -259,7 +260,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,[...]
next = p4d_addr_end(addr, end);
if (p4d_none_or_clear_bad(p4d))
continue;
- free_pud_range(tlb, p4d, addr, next, floor, ceiling);
+ if (unlikely(shared_pud))
+ p4d_clear(p4d);
+ else
+ free_pud_range(tlb, p4d, addr, next, floor, ceiling);
} while (p4d++, addr = next, addr != end);
start &= PGDIR_MASK;
+static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details)
+{
+ /*
+ * The msharefs vma is being unmapped. Do not unmap pages in the
+ * mshare region itself.
+ */
+}
Unmapping a VMA has three major phases:
1. unlinking the VMA from the VMA tree
2. removing the VMA contents
3. removing unneeded page tables
The MM subsystem broadly assumes that after phase 2, no stuff is
mapped in the region anymore and therefore changes to the backing file
don't need to TLB-flush this VMA anymore, and unlinks the mapping from
rmaps and such. If munmap() of an mshare region only removes the
mapping of shared page tables in step 3, as implemented here, that
means things like TLB flushes won't be able to discover all
currently-existing mshare mappings of a host MM through rmap walks.
I think it would make more sense to remove the links to shared page
tables in step 2 (meaning in mshare_vm_op_unmap_page_range), just like
hugetlb does, and not modify free_pgtables().
That makes sense. I'll make this change.
Related: I think there needs to be a strategy for preventing walking
of mshare host page tables through an mshare VMA by codepaths relying
on MM/VMA locks, because those locks won't have an effect on the
underlying host MM. For example, I think the only reason fork() is
safe with your proposal is that copy_page_range() skips shared VMAs,
and I think non-fast get_user_pages() could maybe hit use-after-free
of page tables or such?
I guess the only clean strategy for that is to ensure that all
locking-based page table walking code does a check for "is this an
mshare VMA?" and, if yes, either bails immediately or takes extra
locks on the host MM (which could get messy).