Re: [PATCH v2 12/20] mm/mshare: prepare for page table sharing support
From: Jann Horn
Date: Mon Jun 02 2025 - 11:40:18 EST
On Fri, May 30, 2025 at 6:42 PM Anthony Yznaga
<anthony.yznaga@xxxxxxxxxx> wrote:
> On 5/30/25 7:56 AM, Jann Horn wrote:
> > On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@xxxxxxxxxx> wrote:
> >> In preparation for enabling the handling of page faults in an mshare
> >> region, provide a way to link an mshare shared page table to a process
> >> page table and otherwise to find the actual vma in order to handle a
> >> page fault. Modify the unmap path to ensure that page tables in mshare
> >> regions are unlinked and kept intact when a process exits or an mshare
> >> region is explicitly unmapped.
> >>
> >> Signed-off-by: Khalid Aziz <khalid@xxxxxxxxxx>
> >> Signed-off-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> >> Signed-off-by: Anthony Yznaga <anthony.yznaga@xxxxxxxxxx>
> > [...]
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index db558fe43088..68422b606819 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> > [...]
> >> @@ -259,7 +260,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
> >> next = p4d_addr_end(addr, end);
> >> if (p4d_none_or_clear_bad(p4d))
> >> continue;
> >> - free_pud_range(tlb, p4d, addr, next, floor, ceiling);
> >> + if (unlikely(shared_pud))
> >> + p4d_clear(p4d);
> >> + else
> >> + free_pud_range(tlb, p4d, addr, next, floor, ceiling);
> >> } while (p4d++, addr = next, addr != end);
> >>
> >> start &= PGDIR_MASK;
> > [...]
> >> +static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
> >> + struct vm_area_struct *vma,
> >> + unsigned long addr, unsigned long end,
> >> + struct zap_details *details)
> >> +{
> >> + /*
> >> + * The msharefs vma is being unmapped. Do not unmap pages in the
> >> + * mshare region itself.
> >> + */
> >> +}
> >
> > Unmapping a VMA has three major phases:
> >
> > 1. unlinking the VMA from the VMA tree
> > 2. removing the VMA contents
> > 3. removing unneeded page tables
> >
> > The MM subsystem broadly assumes that after phase 2, nothing is
> > mapped in the region anymore, so changes to the backing file no
> > longer need to TLB-flush this VMA, and it unlinks the mapping from
> > rmaps and such. If munmap() of an mshare region only removes the
> > mapping of shared page tables in phase 3, as implemented here, then
> > things like TLB flushes won't be able to discover all
> > currently-existing mshare mappings of a host MM through rmap walks.
> >
> > I think it would make more sense to remove the links to shared page
> > tables in phase 2 (meaning in mshare_vm_op_unmap_page_range), just
> > like hugetlb does, and not modify free_pgtables().
>
> That makes sense. I'll make this change.
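Sounds good. For what it's worth, roughly what I had in mind
(completely untested sketch; it assumes the mshare region is always
P4D-aligned and that the only process-side linkage is p4d entries
pointing at the host MM's PUD tables, matching the free_p4d_range()
hunk above):

static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
		struct vm_area_struct *vma,
		unsigned long addr, unsigned long end,
		struct zap_details *details)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long next;
	pgd_t *pgd;
	p4d_t *p4d;

	/*
	 * Unlink the host MM's shared page tables from this process
	 * instead of zapping individual pages. The shared tables
	 * themselves stay intact; they are owned and torn down by the
	 * host MM.
	 */
	for (; addr != end; addr = next) {
		next = p4d_addr_end(addr, end);
		pgd = pgd_offset(mm, addr);
		if (pgd_none_or_clear_bad(pgd))
			continue;
		p4d = p4d_offset(pgd, addr);
		if (p4d_none(*p4d))
			continue;
		p4d_clear(p4d);
		/* make sure the mmu_gather flushes the unlinked range */
		tlb_flush_p4d_range(tlb, addr, next - addr);
	}
}

That way free_pgtables() later sees empty p4d entries and the normal
phase 3 logic just works, without the shared_pud special case.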
Related: I think there needs to be a strategy for preventing codepaths
that rely on MM/VMA locks from walking mshare host page tables through
an mshare VMA, because those locks have no effect on the underlying
host MM. For example, I think the only reason fork() is safe with your
proposal is that copy_page_range() skips shared VMAs, and I think
non-fast get_user_pages() could hit a use-after-free of page tables or
something similar?
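(The fork() skip I mean is vma_needs_copy() in mm/memory.c, which, if
I'm reading it right, returns false for a plain shared file mapping
with no anon_vma and no uffd-wp, so copy_page_range() never descends
into the mshare range:

static bool
vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
{
	if (userfaultfd_wp(dst_vma))
		return true;

	if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
		return true;

	if (src_vma->anon_vma)
		return true;

	/* anything else can be refilled from the backing file on fault */
	return false;
}

That feels more like an accident than a guarantee, though.)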
I guess the only clean strategy for that is to ensure that all
locking-based page table walking code does a check for "is this an
mshare VMA?" and, if yes, either bails immediately or takes extra
locks on the host MM (which could get messy).
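I.e., roughly, at the top of every such walker (vma_is_mshare() here
is a hypothetical name for whatever predicate this series ends up
exposing):

	/* GUP slow path and other walkers relying on mmap_lock: */
	if (unlikely(vma_is_mshare(vma))) {
		/*
		 * The page tables under this VMA belong to the mshare
		 * host MM; our mmap_lock / per-VMA lock does not keep
		 * them alive. Either bail out here or take the host
		 * MM's locks before walking further.
		 */
		return -EFAULT;
	}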