Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED

From: Barry Song
Date: Tue Jun 03 2025 - 03:52:07 EST


On Tue, Jun 3, 2025 at 2:56 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Sat, May 31, 2025 at 12:00 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > On Fri, May 30, 2025 at 10:07 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> > > On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > > unsigned long start, size_t len_in,
> > > > struct madvise_behavior *madv_behavior)
> > > > {
> > > > + struct vm_area_struct *vma = madv_behavior->vma;
> > > > int behavior = madv_behavior->behavior;
> > > > +
> > > > struct blk_plug plug;
> > > > unsigned long end;
> > > > int error;
> > > >
> > > > if (is_memory_failure(behavior))
> > > > return madvise_inject_error(behavior, start, start + len_in);
> > > > - start = untagged_addr_remote(mm, start);
> > > > + start = untagged_addr(start);
> > >
> > > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > > the mmap lock is held, which is no longer the case here with your
> > > patch, but untagged_addr() seems wrong here, since we can be operating
> > > on another process. I think especially on X86 with 5-level paging and
> > > LAM, there can probably be cases where address bits are used for part
> > > of the virtual address in one task while they need to be masked off in
> > > another task?
> > >
> > > I wonder if you'll have to refactor X86 and Risc-V first to make this
> > > work... ideally by making sure that their address tagging state
> > > updates are atomic and untagged_area_remote() works locklessly.
> >
> > If possible, can we try to avoid this at least for this stage? We all
> > agree that
> > a per-VMA lock for DONTNEED is long overdue. The main goal of the patch
> > is to drop the mmap_lock for high-frequency madvise operations like
> > MADV_DONTNEED and potentially MADV_FREE. For these two cases, it's highly
> > unlikely that one process would be managing the memory of another. In v2,
> > we're modifying common code, which is why we ended up here.
> >
> > We could consider doing:
> >
> > if (current->mm == mm)
> > untagged_addr(start);
> > else
> > untagged_addr_remote(mm, start);
>
> Ah, in other words, basically make it so that for now we don't use VMA
> locking on remote processes, and then we can have two different paths
> here for "local operation" and "MM-locked operation"? That's not very
> pretty but from a correctness perspective I'm fine with that.

Right, that’s exactly what I mean—we may hold off on remote `madvise` for
now unless we can find a straightforward way to fix up the architecture
code, especially since the tagging implementations in x86 and RISC-V are
quite confusing.

It’s particularly tricky for RISC-V, which supports two different PMLEN
values simultaneously. Resolving the untagging issue will likely require
extensive discussions with both the x86 and RISC-V architecture teams.

long set_tagged_addr_ctrl(struct task_struct *task, unsigned long arg)
{
...
/*
* Prefer the smallest PMLEN that satisfies the user's request,
* in case choosing a larger PMLEN has a performance impact.
*/
pmlen = FIELD_GET(PR_PMLEN_MASK, arg);
if (pmlen == PMLEN_0) {
pmm = ENVCFG_PMM_PMLEN_0;
} else if (pmlen <= PMLEN_7 && have_user_pmlen_7) {
pmlen = PMLEN_7;
pmm = ENVCFG_PMM_PMLEN_7;
} else if (pmlen <= PMLEN_16 && have_user_pmlen_16) {
pmlen = PMLEN_16;
pmm = ENVCFG_PMM_PMLEN_16;
} else {
return -EINVAL;
}
...
}

It’s strange that x86’s LAM U48 was rejected, while RISC-V’s PMLEN values
of 7 and 16 were both accepted.

Another reason is that we’re not too concerned about remote `madvise`
at this stage, as our current focus is on high-frequency cases like
`MADV_DONTNEED`, and possibly `MADV_FREE`.

Thanks
Barry