Re: [PATCH v5] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE behavior

From: Hugh Dickins
Date: Thu Apr 08 2021 - 23:08:31 EST


On Thu, 8 Apr 2021, Axel Rasmussen wrote:
> On Tue, Apr 6, 2021 at 4:49 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > On Mon, Apr 05, 2021 at 10:19:17AM -0700, Axel Rasmussen wrote:
...
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
...
> > > +
> > > + if (is_file_backed) {
> > > + /* The shmem MAP_PRIVATE case requires checking the i_size */
> > > + inode = dst_vma->vm_file->f_inode;
> > > + offset = linear_page_index(dst_vma, dst_addr);
> > > + max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > > + ret = -EFAULT;
> > > + if (unlikely(offset >= max_off))
> > > + goto out_unlock;
> >
> > Frankly I don't really know why this must be put into pgtable lock.. Since if
> > not required then it can be moved into UFFDIO_COPY path, as CONTINUE doesn't
> > need it iiuc. Just raise it up as a pure question.
>
> It's not clear to me either. shmem_getpage_gfp() does check this twice
> kinda like we're doing, but it doesn't ever touch the PTL. What it
> seems to be worried about is, what happens if a concurrent
> FALLOC_FL_PUNCH_HOLE happens somewhere in the middle of whatever
> manipulation we're doing? From looking at shmem_fallocate(), I think
> the basic point is that truncation happens while "inode_lock(inode)"
> is held, but neither shmem_mcopy_atomic_pte() or the new
> mcopy_atomic_install_ptes() take that lock.
>
> I'm a bit hesitant to just remove it, run some tests, and then declare
> victory, because it seems plausible it's there to catch some
> semi-hard-to-induce race. I'm not sure how to prove that *isn't*
> needed, so my inclination is to just keep it?
>
> I'll send a series addressing the feedback so far this afternoon, and
> I'll leave this alone for now - at least, it doesn't seem to hurt
> anything. Maybe Hugh or someone else has some more advice about it. If
> so, I'm happy to remove it in a follow-up.

It takes some thinking about, but the i_size check is required to be
under the pagetable lock, for the MAP_PRIVATE UFFDIO_COPY path, where
it is inserting an anonymous page into the file-backed vma (skipping
actually inserting a page into page cache, as an ordinary fault would).

Not because of FALLOC_FL_PUNCH_HOLE (which makes no change to i_size;
and it's okay if a race fills in the hole immediately afterwards),
but because of truncation (which must remove all beyond i_size).

In the MAP_SHARED case, with a locked page inserted into page cache,
the page lock is enough to exclude concurrent truncation. But even
in that case the second i_size check (I'm looking at 5.12-rc's
shmem_mfill_atomic_pte(), rather than recent patches which might differ)
is required: because the first i_size check was done before the page
became visible in page cache, so a concurrent truncation could miss it).

Maybe that first check is redundant, though I'm definitely for doing it;
or maybe shmem_add_to_page_cache() would be better if it made that check
itself, under xas_lock (I think the reason it does not is historical).
The second check, in the MAP_SHARED case, does not need to be under
pagetable lock - the page lock on the page cache page is enough -
but probably Andrea placed it there to resemble the anonymous case.

You might then question, how come there is no i_size check in all of
mm/memory.c, where ordinary faulting is handled. I'll answer that
the pte_same() check, under pagetable lock in wp_page_copy(), is
where the equivalent to userfaultfd's MAP_PRIVATE UFFDIO_COPY check
is made: if the page cache page has already been truncated, that pte
will have been cleared.

Or, if the page cache page is truncated an instant after wp_page_copy()
drops page table lock, then the unmap_mapping_range(,,, even_cows = 1)
which follows truncation has to clean it up. Er, does that mean that
the i_size check I started off insisting is required, actually is not
required? Um, maybe, but let's just keep it and say goodnight!

Hugh