Re: [PATCH v4 2/3] mm/khugepaged: skip shmem with userfaultfd

From: Peter Xu
Date: Tue Feb 21 2023 - 17:13:45 EST


On Fri, Feb 17, 2023 at 05:54:38PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@xxxxxxxxxxxx>
>
> Make sure that collapse_file respects any userfaultfds registered with
> MODE_MISSING. If userspace has any such userfaultfds registered, then
> for any page which it knows to be missing, it may expect a
> UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
> collapsing a shmem range would result in replacing an empty page with a
> THP, to avoid breaking userfaultfd.
>
> Synchronization when checking for userfaultfds in collapse_file is
> tricky because the mmap locks can't be used to prevent races with the
> registration of new userfaultfds. Instead, we provide synchronization by
> ensuring that userspace cannot observe the fact that pages are missing
> before we check for userfaultfds. Although this allows registration of a
> userfaultfd to race with collapse_file, it ensures that userspace cannot
> observe any pages transition from missing to present after such a race
> occurs. This makes such a race indistinguishable to the collapse
> occurring immediately before the userfaultfd registration.
>
> The first step to provide this synchronization is to stop filling gaps
> during the loop iterating over the target range, since the page cache
> lock can be dropped during that loop. The second step is to fill the
> gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> time, to avoid races with accesses to the page cache that only take the
> RCU read lock.
>
> The fact that we don't fill holes during the initial iteration means
> that collapse_file now has to handle faults occurring during the
> collapse. This is done by re-validating the number of missing pages
> after acquiring the page cache lock for the final time.
>
> This fix is targeted at khugepaged, but the change also applies to
> MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> return EBUSY if there are any missing pages (instead of succeeding on
> shmem and returning EINVAL on anonymous memory). There is also now a
> window during MADV_COLLAPSE where a fault on a missing page will cause
> the syscall to fail with EAGAIN.
>
> The fact that intermediate page cache state can no longer be observed
> before the rollback of a failed collapse is also technically a
> userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> is exceedingly unlikely that anything relies on being able to observe
> that transient state.
>
> Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx>

It'll be great to have another eye looking, but... AFAICT this works for
us. Thanks David, this is better than what I suggested.

Acked-by: Peter Xu <peterx@xxxxxxxxxx>

--
Peter Xu