Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd

From: Peter Xu
Date: Wed Feb 15 2023 - 17:49:46 EST


On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@xxxxxxxxxxxx>
>
> Make sure that collapse_file respects any userfaultfds registered with
> MODE_MISSING. If userspace has any such userfaultfds registered, then
> for any page which it knows to be missing, it may expect a
> UFFD_EVENT_PAGEFAULT. This means collapse_file needs to take care when
> collapsing a shmem range would result in replacing an empty page with a
> THP, so that it doesn't break userfaultfd.
>
> Synchronization when checking for userfaultfds in collapse_file is
> tricky because the mmap locks can't be used to prevent races with the
> registration of new userfaultfds. Instead, we provide synchronization by
> ensuring that userspace cannot observe the fact that pages are missing
> before we check for userfaultfds. Although this allows registration of a
> userfaultfd to race with collapse_file, it ensures that userspace cannot
> observe any pages transition from missing to present after such a race.
> This makes such a race indistinguishable to the collapse occurring
> immediately before the userfaultfd registration.
>
> The first step to provide this synchronization is to stop filling gaps
> during the loop iterating over the target range, since the page cache
> lock can be dropped during that loop. The second step is to fill the
> gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> time, to avoid races with accesses to the page cache that only take the
> RCU read lock.
>
> This fix is targeted at khugepaged, but the change also applies to
> MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> return EBUSY if there are any missing pages (instead of succeeding on
> shmem and returning EINVAL on anonymous memory). There is also now a
> window during MADV_COLLAPSE where a fault on a missing page will cause
> the syscall to fail with EAGAIN.
>
> The fact that intermediate page cache state can no longer be observed
> before the rollback of a failed collapse is also technically a
> userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> is exceedingly unlikely that anything relies on being able to observe
> that transient state.
>
> Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx>
> ---
> mm/khugepaged.c | 66 +++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b648f1053d95..8c2e2349e883 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -55,6 +55,7 @@ enum scan_result {
> SCAN_CGROUP_CHARGE_FAIL,
> SCAN_TRUNCATED,
> SCAN_PAGE_HAS_PRIVATE,
> + SCAN_PAGE_FILLED,

PS: You may want to also touch SCAN_STATUS in huge_memory.h next time.

> };
>
> #define CREATE_TRACE_POINTS
> @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> * - allocate and lock a new huge page;
> * - scan page cache replacing old pages with the new one
> * + swap/gup in pages if necessary;
> - * + fill in gaps;

IIUC it's not a complete removal, but just moved downwards:

> * + keep old pages around in case rollback is required;
> + * - finalize updates to the page cache;

+ fill in gaps with RETRY entries
+ detect race conditions with userfaultfds

> * - if replacing succeeds:
> * + copy data over;
> * + free old pages;
> @@ -1805,13 +1806,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> result = SCAN_TRUNCATED;
> goto xa_locked;
> }
> - xas_set(&xas, index);
> + xas_set(&xas, index + 1);
> }
> if (!shmem_charge(mapping->host, 1)) {
> result = SCAN_FAIL;
> goto xa_locked;
> }
> - xas_store(&xas, hpage);
> nr_none++;
> continue;
> }
> @@ -1970,6 +1970,56 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> put_page(page);
> goto xa_unlocked;
> }
> +
> + if (nr_none) {
> + struct vm_area_struct *vma;
> + int nr_none_check = 0;
> +
> + xas_unlock_irq(&xas);
> + i_mmap_lock_read(mapping);
> + xas_lock_irq(&xas);
> +
> + xas_set(&xas, start);
> + for (index = start; index < end; index++) {
> + if (!xas_next(&xas)) {
> + xas_store(&xas, XA_RETRY_ENTRY);
> + nr_none_check++;
> + }
> + }
> +
> + if (nr_none != nr_none_check) {
> + result = SCAN_PAGE_FILLED;
> + goto immap_locked;
> + }
> +
> + /*
> + * If userspace observed a missing page in a VMA with an armed
> + * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> + * that page, so we need to roll back to avoid suppressing such
> + * an event. Any userfaultfds armed after this point will not be
> + * able to observe any missing pages due to the previously
> + * inserted retry entries.
> + */
> + vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> + if (userfaultfd_missing(vma)) {
> + result = SCAN_EXCEED_NONE_PTE;
> + goto immap_locked;
> + }
> + }
> +
> +immap_locked:
> + i_mmap_unlock_read(mapping);
> + if (result != SCAN_SUCCEED) {
> + xas_set(&xas, start);
> + for (index = start; index < end; index++) {
> + if (xas_next(&xas) == XA_RETRY_ENTRY)
> + xas_store(&xas, NULL);
> + }
> +
> + goto xa_locked;
> + }
> + }
> +

Until here, all look fine to me (ignoring patch 1 for now; assuming the
hpage is always uptodate).

My question is after here we'll release page cache lock again before
try_to_unmap_flush(), but is it safe to keep RETRY entries after releasing
page cache lock? It means other threads can be spinning. I assume page
lock is always safe and sleepable, but not sure about the page cache lock
here.

> nr = thp_nr_pages(hpage);
>
> if (is_shmem)
> @@ -2068,15 +2118,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> }
>
> xas_set(&xas, start);
> - xas_for_each(&xas, page, end - 1) {
> + end = index;
> + for (index = start; index < end; index++) {
> + xas_next(&xas);
> page = list_first_entry_or_null(&pagelist,
> struct page, lru);
> if (!page || xas.xa_index < page->index) {
> - if (!nr_none)
> - break;
> nr_none--;
> - /* Put holes back where they were */
> - xas_store(&xas, NULL);
> continue;
> }
>
> @@ -2592,11 +2640,13 @@ static int madvise_collapse_errno(enum scan_result r)
> case SCAN_ALLOC_HUGE_PAGE_FAIL:
> return -ENOMEM;
> case SCAN_CGROUP_CHARGE_FAIL:
> + case SCAN_EXCEED_NONE_PTE:
> return -EBUSY;
> /* Resource temporary unavailable - trying again might succeed */
> case SCAN_PAGE_LOCK:
> case SCAN_PAGE_LRU:
> case SCAN_DEL_PAGE_LRU:
> + case SCAN_PAGE_FILLED:
> return -EAGAIN;
> /*
> * Other: Trying again likely not to succeed / error intrinsic to
> --
> 2.39.1.581.gbfd45094c4-goog
>

--
Peter Xu