Re: [PATCH v7 1/4] userfaultfd: Add UFFD WP Async support

From: Muhammad Usama Anjum
Date: Thu Jan 19 2023 - 10:10:27 EST


Hi Peter,

Thank you so much for reviewing.

On 1/18/23 9:54 PM, Peter Xu wrote:
> Hi, Muhammad,
>
> On Mon, Jan 09, 2023 at 11:45:16AM +0500, Muhammad Usama Anjum wrote:
>> Add new WP Async mode (UFFDIO_WRITEPROTECT_MODE_ASYNC_WP) which resolves
>> the page faults on its own. It can be used to track which pages have
>> been written to from the time the pages were write protected. It is a very
>> efficient way to track changes, as uffd is by nature pte/pmd based.
>>
>> UFFD WP (UFFDIO_WRITEPROTECT_MODE_WP) sends the page faults to the
>> userspace where the pages which have been written-to can be tracked. But
>> it is not efficient. This is why this async version is being added.
>> After setting the WP Async, the pages which have been written to can be
>> found in the pagemap file or information can be obtained from the
>> PAGEMAP_IOCTL (see next patches).
>>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>> ---
>> fs/userfaultfd.c | 150 +++++++++++++++++--------------
>> include/uapi/linux/userfaultfd.h | 6 ++
>> 2 files changed, 90 insertions(+), 66 deletions(-)
>>
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index 15a5bf765d43..be5e10d15058 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -69,6 +69,7 @@ struct userfaultfd_ctx {
>> unsigned int features;
>> /* released */
>> bool released;
>> + bool async;
>
> Let's just make it a feature flag,
>
> UFFD_FEATURE_WP_ASYNC
This would really make things easier. Thank you so much for suggesting it.

>
>> /* memory mappings are changing because of non-cooperative event */
>> atomic_t mmap_changing;
>> /* mm with one ore more vmas attached to this userfaultfd_ctx */
>> @@ -497,80 +498,93 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>
>> /* take the reference before dropping the mmap_lock */
>> userfaultfd_ctx_get(ctx);
>> + if (ctx->async) {
>
> Firstly, please consider not touching the existing code/indent as much as
> what this patch did. Hopefully we can keep the major part of sync uffd be
> there with its git log, it also helps reviewing your code. You can add the
> async block before that, handle the fault and return just earlier.
This is possible. Will do in next revision.

>
> And, I think this is a bit too late because we're going to return with
> VM_FAULT_RETRY here, while maybe we don't need to retry at all here because
> we're going to resolve the page fault immediately.
>
> I assume you added this because you wanted userfaultfd_ctx_get() to make
> sure the uffd context will not go away from under us, but it's not needed
> if we're still holding the mmap read lock. I'd expect for async mode we
> don't really need to release it at all.
I'll have to check what should be returned here. We should return
something which shows that the fault has been resolved.

>
>> + // Resolve page fault of this page
>
> Please use "/* ... */" as that's the common pattern of commenting in the
> Linux kernel, at least what I see in mm/.
Will do.

>
>> + unsigned long addr = (ctx->features & UFFD_FEATURE_EXACT_ADDRESS) ?
>> + vmf->real_address : vmf->address;
>> + struct vm_area_struct *dst_vma = find_vma(ctx->mm, addr);
>> + size_t s = PAGE_SIZE;
>
> This is weird - if we want async uffd-wp, let's consider huge page from the
> 1st day.
>
>> +
>> + if (dst_vma->vm_flags & VM_HUGEPAGE) {
>
> VM_HUGEPAGE is only a hint. It doesn't mean this page is always a huge
> page. For anon, we can have thp wr-protected as a whole, not happening for
> !anon because we'll split already.
>
> For anon, if a write happens to a thp being uffd-wp-ed, we'll keep that pmd
> wr-protected and report the uffd message. The pmd split happens when the
> user invokes UFFDIO_WRITEPROTECT on the small page. I think it'll stop
> working for async uffd-wp because we're going to resolve the page faults
> right away.
>
> So for async uffd-wp (note: this will be different from hugetlb), you may
> want to consider having a pre-requisite patch to change wp_huge_pmd()
> behavior: rather than calling handle_userfault(), IIUC you can also just
> fallback to the split path right below (__split_huge_pmd) so the thp will
> split now even before the uffd message is generated.
I'll make these changes. I wasn't aware that the THP gets split in UFFD
WP. At this time, I'm not sure whether a THP will be handled by
handle_userfault() in one go. Probably it will be, as the length is stored
in the vmf.

>
> I think it should be transparent to the user and it'll start working for
> you with async uffd-wp here, because it means when reaching
> handle_userfault, it should not be possible to have thp at all since they
> should have all split up.
>
>> + s = HPAGE_SIZE;
>> + addr &= HPAGE_MASK;
>> + }
>>
>> - init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
>> - uwq.wq.private = current;
>> - uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
>> - reason, ctx->features);
>> - uwq.ctx = ctx;
>> - uwq.waken = false;
>> -
>> - blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>> + ret = mwriteprotect_range(ctx->mm, addr, s, false, &ctx->mmap_changing);
>
> This is an overkill - we're pretty sure it's a single page, no need to call
> a range function here.
Should change_pte_range() be used here to remove the WP directly?

>
>> + } else {
>> + init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
>> + uwq.wq.private = current;
>> + uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
>> + reason, ctx->features);
>> + uwq.ctx = ctx;
>> + uwq.waken = false;
>>
>> - /*
>> - * Take the vma lock now, in order to safely call
>> - * userfaultfd_huge_must_wait() later. Since acquiring the
>> - * (sleepable) vma lock can modify the current task state, that
>> - * must be before explicitly calling set_current_state().
>> - */
>> - if (is_vm_hugetlb_page(vma))
>> - hugetlb_vma_lock_read(vma);
>> + blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>>
>> - spin_lock_irq(&ctx->fault_pending_wqh.lock);
>> - /*
>> - * After the __add_wait_queue the uwq is visible to userland
>> - * through poll/read().
>> - */
>> - __add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
>> - /*
>> - * The smp_mb() after __set_current_state prevents the reads
>> - * following the spin_unlock to happen before the list_add in
>> - * __add_wait_queue.
>> - */
>> - set_current_state(blocking_state);
>> - spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>> + /*
>> + * Take the vma lock now, in order to safely call
>> + * userfaultfd_huge_must_wait() later. Since acquiring the
>> + * (sleepable) vma lock can modify the current task state, that
>> + * must be before explicitly calling set_current_state().
>> + */
>> + if (is_vm_hugetlb_page(vma))
>> + hugetlb_vma_lock_read(vma);
>>
>> - if (!is_vm_hugetlb_page(vma))
>> - must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
>> - reason);
>> - else
>> - must_wait = userfaultfd_huge_must_wait(ctx, vma,
>> - vmf->address,
>> - vmf->flags, reason);
>> - if (is_vm_hugetlb_page(vma))
>> - hugetlb_vma_unlock_read(vma);
>> - mmap_read_unlock(mm);
>> + spin_lock_irq(&ctx->fault_pending_wqh.lock);
>> + /*
>> + * After the __add_wait_queue the uwq is visible to userland
>> + * through poll/read().
>> + */
>> + __add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
>> + /*
>> + * The smp_mb() after __set_current_state prevents the reads
>> + * following the spin_unlock to happen before the list_add in
>> + * __add_wait_queue.
>> + */
>> + set_current_state(blocking_state);
>> + spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>>
>> - if (likely(must_wait && !READ_ONCE(ctx->released))) {
>> - wake_up_poll(&ctx->fd_wqh, EPOLLIN);
>> - schedule();
>> - }
>> + if (!is_vm_hugetlb_page(vma))
>> + must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
>> + reason);
>> + else
>> + must_wait = userfaultfd_huge_must_wait(ctx, vma,
>> + vmf->address,
>> + vmf->flags, reason);
>> + if (is_vm_hugetlb_page(vma))
>> + hugetlb_vma_unlock_read(vma);
>> + mmap_read_unlock(mm);
>> +
>> + if (likely(must_wait && !READ_ONCE(ctx->released))) {
>> + wake_up_poll(&ctx->fd_wqh, EPOLLIN);
>> + schedule();
>> + }
>>
>> - __set_current_state(TASK_RUNNING);
>> + __set_current_state(TASK_RUNNING);
>>
>> - /*
>> - * Here we race with the list_del; list_add in
>> - * userfaultfd_ctx_read(), however because we don't ever run
>> - * list_del_init() to refile across the two lists, the prev
>> - * and next pointers will never point to self. list_add also
>> - * would never let any of the two pointers to point to
>> - * self. So list_empty_careful won't risk to see both pointers
>> - * pointing to self at any time during the list refile. The
>> - * only case where list_del_init() is called is the full
>> - * removal in the wake function and there we don't re-list_add
>> - * and it's fine not to block on the spinlock. The uwq on this
>> - * kernel stack can be released after the list_del_init.
>> - */
>> - if (!list_empty_careful(&uwq.wq.entry)) {
>> - spin_lock_irq(&ctx->fault_pending_wqh.lock);
>> /*
>> - * No need of list_del_init(), the uwq on the stack
>> - * will be freed shortly anyway.
>> + * Here we race with the list_del; list_add in
>> + * userfaultfd_ctx_read(), however because we don't ever run
>> + * list_del_init() to refile across the two lists, the prev
>> + * and next pointers will never point to self. list_add also
>> + * would never let any of the two pointers to point to
>> + * self. So list_empty_careful won't risk to see both pointers
>> + * pointing to self at any time during the list refile. The
>> + * only case where list_del_init() is called is the full
>> + * removal in the wake function and there we don't re-list_add
>> + * and it's fine not to block on the spinlock. The uwq on this
>> + * kernel stack can be released after the list_del_init.
>> */
>> - list_del(&uwq.wq.entry);
>> - spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>> + if (!list_empty_careful(&uwq.wq.entry)) {
>> + spin_lock_irq(&ctx->fault_pending_wqh.lock);
>> + /*
>> + * No need of list_del_init(), the uwq on the stack
>> + * will be freed shortly anyway.
>> + */
>> + list_del(&uwq.wq.entry);
>> + spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>> + }
>> }
>> -
>> /*
>> * ctx may go away after this if the userfault pseudo fd is
>> * already released.
>> @@ -1861,11 +1875,14 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>> return ret;
>>
>> if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
>> - UFFDIO_WRITEPROTECT_MODE_WP))
>> + UFFDIO_WRITEPROTECT_MODE_WP |
>> + UFFDIO_WRITEPROTECT_MODE_ASYNC_WP))
>> return -EINVAL;
>>
>> - mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
>> + mode_wp = uffdio_wp.mode & (UFFDIO_WRITEPROTECT_MODE_WP |
>> + UFFDIO_WRITEPROTECT_MODE_ASYNC_WP);
>> mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
>> + ctx->async = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ASYNC_WP;
>
> Please no.. ctx attributes shouldn't be easily changed by a single ioctl.
>
> I suggest we have a new feature bit as I mentioned above (say,
> UFFD_FEATURE_WP_ASYNC), set it once with UFFDIO_API and it should apply to
> the whole lifecycle of this uffd handle. That flag should (something I can
> quickly think of):
>
> - Have effect only if the uffd will be registered with WP mode (of
> course) or ignored in any other modes,
>
> - Should fail any attempts of UFFDIO_WRITEPROTECT with wp=false on this
> uffd handle because with async faults no page fault resolution needed
> from userspace,
>
> - Should apply to any region registered with this uffd ctx, so it's
> exclusively used with sync uffd-wp mode.
All of these are necessary and must be done to consolidate the UFFD
interface. Agreed!

>
> Then when the app wants to wr-protect in async mode, it simply goes ahead
> with UFFDIO_WRITEPROTECT(wp=true), it'll happen exactly the same as when it
> was sync mode. It's only the pf handling procedure that's different (along
> with how the fault is reported - rather than as a message but it'll be
> consolidated into the soft-dirty bit).
PF handling will resolve the fault after un-setting the _PAGE_*_UFFD_WP on
the page. I'm not changing the soft-dirty bit. It is too delicate (if you
get the joke).

>
>>
>> if (mode_wp && mode_dontwake)
>> return -EINVAL;
>> @@ -2126,6 +2143,7 @@ static int new_userfaultfd(int flags)
>> ctx->flags = flags;
>> ctx->features = 0;
>> ctx->released = false;
>> + ctx->async = false;
>> atomic_set(&ctx->mmap_changing, 0);
>> ctx->mm = current->mm;
>> /* prevent the mm struct to be freed */
>> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
>> index 005e5e306266..b89665653861 100644
>> --- a/include/uapi/linux/userfaultfd.h
>> +++ b/include/uapi/linux/userfaultfd.h
>> @@ -284,6 +284,11 @@ struct uffdio_writeprotect {
>> * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
>> * any wait thread after the operation succeeds.
>> *
>> + * UFFDIO_WRITEPROTECT_MODE_ASYNC_WP: set the flag to write protect a
>> + * range, the flag is unset automatically when the page is written.
>> + * This is used to track which pages have been written to from the
>> + * time the memory was write protected.
>> + *
>> * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
>> * therefore DONTWAKE flag is meaningless with WP=1. Removing write
>> * protection (WP=0) in response to a page fault wakes the faulting
>> @@ -291,6 +296,7 @@ struct uffdio_writeprotect {
>> */
>> #define UFFDIO_WRITEPROTECT_MODE_WP ((__u64)1<<0)
>> #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE ((__u64)1<<1)
>> +#define UFFDIO_WRITEPROTECT_MODE_ASYNC_WP ((__u64)1<<2)
>> __u64 mode;
>> };
>>
>> --
>> 2.30.2
>>
>

I should have added Suggested-by: Peter Xu <peterx@xxxxxxxxxx> to this
patch. I'll add it in the next revision if you don't object.

I've started working on next revision. I'll reply to other highly valuable
review emails a bit later.

--
BR,
Muhammad Usama Anjum