Re: [PATCH v4 1/2] mm/uffd: UFFD_FEATURE_WP_UNPOPULATED

From: David Hildenbrand
Date: Mon Mar 20 2023 - 06:22:37 EST



(1) With huge page disabled
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
./uffd_wp_perf
Test DEFAULT: 4
Test PRE-READ: 1111453 (pre-fault 1101011)
Test MADVISE: 278276 (pre-fault 266378)

Thinking about it, I guess the biggest slowdown here is the "one fake pagefault at a time" handling.

Test WP-UNPOPULATE: 11712

(2) With Huge page enabled
echo always > /sys/kernel/mm/transparent_hugepage/enabled
./uffd_wp_perf
Test DEFAULT: 4
Test PRE-READ: 22521 (pre-fault 22348)
Test MADVISE: 4909 (pre-fault 4743)
Test WP-UNPOPULATE: 14448

There'll be a great perf boost for no-thp case, while for thp enabled with
extreme case of all-thp-zero WP_UNPOPULATED can be slower than MADVISE, but
that's low possibility in reality, also the overhead was not reduced but
postponed until a follow up write on any huge zero thp, so potentially it
is faster by making the follow up writes slower.

[1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@xxxxxxxxxxxxx/
[2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
[3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@xxxxxxxxxxxxx/
[4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c

Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
---
Documentation/admin-guide/mm/userfaultfd.rst | 17 ++++++
fs/userfaultfd.c | 16 ++++++
include/linux/mm_inline.h | 6 +++
include/linux/userfaultfd_k.h | 23 ++++++++
include/uapi/linux/userfaultfd.h | 10 +++-
mm/memory.c | 56 +++++++++++++++-----
mm/mprotect.c | 51 ++++++++++++++----
7 files changed, 154 insertions(+), 25 deletions(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 7dc823b56ca4..c86b56c95ea6 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -219,6 +219,23 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
used.
+Userfaultfd write-protect mode currently behave differently on none ptes
+(when e.g. page is missing) over different types of memories.
+
+For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
+(e.g. when pages are missing and not populated). For file-backed memories
+like shmem and hugetlbfs, none ptes will be write protected just like a
+present pte. In other words, there will be a userfaultfd write fault
+message generated when writting to a missing page on file typed memories,

s/writting/writing/

+as long as the page range was write-protected before. Such a message will
+not be generated on anonymous memories by default.
+
+If the application wants to be able to write protect none ptes on anonymous
+memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On
+newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED
+and set the feature bit in advance to make sure none ptes will also be
+write protected even upon anonymous memory.
+

[...]

/*
* A number of key systems in x86 including ioremap() rely on the assumption
@@ -1350,6 +1364,10 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
unsigned long addr, pte_t *pte,
struct zap_details *details, pte_t pteval)
{
+ /* Zap on anonymous always means dropping everything */
+ if (vma_is_anonymous(vma))
+ return;
+
if (zap_drop_file_uffd_wp(details))
return;
@@ -1456,8 +1474,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
rss[mm_counter(page)]--;
} else if (pte_marker_entry_uffd_wp(entry)) {
- /* Only drop the uffd-wp marker if explicitly requested */
- if (!zap_drop_file_uffd_wp(details))
+ /*
+ * For anon: always drop the marker; for file: only
+ * drop the marker if explicitly requested.
+ */

So MADV_DONTNEED a pte marker in an anonymous VMA will always remove that marker. Is that the same handling as for MADV_DONTNEED on shmem or on fallocate(PUNCHHOLE) on shmem?

+ if (!vma_is_anonymous(vma) &&
+ !zap_drop_file_uffd_wp(details))
continue;

Maybe it would be nicer to have a zap_drop_uffd_wp_marker(vma, details) and have the comment in there. Especially because of the other hunk above.

So zap_drop_file_uffd_wp(details) -> zap_drop_uffd_wp_marker(vma, details) and move the anon handling + comment in there.


} else if (is_hwpoison_entry(entry) ||
is_swapin_error_entry(entry)) {
@@ -3624,6 +3646,14 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
return 0;
}
+static vm_fault_t do_pte_missing(struct vm_fault *vmf)
+{
+ if (vma_is_anonymous(vmf->vma))
+ return do_anonymous_page(vmf);
+ else
+ return do_fault(vmf);

No need for the "else" statement.

+}
+
/*
* This is actually a page-missing access, but with uffd-wp special pte
* installed. It means this pte was wr-protected before being unmapped.
@@ -3634,11 +3664,10 @@ static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
* Just in case there're leftover special ptes even after the region
* got unregistered - we can simply clear them.
*/
- if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+ if (unlikely(!userfaultfd_wp(vmf->vma)))
return pte_marker_clear(vmf);
- /* do_fault() can handle pte markers too like none pte */
- return do_fault(vmf);
+ return do_pte_missing(vmf);
}

[...]

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 231929f119d9..455f7051098f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -276,7 +276,15 @@ static long change_pte_range(struct mmu_gather *tlb,
} else {
/* It must be an none page, or what else?.. */
WARN_ON_ONCE(!pte_none(oldpte));
- if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+
+ /*
+ * Nobody plays with any none ptes besides
+ * userfaultfd when applying the protections.
+ */
+ if (likely(!uffd_wp))
+ continue;
+
+ if (userfaultfd_wp_use_markers(vma)) {
/*
* For file-backed mem, we need to be able to
* wr-protect a none pte, because even if the
@@ -320,23 +328,46 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 0;
}
-/* Return true if we're uffd wr-protecting file-backed memory, or false */
+/*
+ * Return true if we want to split huge thps in change protection

"huge thps" sounds redundant. "if we want to PTE-map a huge PMD" ?

+ * procedure, false otherwise.


In general,

Acked-by: David Hildenbrand <david@xxxxxxxxxx>

--
Thanks,

David / dhildenb