Re: [PATCH] mm/khugepaged: fix collapse_pte_mapped_thp() to allow anon_vma

From: David Hildenbrand
Date: Mon Jan 09 2023 - 04:00:11 EST



Side note: set_huge_pmd() wins the award of "ugliest mm function of early
2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be
writable or not. Turns out it's communicated via vm_fault->flags. Just
horrible.

My first Linux award! :) At least it's not "worst mm security issue of
early 2023". I'll take it!

Good that you're not taking my words the wrong way.

MADV_COLLAPSE is a very useful feature (especially for THP tests [1]).
I wish I could have looked at some of the patches earlier. But we cannot
wait forever to get something merged; otherwise we'd never get bigger
changes upstream.

... so there is plenty of time left in 2023 to clean up khugepaged.c :P


[1] https://lkml.kernel.org/r/20230104144905.460075-1-david@xxxxxxxxxx

[...]


For example: why even *care* about the complexity of installing a PMD in
collapse_pte_mapped_thp() using set_huge_pmd() just for MADV_COLLAPSE?

Sure, we avoid a single page fault afterwards, but is this *really*
worth the extra code here? I mean, after we installed the PMD, the page
could just get reclaimed either way, so there is no guarantee that we
have a PMD mapped once we return to user space IIUC.

A valid question. The first reason is just semantic symmetry for
MADV_COLLAPSE called on anon vs file/shmem memory. It would be nice to
say that "on success, the memory range provided will be backed by
PMD-mapped hugepages", rather than special-casing file/shmem.

But there will never be such a guarantee, right? We could even see a
split just before we return to user space IIRC.


The second reason has a more practical use case. In userfaultfd-based
live migration (using UFFDIO_REGISTER_MODE_MINOR) pages are migrated
at 4KiB granularity, and it may take a long time (O(many minutes)) for
transfer of all pages to complete. To avoid severe performance
degradation on the target guest, the vmm wants to MADV_COLLAPSE
hugepage-sized regions as they fill up. Since the guest memory is
still uffd-registered, requiring refault post-MADV_COLLAPSE won't
work, since the uffd machinery will intercept the fault, and no PMD
will be mapped. As such, either uffd needs to be taught to install PMD
mappings, or the PMD mapping already must be in-place.

That's an interesting point, thanks. I assume we'd get another minor
fault and, when resolving that, we'd default to a PTE mapping.
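
For illustration only, here is a rough userspace sketch of the VMM-side
step described above: once a hugepage-sized region has been fully
populated, ask the kernel to collapse it. It is not taken from any real
VMM. It assumes 2 MiB PMD-sized hugepages (x86-64 with 4 KiB base pages)
and a kernel that supports MADV_COLLAPSE (Linux 6.1 or later);
hpage_align_down() and collapse_filled_region() are hypothetical names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the Linux uapi headers */
#endif

/* Assumption: 2 MiB PMD-sized hugepages. */
#define HPAGE_SIZE ((size_t)2 << 20)

/* Hypothetical helper: start of the hugepage-sized region containing addr. */
static uintptr_t hpage_align_down(uintptr_t addr)
{
	return addr & ~((uintptr_t)HPAGE_SIZE - 1);
}

/*
 * Hypothetical helper: called once every 4 KiB page of the region
 * containing addr has been filled in. Returns 0 on success or -errno
 * on failure (e.g. -EINVAL on kernels without MADV_COLLAPSE).
 */
static int collapse_filled_region(void *addr)
{
	void *start = (void *)hpage_align_down((uintptr_t)addr);

	if (madvise(start, HPAGE_SIZE, MADV_COLLAPSE) == 0)
		return 0;
	return -errno;
}
```

As discussed above, a zero return only says the collapse succeeded at
that moment; the range may be split or reclaimed again before user space
touches it.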

--
Thanks,

David / dhildenb