Re: [PATCH v7 07/12] khugepaged: add mTHP support

From: Dev Jain
Date: Sat Jun 07 2025 - 10:32:48 EST



On 07/06/25 6:33 pm, Nico Pache wrote:
On Sat, Jun 7, 2025 at 12:24 AM Dev Jain <dev.jain@xxxxxxx> wrote:

On 15/05/25 8:52 am, Nico Pache wrote:
Introduce the ability for khugepaged to collapse to different mTHP sizes.
While scanning PMD ranges for potential collapse candidates, keep track
of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
mTHPs are enabled we remove the restriction of max_ptes_none during the
scan phase so we dont bailout early and miss potential mTHP candidates.

After the scan is complete we will perform binary recursion on the
bitmap to determine which mTHP size would be most efficient to collapse
to. max_ptes_none will be scaled by the attempted collapse order to
determine how full a THP must be to be eligible.

If a mTHP collapse is attempted, but contains swapped out, or shared
pages, we dont perform the collapse.

For non PMD collapse we much leave the anon VMA write locked until after
we collapse the mTHP
Why? I know that Hugh pointed out locking errors; I am yet to catch up
on that thread, but you need to explain in the description why you do
what you do.

[--snip---]

-
- spin_lock(pmd_ptl);
- BUG_ON(!pmd_none(*pmd));
- folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
- folio_add_lru_vma(folio, vma);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- set_pmd_at(mm, address, pmd, _pmd);
- update_mmu_cache_pmd(vma, address, pmd);
- deferred_split_folio(folio, false);
- spin_unlock(pmd_ptl);
+ if (order == HPAGE_PMD_ORDER) {
+ pgtable = pmd_pgtable(_pmd);
+ _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+ spin_lock(pmd_ptl);
+ BUG_ON(!pmd_none(*pmd));
+ folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+ folio_add_lru_vma(folio, vma);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ set_pmd_at(mm, address, pmd, _pmd);
+ update_mmu_cache_pmd(vma, address, pmd);
+ deferred_split_folio(folio, false);
+ spin_unlock(pmd_ptl);
+ } else { /* mTHP collapse */
+ mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
+ mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+ spin_lock(pmd_ptl);
Nico,

I've noticed a few occasions where my review comments have not been acknowledged -
for example, [1]. It makes it difficult to follow up and contributes to some
frustration on my end. I'd appreciate if you could make sure to respond to
feedback, even if you are disagreeing with my comments. Thanks!
I'm sorry you feel that way, are there any others? I feel like I've
been pretty good at responding to all comments. I've also been out of
the office for the last month, so keeping up with upstream has been
more difficult, but i'm back now.

No issues, there were others but I don't want to waste our time digging
them up, when we are on the same page!


Sorry I never got back to you on that one! I will add the BUG_ON, but
I believe it's unnecessary. Your changeset was focused on different
functionality and it seems that you had a bug in it if you were
hitting that often.

In my original reply, when I said "I hit the BUG_ON a lot of times",
I meant, during testing. It was quite difficult to extend for non-PMD
sized VMAs, and the BUG_ON was getting hit due to rmap reaching the
non-isolated folios and somehow installing the PMD again. That is
why I say that the BUG_ON is important since it will help us catch
bugs early. And we have that for the PMD case anyways so why not
for mTHP...


Cheers,
-- Nico

[1] https://lore.kernel.org/all/08d13445-5ed1-42ea-8aee-c1dbde24407e@xxxxxxx/


[---snip---]