[PATCH] thp: fix leak due split_huge_page() vs. exit race

From: Kirill A. Shutemov
Date: Wed Nov 11 2015 - 05:09:38 EST


Consider following race:

CPU0 CPU1
shrink_page_list()
add_to_swap()
split_huge_page_to_list()
__split_huge_pmd_locked()
pmdp_huge_clear_flush_notify()
// pmd_none() == true
exit_mmap()
unmap_vmas()
zap_pmd_range()
// no action on pmd since pmd_none() == true
pmd_populate()

As result the THP will not be freed. The leak is detected by check_mm():

BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512

The patch restore the logic original split_huge_page() had before
refcounting rework: never have intermediate pmd_none() == true.

There are few other places where we do have pmd_none() == true for some
time, but they are safe:

- __split_huge_zero_page_pmd() is not reachable during exit, since huge
zero page is not on LRU.

- do_huge_pmd_wp_page() and do_huge_pmd_wp_page_fallback() are also not
reachable during exit: exit_mmap() and handling page fault for the mm
are mutual exclusive.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
---
mm/huge_memory.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6834b39a7114..91e2f4b7ca39 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2809,9 +2809,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
write = pmd_write(*pmd);
young = pmd_young(*pmd);

- /* leave pmd empty until pte is filled */
- pmdp_huge_clear_flush_notify(vma, haddr, pmd);
-
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);

@@ -2861,6 +2858,28 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}

smp_wmb(); /* make pte visible before pmd */
+ /*
+ * Up to this point the pmd is present and huge and userland has the
+ * whole access to the hugepage during the split (which happens in
+ * place). If we overwrite the pmd with the not-huge version pointing
+ * to the pte here (which of course we could if all CPUs were bug
+ * free), userland could trigger a small page size TLB miss on the
+ * small sized TLB while the hugepage TLB entry is still established in
+ * the huge TLB. Some CPU doesn't like that.
+ * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum
+ * 383 on page 93. Intel should be safe but is also warns that it's
+ * only safe if the permission and cache attributes of the two entries
+ * loaded in the two TLB is identical (which should be the case here).
+ * But it is generally safer to never allow small and huge TLB entries
+ * for the same virtual address to be loaded simultaneously. So instead
+ * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
+ * current pmd notpresent (atomically because here the pmd_trans_huge
+ * and pmd_trans_splitting must remain set at all times on the pmd
+ * until the split is complete for this pmd), then we flush the SMP TLB
+ * and finally we write the non-huge version of the pmd entry with
+ * pmd_populate.
+ */
+ pmdp_invalidate(vma, haddr, pmd);
pmd_populate(mm, pmd, pgtable);

if (freeze) {
--
2.6.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/