[PATCH 4.9 060/148] thp: fix MADV_DONTNEED vs. numa balancing race

From: Greg Kroah-Hartman
Date: Tue Dec 12 2017 - 07:57:36 EST


4.9-stable review patch. If anyone has any objections, please let me know.

------------------

From: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>

commit ced108037c2aa542b3ed8b7afd1576064ad1362a upstream.

In case prot_numa, we are under down_read(mmap_sem). It's critical to
not clear pmd intermittently to avoid race with MADV_DONTNEED which is
also under down_read(mmap_sem):

CPU0: CPU1:
change_huge_pmd(prot_numa=1)
pmdp_huge_get_and_clear_notify()
madvise_dontneed()
zap_pmd_range()
pmd_trans_huge(*pmd) == 0 (without ptl)
// skip the pmd
set_pmd_at();
// pmd is re-established

The race makes MADV_DONTNEED miss the huge pmd and don't clear it
which may break userspace.

Found by code analysis, never saw triggered.

Link: http://lkml.kernel.org/r/20170302151034.27829-3-kirill.shutemov@xxxxxxxxxxxxxxx
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Hillf Danton <hillf.zj@xxxxxxxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
[jwang: adjust context for 4.9 ]
Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
---
mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1531,7 +1531,39 @@ int change_huge_pmd(struct vm_area_struc
if (prot_numa && pmd_protnone(*pmd))
goto unlock;

- entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
+ /*
+ * In case prot_numa, we are under down_read(mmap_sem). It's critical
+ * to not clear pmd intermittently to avoid race with MADV_DONTNEED
+ * which is also under down_read(mmap_sem):
+ *
+ * CPU0: CPU1:
+ * change_huge_pmd(prot_numa=1)
+ * pmdp_huge_get_and_clear_notify()
+ * madvise_dontneed()
+ * zap_pmd_range()
+ * pmd_trans_huge(*pmd) == 0 (without ptl)
+ * // skip the pmd
+ * set_pmd_at();
+ * // pmd is re-established
+ *
+ * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
+ * which may break userspace.
+ *
+ * pmdp_invalidate() is required to make sure we don't miss
+ * dirty/young flags set by hardware.
+ */
+ entry = *pmd;
+ pmdp_invalidate(vma, addr, pmd);
+
+ /*
+ * Recover dirty/young flags. It relies on pmdp_invalidate to not
+ * corrupt them.
+ */
+ if (pmd_dirty(*pmd))
+ entry = pmd_mkdirty(entry);
+ if (pmd_young(*pmd))
+ entry = pmd_mkyoung(entry);
+
entry = pmd_modify(entry, newprot);
if (preserve_write)
entry = pmd_mkwrite(entry);