[PATCH] mmu_notifier: fix race in set_pte_at_notify usage

From: Andrea Arcangeli
Date: Tue Aug 21 2012 - 10:48:11 EST


Whenever we establish a writable spte with set_pte_at_notify the
ptep_clear_flush before it must be a _notify one that clears the spte
too.

The fundamental problem is that if the primary MMU that writes to the
"newpage" won't fault again if the pte established by
set_pte_at_notify is writable. And so it will fail to serialize
against the PT lock to wait the set_pte_at_notify to finish
updating all secondary MMUs before the write hits the newpage.

CPU0 CPU1
oldpage[1] == 0 (all MMUs)
oldpage[0] = 1
trigger do_wp_page
take PT lock
ptep_clear_flush (secondary MMUs
still have read access to oldpage)
mmu_notifier_change_pte
pte = newpage + writable (primary MMU can write to
newpage)
host write newpage[1] == 1 (no fault,
failed to serialize against PT lock)
vmenter
guest read oldpage[1] == 0
spte = newpage + writable (too late)

It's safe to use set_pte_at_notify with a ptep_clear_flush (_notify
not) only if we establish a readonly pte with it (like KSM does)
because in that case the write done by the primary MMU will fault and
serialize against the PT lock.

set_pte_at_notify is still worth to use even if we have to do
ptep_clear_flush_notify before it, because it will still avoid the
secondary MMU to trigger secondary MMU page faults to access the new
page (if it has sptes and it's not only a TLB with a TLB miss
implemented by follow_page).

Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
---
include/linux/mmu_notifier.h | 7 +++++++
mm/memory.c | 2 +-
2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index ee2baf0..cce4e4f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -75,6 +75,13 @@ struct mmu_notifier_ops {
/*
* change_pte is called in cases that pte mapping to page is changed:
* for example, when ksm remaps pte to point to a new shared page.
+ *
+ * NOTE: If this method is used to setup a writable pte, it
+ * must be preceded by a secondary MMU invalidate before the
+ * pte is established in the primary MMU. That is required to
+ * avoid the old page won't be still be readable by the
+ * secondary MMUs after the primary MMU gains write access to
+ * the newpage.
*/
void (*change_pte)(struct mmu_notifier *mn,
struct mm_struct *mm,
diff --git a/mm/memory.c b/mm/memory.c
index ec12fc9..88749f3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2720,7 +2720,7 @@ gotten:
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush(vma, address, page_table);
+ ptep_clear_flush_notify(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
/*
* We call the notify macro here because, when using secondary
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/