Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE

From: Andrea Arcangeli
Date: Tue May 25 2004 - 16:28:36 EST


On Tue, May 25, 2004 at 07:48:24AM -0700, Linus Torvalds wrote:
> We'd just pass in a flag to "ptep_establish()" to tell it whether we
> changed the dirty bit or not. It would be "write_access" in
> handle_pte_fault(), and 1 in the other two cases.

I want to show you another approach. It effectively makes the end of
handle_pte_fault a noop for x86 and x86-64, and it implements two
operations, ptep_handle_[young|dirty]_page_fault(ptep), that allow
architectures with page-faulting dirty and/or young bits to resolve the
page fault inside their own architecture code (since the alpha
architecture generates a page fault for both, there's no need for atomic
instructions anywhere). The only tricky case is an architecture that
generates a page fault for the young bit but not for the dirty bit. In
that case it would need to use something like this:

entry = ptep_get_and_clear(pte);
set_pte(pte, pte_modify(entry, newprot));

Again, no atomic instructions are needed. With an atomic instruction
like a set_mask64(__DIRty_BITS, &ptep->pte) such an architecture might
run faster than the above, but the above is guaranteed to work as long
as mprotect works too.
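
For illustration only, here's a minimal sketch of what such a set_mask64
could look like, assuming the architecture provides a cmpxchg on
unsigned longs; the name, signature and compare-and-swap fallback are my
assumptions, not part of the patch:

/*
 * Hypothetical sketch, not part of the patch: an atomic "or" on the
 * pte word, built on cmpxchg, for an architecture where the young bit
 * page-faults but the dirty bit is set transparently by hardware.
 */
static inline void set_mask64(unsigned long mask, unsigned long *word)
{
	unsigned long old;

	do {
		old = *word;
		/* retry if the hardware changed the pte under us */
	} while (cmpxchg(word, old, old | mask) != old);
}

With something along those lines the dirty page fault handler could
become a single atomic "or", without clearing the pte first.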

I believe using ptep_establish in handle_mm_fault makes little sense;
to take the most obvious example, there's no need to flush the tlb in
handle_mm_fault to resolve young or dirty page faults. That's not a big
deal for x86 and x86-64, which reach that path only if a race happens,
but on alpha we shouldn't flush the tlb at all. If some weird
architecture really needs to flush the tlb, it still can do so inside
ptep_handle_[young|dirty]_page_fault.
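
Purely to illustrate that last point (this is not in the patch, and the
flush call is only a placeholder I picked for the example), such an
architecture could define something like:

/*
 * Hypothetical sketch for an architecture that cannot rely on the cpu
 * re-reading the pte on its own: do the flush from inside the helper.
 * The interface only passes the ptep, so a targeted flush_tlb_page()
 * isn't available here; a broader flush stands in as a placeholder.
 */
static inline void ptep_handle_dirty_page_fault(pte_t *ptep)
{
	set_pte(ptep, pte_mkdirty(*ptep));
	flush_tlb_all();	/* placeholder for the arch's own flush */
}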

As for set_pte, there's quite a lot of legitimate code using it on
present ptes, even non-dirty ones (see zap_pte_* too).

The last issue is ptep_establish: in do_wp_page we end up flushing the
pte again inside ptep_establish for no good reason. Those spurious tlb
flushes may even trigger IPIs (this time on x86 smp too, even with
processes), so I'd really like to remove the explicit flush in
do_wp_page. However, that will likely break s390, and since I don't
understand s390 I'll leave it broken for now (at least to show you this
alternative and to hear comments on whether it's as broken as the
previous one).

My basic point for going in this direction is that it makes little
sense to me to use a tlb-flushing ptep_establish in both handle_mm_fault
and do_wp_page when no tlb flush is required on x86, x86-64 and alpha,
so I do see a window for improving performance. Yes, I'm assuming all
the cpus are smart enough to automatically re-read the pte from memory
if the information they have in the tlb would require a page fault. I
also added a trap for non-dirty ptes in ptep_establish, but it's quite
superfluous since there's only one ptep_establish caller left.

The patch works for me on x86, and it should work on alpha too this time ;).

The really scary thing about this patch is the s390 ptep_establish.

Index: linux-2.5/include/asm-alpha/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-alpha/pgtable.h,v
retrieving revision 1.22
diff -u -p -r1.22 pgtable.h
--- linux-2.5/include/asm-alpha/pgtable.h 23 May 2004 05:02:25 -0000 1.22
+++ linux-2.5/include/asm-alpha/pgtable.h 25 May 2004 20:34:11 -0000
@@ -267,6 +267,28 @@ extern inline pte_t pte_mkexec(pte_t pte
extern inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= __DIRTY_BITS; return pte; }
extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; }

+static inline void ptep_handle_young_page_fault(pte_t *ptep)
+{
+ /*
+ * WARNING: this is safe only because the dirty bit
+ * cannot change transparently in hardware in the pte.
+ * If the dirty bit in the pte could be set transparently
+ * by the CPU, this should be implemented with set_bit()
+ * or with a new set_mask64() implementing an atomic "or".
+ */
+ set_pte(ptep, pte_mkyoung(*ptep));
+}
+
+static inline void ptep_handle_dirty_page_fault(pte_t *ptep)
+{
+ /*
+ * This can run without atomic instructions even if
+ * the young bit is set in hardware by the architecture.
+ * Losing the young bit is not important.
+ */
+ set_pte(ptep, pte_mkdirty(*ptep));
+}
+
#define PAGE_DIR_OFFSET(tsk,address) pgd_offset((tsk),(address))

/* to find an entry in a kernel page-table-directory */
Index: linux-2.5/include/asm-generic/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-generic/pgtable.h,v
retrieving revision 1.4
diff -u -p -r1.4 pgtable.h
--- linux-2.5/include/asm-generic/pgtable.h 19 Jan 2004 18:43:03 -0000 1.4
+++ linux-2.5/include/asm-generic/pgtable.h 25 May 2004 20:38:39 -0000
@@ -12,6 +12,7 @@
*/
#define ptep_establish(__vma, __address, __ptep, __entry) \
do { \
+ BUG_ON(!pte_dirty(__entry)); \
set_pte(__ptep, __entry); \
flush_tlb_page(__vma, __address); \
} while (0)
Index: linux-2.5/include/asm-i386/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-i386/pgtable.h,v
retrieving revision 1.42
diff -u -p -r1.42 pgtable.h
--- linux-2.5/include/asm-i386/pgtable.h 23 May 2004 05:02:25 -0000 1.42
+++ linux-2.5/include/asm-i386/pgtable.h 25 May 2004 20:27:00 -0000
@@ -220,7 +220,19 @@ static inline pte_t pte_mkdirty(pte_t pt
static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; }

+/*
+ * This is used only to set the dirty bit during a "dirty" page fault.
+ * Nothing to do on x86 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_dirty_page_fault(ptep) do { } while (0)
static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
+/*
+ * This is used only to set the young bit during a "young" page fault.
+ * Nothing to do on x86 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_young_page_fault(ptep) do { } while (0)
static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low); }
static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, &ptep->pte_low); }
static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
Index: linux-2.5/include/asm-x86_64/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-x86_64/pgtable.h,v
retrieving revision 1.27
diff -u -p -r1.27 pgtable.h
--- linux-2.5/include/asm-x86_64/pgtable.h 23 May 2004 05:02:25 -0000 1.27
+++ linux-2.5/include/asm-x86_64/pgtable.h 25 May 2004 20:27:28 -0000
@@ -262,7 +262,19 @@ extern inline pte_t pte_mkexec(pte_t pte
extern inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; }
extern inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; }
extern inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
+/*
+ * This is used only to set the dirty bit during a "dirty" page fault.
+ * Nothing to do on x86-64 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_dirty_page_fault(ptep) do { } while (0)
static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep); }
+/*
+ * This is used only to set the young bit during a "young" page fault.
+ * Nothing to do on x86-64 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_young_page_fault(ptep) do { } while (0)
static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep); }
static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, ptep); }
static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, ptep); }
Index: linux-2.5/mm/memory.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v
retrieving revision 1.169
diff -u -p -r1.169 memory.c
--- linux-2.5/mm/memory.c 23 May 2004 05:10:11 -0000 1.169
+++ linux-2.5/mm/memory.c 25 May 2004 20:28:23 -0000
@@ -1643,10 +1643,9 @@ static inline int handle_pte_fault(struc
if (!pte_write(entry))
return do_wp_page(mm, vma, address, pte, pmd, entry);

- entry = pte_mkdirty(entry);
+ ptep_handle_dirty_page_fault(pte);
}
- entry = pte_mkyoung(entry);
- ptep_establish(vma, address, pte, entry);
+ ptep_handle_young_page_fault(pte);
update_mmu_cache(vma, address, entry);
pte_unmap(pte);
spin_unlock(&mm->page_table_lock);
Signed-off-by: Andrea Arcangeli <andrea@xxxxxxx>