Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core

From: Shan Hai
Date: Mon Jul 18 2011 - 02:45:50 EST

On 07/18/2011 12:01 PM, Benjamin Herrenschmidt wrote:
On Mon, 2011-07-18 at 09:14 +1000, Benjamin Herrenschmidt wrote:

In fact, with such a flag, we could probably avoid the ifdef entirely, and
always go toward the PTE fixup path when called in such a fixup case, my gut
feeling is that this is going to be seldom enough not to hurt x86 measurably
but we'll have to try it out.

That leads to that even less tested patch:
And here's a version that builds and fixes a bug or two
(still not tested :-)

Shan, can you verify whether that fixes the problem for you ?

It could not fix the problem, refer the following reply for
the reasons.

I also had a cursory glance at the ARM code and it seems to rely on the
same stuff as embedded powerpc does for dirty/young updates, so in
theory it should exhibit the same problem.

I suspect the scenario is rare enough in practice in embedded workloads
that nobody noticed until now.


mm/futex: Fix use of gup() to "fixup" failing atomic user accesses

The futex code uses atomic (page fault disabled) accesses to user space,
and when they fail, uses get_user_pages() to "fixup" the PTE and try again.

However, on arch with SW tracking of the dirty and young bits, this will
not work properly as neither of the above will perform the necessary fixup
of those bits.

There's also a possible corner cases with archs who rely on
handle_pte_fault() to invalidate the TLB for "spurrious" faults (though
I don't know which arch actually needs that). Those would break the
same way.

This fixes it by factoring out the fixup code from handle_pte_fault() into
a separate function, and use it from within gup as well, whenever the
FOLL_FIXFAULT flag has been passed to it. The futex code is modified to
pass that flag.

This doesn't change the "normal" gup case (and thus avoids the overhead
of doing that tracking)

Signed-off-by: Benjamin Herrenschmidt<benh@xxxxxxxxxxxxxxxxxxx>

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..8a76694 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1546,6 +1546,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
#define FOLL_MLOCK 0x40 /* mark page as mlocked */
#define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
+#define FOLL_FIXFAULT 0x200 /* fixup after a fault (PTE dirty/young upd) */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..02adff7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
int ret;

- ret = get_user_pages(current, mm, (unsigned long)uaddr,
- 1, 1, 0, NULL, NULL);
+ ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,

the FOLL_FIXFAULT is filtered out at the following code
if (write)
flags |= FOLL_WRITE;


return ret< 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..3c4d502 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1419,6 +1419,29 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,

+static void handle_pte_sw_young_dirty(struct vm_area_struct *vma,
+ unsigned long address,
+ pte_t *ptep, int write)
+ pte_t entry = *ptep;
+ if (write)
+ pte_mkdirty(entry);
+ entry = pte_mkyoung(entry);
+ if (ptep_set_access_flags(vma, address, ptep, entry, write)) {
+ update_mmu_cache(vma, address, ptep);
+ } else {
+ /*
+ * This is needed only for protection faults but the arch code
+ * is not yet telling us if this is a protection fault or not.
+ * This still avoids useless tlb flushes for .text page faults
+ * with threads.
+ */
+ if (write)
+ flush_tlb_fix_spurious_fault(vma, address);
+ }
* follow_page - look up a page descriptor from a user-virtual address
* @vma: vm_area_struct mapping @address
@@ -1514,6 +1537,10 @@ split_fallthrough:

if (flags& FOLL_GET)
+ if (flags& FOLL_FIXFAULT)
+ handle_pte_sw_young_dirty(vma, address, ptep,
+ flags& FOLL_WRITE);
if (flags& FOLL_TOUCH) {
if ((flags& FOLL_WRITE)&&
!pte_dirty(pte)&& !PageDirty(page))

call handle_pte_sw_young_dirty before !pte_dirty(pte)
might has problems.

@@ -1525,6 +1552,7 @@ split_fallthrough:
if ((flags& FOLL_MLOCK)&& (vma->vm_flags& VM_LOCKED)) {
* The preliminary mapping check is mainly to avoid the
@@ -3358,21 +3386,8 @@ int handle_pte_fault(struct mm_struct *mm,
if (!pte_write(entry))
return do_wp_page(mm, vma, address,
pte, pmd, ptl, entry);
- entry = pte_mkdirty(entry);
- }
- entry = pte_mkyoung(entry);
- if (ptep_set_access_flags(vma, address, pte, entry, flags& FAULT_FLAG_WRITE)) {
- update_mmu_cache(vma, address, pte);
- } else {
- /*
- * This is needed only for protection faults but the arch code
- * is not yet telling us if this is a protection fault or not.
- * This still avoids useless tlb flushes for .text page faults
- * with threads.
- */
- if (flags& FAULT_FLAG_WRITE)
- flush_tlb_fix_spurious_fault(vma, address);
+ handle_pte_sw_young_dirty(vma, address, pte, flags& FAULT_FLAG_WRITE);
pte_unmap_unlock(pte, ptl);
return 0;

So what about the following?
diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..fb48122 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1442,6 +1442,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsig
spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
+ int fix_write_permission = false;

page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
if (!IS_ERR(page)) {
@@ -1519,6 +1520,11 @@ split_fallthrough:
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
+ if ((flags & FOLL_WRITE) && !pte_dirty(pte))
+ fix_write_permission = true;
* pte_mkyoung() would be more correct here, but atomic care
* is needed to avoid losing the dirty bit: it is easier to use
@@ -1551,7 +1557,7 @@ split_fallthrough:
pte_unmap_unlock(ptep, ptl);
- return page;
+ return (fix_write_permission == true) ? NULL: page;

pte_unmap_unlock(ptep, ptl);

(flags & FOLL_WRITE) && !pte_dirty(pte) the follow_page()
could figure out that the caller want to write to the
(present && writable && non-dirty) pte, and the architecture
want to fixup the problem by indicating CONFIG_FIXUP_WRITE_PERMISSION,
so let the follow_page() return NULL to the __get_user_pages, and
let the handle_mm_fault to fixup dirty/young tracking.

Checking the following code we can conclude that the handle_mm_fault
is ready to handle the faults and the write permission violation is
a kind of fault, so why don't we let the handle_mm_fault to
handle that fault?

while (!(page = follow_page(vma, start, foll_flags))) {
ret = handle_mm_fault(mm, vma, start,

Shan Hai
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at