Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW trackingof dirty & young

From: Shan Hai
Date: Tue Jul 19 2011 - 04:22:14 EST


On 07/19/2011 03:46 PM, Benjamin Herrenschmidt wrote:
On Tue, 2011-07-19 at 13:38 +0800, Shan Hai wrote:

What you said is another path, that is futex_wake_op(),
but what about futex_lock_pi in which my test case failed?
your patch will call handle_mm_fault on every futex contention
in the futex_lock_pi path.

futex_lock_pi()
ret = futex_lock_pi_atomic(uaddr, hb,&q.key,&q.pi_state, current, 0);
case -EFAULT:
goto uaddr_faulted;

...
uaddr_faulted:
ret = fault_in_user_writeable(uaddr);
Euh ... and how do we get to uaddr_faulted ? You may have missed the
return statement right before it :-)

From what I can tell we only get there as a result of -EFAULT from
futex_lock_pi_atomic() which is exactly the case we are trying to
cover.


Got it, if the fault_in_user_writeable() is designed to catch the
exact same write permission fault problem we discuss here, so
your patch fixed that very nicely, we should fixup it by directly
calling handle_mm_fault like what you did because we are for sure
to know what just happened(permission violation), its not necessary
to check what's happened by calling gup-->follow_page, and
further the follow_page failed to report the fault :-)

Thanks
Shan Hai

.../...
"[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core"?
(I will fix the stupid errors in my original patch if the concept
is acceptable)
in this way we could decrease the overhead of handle_mm_fault
in the path which does not need write permission fixup.
Which overhead ? gup does handle_mm_fault() as well if needed.
it does it *if needed*, and this requirement is rare in my opinion.
And how does gup figure out of it's needed ? By walking down the page
tables in follow_page... what does handle_mm_fault do ? walk down the
page tables...

The main (if not the only) relevant difference here, is going to be the
spurious fault TLB invaliate for writes ... which is a nop on x86....
and needed in all the cases we care about (and if it's not needed, then
it's up to the arch to nop it out, we should probably do it on powerpc
too ... but that's un unrelated discussion).

Cheers,
Ben.

Thanks
Shan Hai

What I do is I replace what is arguably an abuse of gup() in the case
where a fixup -is- needed with a dedicated function designed to perform
the said fixup ... and do it properly which gup() didn't :-)

Cheers,
Ben.

Thanks
Shan Hai
Cheers,
Ben.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..1036614 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
struct page *get_dump_page(unsigned long addr);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long address, unsigned int fault_flags);

extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
extern void do_invalidatepage(struct page *page, unsigned long offset);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..7a0a4ed 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
int ret;

down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, (unsigned long)uaddr,
- 1, 1, 0, NULL, NULL);
+ ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
+ FAULT_FLAG_WRITE);
up_read(&mm->mmap_sem);

return ret< 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..b967fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,64 @@ next_page:
}
EXPORT_SYMBOL(__get_user_pages);

-/**
+/*
+ * fixup_user_fault() - manually resolve a user page fault
+ * @tsk: the task_struct to use for page fault accounting, or
+ * NULL if faults are not to be recorded.
+ * @mm: mm_struct of target mm
+ * @address: user address
+ * @fault_flags:flags to pass down to handle_mm_fault()
+ *
+ * This is meant to be called in the specific scenario where for
+ * locking reasons we try to access user memory in atomic context
+ * (within a pagefault_disable() section), this returns -EFAULT,
+ * and we want to resolve the user fault before trying again.
+ *
+ * Typically this is meant to be used by the futex code.
+ *
+ * The main difference with get_user_pages() is that this function
+ * will unconditionally call handle_mm_fault() which will in turn
+ * perform all the necessary SW fixup of the dirty and young bits
+ * in the PTE, while handle_mm_fault() only guarantees to update
+ * these in the struct page.
+ *
+ * This is important for some architectures where those bits also
+ * gate the access permission to the page because their are
+ * maintained in software. On such architecture, gup() will not
+ * be enough to make a subsequent access succeed.
+ *
+ * This should be called with the mm_sem held for read.
+ */
+int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long address, unsigned int fault_flags)
+{
+ struct vm_area_struct *vma;
+ int ret;
+
+ vma = find_extend_vma(mm, address);
+ if (!vma || address< vma->vm_start)
+ return -EFAULT;
+
+ ret = handle_mm_fault(mm, vma, address, fault_flags);
+ if (ret& VM_FAULT_ERROR) {
+ if (ret& VM_FAULT_OOM)
+ return -ENOMEM;
+ if (ret& (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+ return -EHWPOISON;
+ if (ret& VM_FAULT_SIGBUS)
+ return -EFAULT;
+ BUG();
+ }
+ if (tsk) {
+ if (ret& VM_FAULT_MAJOR)
+ tsk->maj_flt++;
+ else
+ tsk->min_flt++;
+ }
+ return 0;
+}
+
+/*
* get_user_pages() - pin user pages in memory
* @tsk: the task_struct to use for page fault accounting, or
* NULL if faults are not to be recorded.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/