Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

From: Shan Hai
Date: Fri Jul 15 2011 - 05:06:37 EST


On 07/15/2011 04:44 PM, Peter Zijlstra wrote:
On Fri, 2011-07-15 at 16:38 +0800, MailingLists wrote:
On 07/15/2011 04:20 PM, Peter Zijlstra wrote:
On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
The following test case could reveal a bug in the futex_lock_pi()

BUG: On FUTEX_LOCK_PI, there is an infinite loop in futex_lock_pi()
on the PowerPC e500 core.
Cause: The Linux kernel on the e500 core has no write permission on
the COW page; refer to the head comment of the following test code.

ftrace on test case:
[000] 353.990181: futex_lock_pi_atomic<-futex_lock_pi
[000] 353.990185: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
[snip]
[000] 353.990191: do_page_fault<-handle_page_fault
[000] 353.990192: bad_page_fault<-handle_page_fault
[000] 353.990193: search_exception_tables<-bad_page_fault
[snip]
[000] 353.990199: get_user_pages<-fault_in_user_writeable
[snip]
[000] 353.990208: mark_page_accessed<-follow_page
[000] 353.990222: futex_lock_pi_atomic<-futex_lock_pi
[snip]
[000] 353.990230: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
[ a loop occurs here ]

But but but but, that get_user_pages(.write=1, .force=0) should result
in a COW break, getting our own writable page.

What is this e500 thing smoking that this doesn't work?
On the e500 a page can be made read-only for the kernel (supervisor in
the PowerPC literature), and that is what the kernel does. The SW
(supervisor write) bit in the TLB entry has to be set to grant the kernel
write permission on a page.

Furthermore, the SW bit is set according to the DIRTY flag of the PTE.
PTE.DIRTY is set in do_page_fault(), but futex_lock_pi() has page faults
disabled, so PTE.DIRTY can never be set and neither can the SW bit. An
unbreakable COW results, and an infinite loop follows.
I'm fairly sure fault_in_user_writeable() has PF enabled as it takes
mmap_sem, and pagefault_disable() is akin to preempt_disable() on mainline.

Also get_user_pages() fully expects to be able to schedule, and in fact
can call the full pf handler path all by its lonesome self.

The whole scenario should be:
- the child process triggers a page fault on its first access to
the lock and gets its own writable page, but the page is *clean*
because that access only read the status of the lock.
I am sorry for the "unbreakable COW" above.
- futex_lock_pi() is invoked because of lock contention, and
futex_atomic_cmpxchg_inatomic() tries to take the lock. It finds
the lock free and tries to write to it to reserve it; a page fault
occurs because the page is read-only for the kernel (e500 specific),
and -EFAULT is returned to the caller
- fault_in_user_writeable() tries to fix the fault, but from the
get_user_pages() point of view everything is OK because the COW
was already broken, so futex_lock_pi_atomic() is retried
- futex_lock_pi_atomic() --> futex_atomic_cmpxchg_inatomic(),
another write protection page fault
- infinite loop
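
To make the sequence above easier to follow, here is a rough user-space
sketch of it. This is only a toy model, not kernel code: struct toy_pte and
all the toy_*() helpers are made-up names standing in for the PTE, for
futex_atomic_cmpxchg_inatomic(), fault_in_user_writeable() and
futex_lock_pi(). The retry loop is bounded by max_tries so the sketch
terminates, and the mark_dirty knob is just a what-if in the model to show
that the loop ends once DIRTY gets set; it is not a description of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy PTE: field names are illustrative, not the kernel's definitions. */
struct toy_pte {
	bool present;
	bool user_writable;	/* COW already broken, user may write */
	bool dirty;		/* software DIRTY flag */
};

/* Assumption taken from the description above: on e500 the supervisor
 * write (SW) permission is only granted when PTE.DIRTY is set. */
static bool toy_supervisor_can_write(const struct toy_pte *pte)
{
	return pte->present && pte->user_writable && pte->dirty;
}

/* Stand-in for futex_atomic_cmpxchg_inatomic(): page faults are disabled,
 * so a write the kernel is not allowed to do just yields -EFAULT. */
static int toy_cmpxchg_inatomic(const struct toy_pte *pte, int *uaddr,
				int oldval, int newval)
{
	if (!toy_supervisor_can_write(pte))
		return -EFAULT;
	if (*uaddr == oldval)
		*uaddr = newval;
	return 0;
}

/* Stand-in for fault_in_user_writeable(): the page is already writable
 * for the user, so there is nothing to fix and DIRTY stays untouched
 * unless the hypothetical mark_dirty knob is turned on. */
static int toy_fault_in_writeable(struct toy_pte *pte, bool mark_dirty)
{
	if (!pte->present || !pte->user_writable)
		return -EFAULT;	/* a real kernel would break COW here */
	if (mark_dirty)
		pte->dirty = true;
	return 0;
}

/* Stand-in for the futex_lock_pi() retry loop. Returns the number of
 * attempts needed to take the lock, or -1 if it never succeeded within
 * max_tries (the kernel loop has no such bound). */
static int toy_lock_pi(struct toy_pte *pte, int *uaddr, bool mark_dirty,
		       int max_tries)
{
	assert(max_tries > 0);

	for (int attempt = 1; attempt <= max_tries; attempt++) {
		if (toy_cmpxchg_inatomic(pte, uaddr, 0, 1) == 0)
			return attempt;
		if (toy_fault_in_writeable(pte, mark_dirty) != 0)
			return -1;
	}
	return -1;
}
```

With a present, user-writable but clean toy_pte, toy_lock_pi() with
mark_dirty == false uses up every attempt and the lock word stays 0, which
is the loop in the ftrace output. With mark_dirty == true the second
attempt succeeds.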

Thanks
Shan Hai

