Re: [PATCH 00/13] KVM: MMU: fast page fault

From: Xiao Guangrong
Date: Mon Apr 09 2012 - 14:13:46 EST


On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:

> On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
>> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
>>> * Idea
>>> The present bit of the page fault error code (EFEC.P) indicates whether
>>> the page table is populated on all levels. If this bit is set, we know
>>> the page fault was caused by the page-protection bits (e.g. the W/R bit)
>>> or the reserved bits.
>>>
>>> In KVM, most page faults of this kind (EFEC.P = 1) can be fixed
>>> simply: faults caused by reserved bits
>>> (EFEC.P = 1 && EFEC.RSV = 1) have already been filtered out in the fast
>>> mmio path. All we need to do to fix the remaining faults
>>> (EFEC.P = 1 && EFEC.RSV != 1) is to raise the corresponding access
>>> permission on the spte.
>>>
>>> This patchset introduces a fast path to fix this kind of page fault: it
>>> runs outside mmu-lock and need not walk the host page table to get the
>>> mapping from gfn to pfn.
>>>
>>>
>>
>> This patchset is really worrying to me.
>>
>> It introduces a lot of concurrency into data structures that were not
>> designed for it. Even if it is correct, it will be very hard to
>> convince ourselves that it is correct, and if it isn't, to debug those
>> subtle bugs. It will also be much harder to maintain the mmu code than
>> it is now.
>>
>> There are a lot of things to check. Just as an example, we need to be
>> sure that if we use rcu_dereference() twice in the same code path, that
>> any inconsistencies due to a write in between are benign. Doing that is
>> a huge task.
>>
>> But I appreciate the performance improvement and would like to see a
>> simpler version make it in. This needs to reduce the amount of data
>> touched in the fast path so it is easier to validate, and perhaps reduce
>> the number of cases that the fast path works on.
>>
>> I would like to see the fast path as simple as
>>
>>     rcu_read_lock();
>>
>>     (lockless shadow walk)
>>     spte = ACCESS_ONCE(*sptep);
>>
>>     if (!(spte & PT_MAY_ALLOW_WRITES))
>>         goto slow;
>>
>>     gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes);
>>     mark_page_dirty(kvm, gfn);
>>
>>     new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
>>     if (cmpxchg(sptep, spte, new_spte) != spte)
>>         goto slow;
>>
>>     rcu_read_unlock();
>>     return;
>>
>> slow:
>>     rcu_read_unlock();
>>     slow_path();
>>
>> It now becomes the responsibility of the slow path to maintain *sptep &
>> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model. It
>> can be as simple as a clear_bit() before we update sp->gfns[] or if we
>> add host write protection.
>>
>> Sorry, it's too complicated for me. Marcelo, what's your take?
>
> The improvement is small and limited to special cases (migration should
> be rare and framebuffer memory accounts for a small percentage of total
> memory).
>
> For one, how can this be safe against mmu notifier methods?
>
> KSM                           | VCPU0         | VCPU1
>                               | fault         | fault
>                               | cow-page      |
>                               | set spte RW   |
>                               |               |
> write protect host pte        |               |
> grab mmu_lock                 |               |
> remove writeable bit in spte  |               |
> increase mmu_notifier_seq     |               | spte = read-only spte
> release mmu_lock              |               | cmpxchg succeeds, RO->RW!
>
> MMU notifiers rely on the fault path sequence being
>
>     read host pte
>     read mmu_notifier_seq
>     spin_lock(mmu_lock)
>     if (mmu_notifier_seq changed)
>         goodbye, host pte value is stale
>     spin_unlock(mmu_lock)
>
> By the example above, you cannot rely on the spte value alone,
> mmu_notifier_seq must be taken into account.


No.

When KSM changes the host page to read-only, the HOST_WRITABLE bit
of the spte should be removed. That means the spte value changes,
and the cmpxchg in the fast path will notice the change.

Note: we mark the spte writable only if spte.HOST_WRITABLE is
set.
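
To make the argument concrete, here is a minimal user-space sketch of the
check described above. It is not the patch code: the bit positions, the
function names and the use of C11 atomics in place of the kernel's
cmpxchg() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; not taken from the patchset. */
#define SPTE_WRITABLE       (1ULL << 1)
#define SPTE_HOST_WRITABLE  (1ULL << 9)

/*
 * Fast path working from a previously read spte value.
 * Returns true if the spte was made writable, false if the caller
 * has to fall back to the slow path under mmu_lock.
 */
static bool fast_fix_from_snapshot(_Atomic uint64_t *sptep, uint64_t snapshot)
{
    /* Never grant write access unless the host allows it. */
    if (!(snapshot & SPTE_HOST_WRITABLE))
        return false;

    uint64_t want = snapshot | SPTE_WRITABLE;

    /* Fails if anybody changed the spte since the snapshot was read. */
    return atomic_compare_exchange_strong(sptep, &snapshot, want);
}

/* Fast path that reads the spte itself and then tries to fix it. */
static bool fast_fix_write_fault(_Atomic uint64_t *sptep)
{
    return fast_fix_from_snapshot(sptep, atomic_load(sptep));
}

/*
 * Models the host side (e.g. KSM via the mmu notifier) write-protecting
 * the page. In KVM this runs under mmu_lock; here it is a single atomic
 * update that clears both bits.
 */
static void host_write_protect(_Atomic uint64_t *sptep)
{
    atomic_fetch_and(sptep, ~(SPTE_WRITABLE | SPTE_HOST_WRITABLE));
}

/*
 * Replays the interleaving from the table above for both possible
 * read orders on VCPU1. Returns true if the spte stayed read-only.
 */
static bool replay_ksm_race(void)
{
    _Atomic uint64_t spte = SPTE_HOST_WRITABLE;   /* read-only, host allows writes */
    uint64_t stale = atomic_load(&spte);          /* VCPU1 reads before KSM runs */

    host_write_protect(&spte);                    /* KSM write-protects the page */
    assert(!(atomic_load(&spte) & SPTE_HOST_WRITABLE));

    bool early_read_won = fast_fix_from_snapshot(&spte, stale); /* cmpxchg fails */
    bool late_read_won  = fast_fix_write_fault(&spte);          /* check fails */

    return !early_read_won && !late_read_won &&
           !(atomic_load(&spte) & SPTE_WRITABLE);
}
```

If VCPU1 read the spte before KSM ran, the cmpxchg fails because the value
changed. If it read the spte afterwards, HOST_WRITABLE is already clear and
the fast path gives up. In this model neither order turns RO into RW.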