Re: [PATCH/RFC] mm: do not drop unused pages when userfaultd is running

From: Christian Borntraeger
Date: Thu Jun 28 2018 - 10:51:33 EST




On 06/28/2018 04:49 PM, David Hildenbrand wrote:
> On 28.06.2018 16:39, Christian Borntraeger wrote:
>>
>>
>> On 06/28/2018 03:18 PM, David Hildenbrand wrote:
>>> On 28.06.2018 14:39, Christian Borntraeger wrote:
>>>> KVM guests on s390 can notify the host of unused pages. This can cause
>>>> pte_unused to return true for KVM guest memory.
>>>>
>>>> If a page is unused (checked with pte_unused) we might drop this page
>>>> instead of paging it out. This can have side effects on userfaultfd when
>>>> the page in question was already migrated:
>>>>
>>>> The next access of that page will trigger a fault and a user fault
>>>> instead of faulting in a new and empty zero page. As QEMU does not
>>>> expect a userfault on an already migrated page this migration will fail.
>>>>
>>>> The most straightforward solution is to ignore the pte_unused hint if a
>>>> userfault context is active for this VMA.
>>>>
>>>> Cc: Martin Schwidefsky <schwidefsky@xxxxxxxxxx>
>>>> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>> Signed-off-by: Christian Borntraeger <borntraeger@xxxxxxxxxx>
>>>> ---
>>>> mm/rmap.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 6db729dc4c50..3f3a72aa99f2 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1481,7 +1481,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> set_pte_at(mm, address, pvmw.pte, pteval);
>>>> }
>>>>
>>>> - } else if (pte_unused(pteval)) {
>>>> + } else if (pte_unused(pteval) && !vma->vm_userfaultfd_ctx.ctx) {
>>>> /*
>>>> * The guest indicated that the page content is of no
>>>> * interest anymore. Simply discard the pte, vmscan
>>>>
>>>
>>> To understand the implications better:
>>>
>>> This is like a MADV_DONTNEED from user space while a userfaultfd
>>> notifier is registered for this vma range.
>>>
>>> While we can block such calls in QEMU ("we registered it, we know it
>>> best"), we can't do the same in the kernel.
>>>
>>> These "internal MADV_DONTNEED" operations can actually trigger in a
>>> deferred way, so e.g. if the pte_unused() was set before userfaultfd has
>>> been registered, we can still get the same result, right?
>>
>> Not sure I understand your last sentence.
>
> Rephrased: Trying to stop somebody from setting pte_unused will not
> work, as we might get a userfaultfd registration at some point and
> find a previously set pte_unused afterwards.

Yes, exactly. The unused value can be set before the migration.
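
To make the ordering concrete, here is a rough user-space sketch (plain C,
not kernel code) of the decision the patch changes. The names toy_pte,
toy_vma, toy_unmap_decision and toy_demo are made up for illustration; the
only thing taken from the patch is the condition
pte_unused(pteval) && !vma->vm_userfaultfd_ctx.ctx. It assumes the unused
hint stays set once recorded and that the registration state is looked at
only at unmap time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins, NOT the real kernel structures. */
struct toy_pte {
	bool unused;		/* models pte_unused(pteval) */
};

struct toy_vma {
	bool uffd_registered;	/* models vma->vm_userfaultfd_ctx.ctx != NULL */
};

enum toy_action {
	TOY_DROP,	/* discard the pte, page content is lost */
	TOY_KEEP,	/* fall through to the normal path (e.g. swap pte) */
};

/* The hint is honored only when no userfaultfd context is on the VMA. */
enum toy_action toy_unmap_decision(const struct toy_pte *pte,
				   const struct toy_vma *vma)
{
	if (pte->unused && !vma->uffd_registered)
		return TOY_DROP;
	return TOY_KEEP;
}

/* Hint set first, registration later: the check at unmap time still wins. */
void toy_demo(void)
{
	struct toy_pte pte = { .unused = false };
	struct toy_vma vma = { .uffd_registered = false };

	pte.unused = true;		/* guest marks the page unused */
	vma.uffd_registered = true;	/* migration registers userfaultfd */

	/* Unmap happens now: the stale hint must not cause a drop. */
	assert(toy_unmap_decision(&pte, &vma) == TOY_KEEP);

	vma.uffd_registered = false;	/* no userfaultfd: dropping is fine */
	assert(toy_unmap_decision(&pte, &vma) == TOY_DROP);
}
```

The point of checking at unmap time is that it does not matter when the
hint was set; only the state at the moment of the drop-or-page-out decision
counts.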


>
>> This place here is called on unmap (e.g. when the host tries to page out).
>> The value was transferred before (and always before) during the page table invalidation.
>> So pte_unused was always set before. This is the place where we decide if we page
>> out (and establish a swap pte) or just drop this page table entry. So if
>> no userfaultfd is registered at that point in time we are good.
>
> This certainly applies to ordinary userfaultfd we have right now.
> userfaultfd WP (write-protect) or other features to come might be
> different, but it does not seem to do any harm in case we page out
> instead of dropping it. This way we are on the safe side.

Yes.
>
> In other words: I think this is the right approach.