Re: [PATCH] vfio/type1: Unpin zero pages

From: David Hildenbrand
Date: Tue Aug 30 2022 - 11:44:01 EST


On 30.08.22 17:11, Alex Williamson wrote:
> On Tue, 30 Aug 2022 09:59:33 +0200
> David Hildenbrand <david@xxxxxxxxxx> wrote:
>
>> On 30.08.22 05:05, Alex Williamson wrote:
>>> There's currently a reference count leak on the zero page. We increment
>>> the reference via pin_user_pages_remote(), but the page is later handled
>>> as an invalid/reserved page, therefore it's not accounted against the
>>> user and not unpinned by our put_pfn().
>>>
>>> Introducing special zero page handling in put_pfn() would resolve the
>>> leak, but without accounting of the zero page, a single user could
>>> still create enough mappings to generate a reference count overflow.
>>>
>>> The zero page is always resident, so for our purposes there's no reason
>>> to keep it pinned. Therefore, add a loop to walk pages returned from
>>> pin_user_pages_remote() and unpin any zero pages.
>>>
>>> Cc: David Hildenbrand <david@xxxxxxxxxx>
>>> Cc: stable@xxxxxxxxxxxxxxx
>>> Reported-by: Luboslav Pivarc <lpivarc@xxxxxxxxxx>
>>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>>> ---
>>> drivers/vfio/vfio_iommu_type1.c | 12 ++++++++++++
>>> 1 file changed, 12 insertions(+)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index db516c90a977..8706482665d1 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -558,6 +558,18 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>>>  	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
>>>  				    pages, NULL, NULL);
>>>  	if (ret > 0) {
>>> +		int i;
>>> +
>>> +		/*
>>> +		 * The zero page is always resident, we don't need to pin it
>>> +		 * and it falls into our invalid/reserved test so we don't
>>> +		 * unpin in put_pfn().  Unpin all zero pages in the batch here.
>>> +		 */
>>> +		for (i = 0 ; i < ret; i++) {
>>> +			if (unlikely(is_zero_pfn(page_to_pfn(pages[i]))))
>>> +				unpin_user_page(pages[i]);
>>> +		}
>>> +
>>>  		*pfn = page_to_pfn(pages[0]);
>>>  		goto done;
>>>  	}
>>>
>>>
>>
>> As discussed offline, for the shared zeropage (that's not even
>> refcounted when mapped into a process), this makes perfect sense to me.
>>
>> Sean raised a good question: might ZONE_DEVICE pages be similarly
>> problematic? For them, though, we cannot simply always unpin here.
>
> What sort of VM mapping would give me ZONE_DEVICE pages? Thanks,

I think one approach is mmap'ing a devdax device. You don't need actual
NVDIMM hardware to test this: the "memmap=" kernel parameter can
simulate it, even on bare metal.

https://nvdimm.wiki.kernel.org/
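
For the devdax route, a minimal userspace sketch of the mapping step
might look like the following. map_devdax() and the /dev/dax0.0 path are
hypothetical names used only for illustration; the length usually has to
be a multiple of the device's alignment (commonly 2 MiB), and the
returned address is what you would then hand to VFIO for DMA mapping.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): map `len` bytes of a
 * device-dax node, e.g. /dev/dax0.0 (example path, an assumption),
 * into the calling process. Returns the mapping, or NULL on failure.
 */
static void *map_devdax(const char *path, size_t len)
{
	void *addr;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return NULL;

	/* device-dax rejects private mappings, so use MAP_SHARED. */
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping stays valid after the descriptor is closed. */
	close(fd);

	return addr == MAP_FAILED ? NULL : addr;
}
```

With emulated pmem from "memmap=" (for example memmap=4G!12G, which
marks 4G starting at physical offset 12G as persistent memory), the
region still has to be configured as a devdax namespace before such a
device node shows up, as far as I know.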

Alternatively, you can use an emulated nvdimm device under QEMU -- but
then you'd have to run VFIO inside the VM. I know (that you know) that
there are ways to get that working, but it certainly requires more effort :)

... let me know if you need any tips & tricks.

--
Thanks,

David / dhildenb