Re: [PATCH RFC 0/3] Fix KVM misinterpreting Reserved page as an MMIO page

From: Alexander Duyck
Date: Tue Dec 04 2018 - 13:45:11 EST


On Tue, 2018-12-04 at 14:59 +0800, Yi Zhang wrote:
> On 2018-12-03 at 11:25:20 -0800, Alexander Duyck wrote:
> > I have loosely based this patch series off of the following patch series
> > from Zhang Yi:
> > https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zhang@xxxxxxxxxxxxxxx
> >
> > The original set had attempted to address the fact that DAX pages were
> > treated like MMIO pages which had resulted in reduced performance. It
> > attempted to address this by ignoring the PageReserved flag if the page
> > was either a DEV_DAX or FS_DAX page.
> >
> > I am proposing this as an alternative to that set. The main reason for this
> > is because I believe there are a few issues that were overlooked with that
> > original set. Specifically KVM seems to have two different uses for the
> > PageReserved flag. One being whether or not we can pin the memory, the other
> > being if we should be marking the pages as dirty or accessed. I believe
> > only the pinning really applies so I have split the uses of
> > kvm_is_reserved_pfn and updated the function uses to determine support for
> > page pinning to include a check of the pgmap to see if it supports pinning.
>
> KVM is not the only user of DAX pages.

Yes, but KVM and virtualization in general seem to be where the code
carrying the assumption that PageReserved == MMIO lives.
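
To make the distinction concrete, here is a rough userspace sketch of the
two questions being asked of the reserved flag. It is a toy model, not
kernel code: struct toy_page, the toy_* helpers and the pgmap_pinnable
field are all made-up names for illustration, and the real checks in the
series may look quite different.

```c
/* Toy userspace model. None of these names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_kind { TOY_RAM, TOY_MMIO, TOY_DAX };

struct toy_page {
	enum toy_kind kind;
	bool reserved;       /* stands in for PageReserved */
	bool pgmap_pinnable; /* hypothetical "pgmap supports pinning" hint */
};

/* Single check: anything with the reserved bit is treated like MMIO. */
static inline bool toy_is_reserved(const struct toy_page *p)
{
	assert(p != NULL);
	return p->reserved;
}

/*
 * Split-out check for pinning: a reserved page can still be
 * refcounted if it is a DAX page whose pgmap says pinning is fine.
 */
static inline bool toy_is_refcounted(const struct toy_page *p)
{
	assert(p != NULL);
	if (!p->reserved)
		return true;
	return p->kind == TOY_DAX && p->pgmap_pinnable;
}
```

With only the first helper, a DAX page and an MMIO page look identical.
The second helper separates them for the pinning case while leaving the
reserved bit itself alone.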

> A similar user of PageReserved to look at is:
> drivers/vfio/vfio_iommu_type1.c:is_invalid_reserved_pfn()
> vfio also wants to know whether a page can be pinned.

I would lump vfio in with virtualization as I said above.

A quick search shows that there is also
arch/x86/kvm/mmu.c:kvm_is_mmio_pfn(), which had a similar assumption but
is already carrying workarounds.

> I thought that you had removed the reserved flag on the DAX page
>
> in https://patchwork.kernel.org/patch/10707267/
>
> Is there something I am missing here?

That patch wasn't about DAX memory. That patch was about the fact that
the reserved flag was expensive as a __set_bit operation. I was leaving
the bit set for DAX and all other hot-plug memory and not setting it
for deferred memory init.

The reserved bit is essentially meant to flag every page that is not a
standard system RAM page. Historically most of those were MMIO, but
that is no longer necessarily the case with the introduction of
ZONE_DEVICE pages.

The issue is DAX isn't necessarily system RAM either. So if we don't
set the reserved bit for DAX then we have to go through and start
adding exception cases to the paths that handle system RAM to split it
off from DAX. Dan had pointed out one such example in
kernel/power/snapshot.c:saveable_page() as I recall.
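
As a rough illustration of what such an exception case looks like, here
is a userspace toy, not the actual saveable_page() logic. The struct and
both helper names are hypothetical, and the real function performs more
checks than this.

```c
/* Toy userspace model. None of these names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool reserved;    /* stands in for PageReserved */
	bool zone_device; /* stands in for a ZONE_DEVICE (e.g. DAX) page */
};

/*
 * If DAX pages keep the reserved bit, one test filters out
 * everything that is not ordinary system RAM.
 */
static inline bool toy_saveable_with_bit(const struct toy_page *p)
{
	assert(p != NULL);
	return !p->reserved;
}

/*
 * If DAX pages do not carry the bit, every path like this needs
 * its own explicit exclusion for them.
 */
static inline bool toy_saveable_without_bit(const struct toy_page *p)
{
	assert(p != NULL);
	if (p->zone_device)
		return false;
	return !p->reserved;
}
```

A DAX page with the bit cleared would slip through the first helper,
which is the kind of fallout that would have to be chased down path by
path.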

> >
> > ---
> >
> > Alexander Duyck (3):
> >       kvm: Split use cases for kvm_is_reserved_pfn to kvm_is_refcounted_pfn
> >       mm: Add support for exposing if dev_pagemap supports refcount pinning
> >       kvm: Add additional check to determine if a page is refcounted
> >
> >
> >  arch/x86/kvm/mmu.c        |  6 +++---
> >  drivers/nvdimm/pfn_devs.c |  2 ++
> >  include/linux/kvm_host.h  |  2 +-
> >  include/linux/memremap.h  |  5 ++++-
> >  include/linux/mm.h        | 11 +++++++++++
> >  virt/kvm/kvm_main.c       | 34 +++++++++++++++++++++++++---------
> >  6 files changed, 46 insertions(+), 14 deletions(-)
> >
> > --