Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

From: Neo Jia
Date: Tue Jul 05 2016 - 03:30:42 EST


On Tue, Jul 05, 2016 at 02:26:46PM +0800, Xiao Guangrong wrote:
>
>
> On 07/05/2016 01:16 PM, Neo Jia wrote:
> >On Tue, Jul 05, 2016 at 12:02:42PM +0800, Xiao Guangrong wrote:
> >>
> >>
> >>On 07/05/2016 09:35 AM, Neo Jia wrote:
> >>>On Tue, Jul 05, 2016 at 09:19:40AM +0800, Xiao Guangrong wrote:
> >>>>
> >>>>
> >>>>On 07/04/2016 11:33 PM, Neo Jia wrote:
> >>>>
> >>>>>>>
> >>>>>>>Sorry, I think I misread the "allocation" as "mapping". We only delay the
> >>>>>>>cpu mapping, not the allocation.
> >>>>>>
> >>>>>>So how to understand your statement:
> >>>>>>"at that moment nobody has any knowledge about how the physical mmio gets virtualized"
> >>>>>>
> >>>>>>The resource, physical MMIO region, has been allocated, why we do not know the physical
> >>>>>>address mapped to the VM?
> >>>>>>
> >>>>>
> >>>>>From a device driver point of view, the physical mmio region never gets allocated until
> >>>>>the corresponding resource is requested by clients and granted by the mediated device driver.
> >>>>
> >>>>Hmm... but you told me that you did not delay the allocation. :(
> >>>
> >>>Hi Guangrong,
> >>>
> >>>The allocation here is the allocation of device resource, and the only way to
> >>>access that kind of device resource is via a mmio region of some pages there.
> >>>
> >>>For example, if VM needs resource A, and the only way to access resource A is
> >>>via some kind of device memory at mmio address X.
> >>>
> >>>So, we never defer the allocation request during runtime, we just setup the
> >>>CPU mapping later when it actually gets accessed.
> >>>
> >>>>
> >>>>So it returns to my original question: why not allocate the physical mmio region in mmap()?
> >>>>
> >>>
> >>>Without running anything inside the VM, how do you know how the hw resource gets
> >>>allocated, therefore no knowledge of the use of mmio region.
> >>
> >>The allocation and mapping can be two independent processes:
> >>- the first process is just allocation. The MMIO region is allocated from physical
> >> hardware and this region is mapped into _QEMU's_ arbitrary virtual address by mmap().
> >> At this time, VM can not actually use this resource.
> >>
> >>- the second process is mapping. When VM enable this region, e.g, it enables the
> >> PCI BAR, then QEMU maps its virtual address returned by mmap() to VM's physical
> >> memory. After that, VM can access this region.
> >>
> >>The second process is completed handled in userspace, that means, the mediated
> >>device driver needn't care how the resource is mapped into VM.
> >
> >In your example, you are still picturing it as VFIO direct assign, but the solution we are
> >talking here is mediated passthru via VFIO framework to virtualize DMA devices without SR-IOV.
> >
>
> Please see my comments below.
>
> >(Just for completeness, if you really want to use a device in above example as
> >VFIO passthru, the second step is not completely handled in userspace, it is actually the guest
> >driver who will allocate and setup the proper hw resource which will later ready
> >for you to access via some mmio pages.)
>
> Hmm... i always treat the VM as userspace.

It is OK to treat the VM as userspace, but I think it is better to spell out the
details so we are always on the same page.

>
> >
> >>
> >>This is how QEMU/VFIO currently works, could you please tell me the special points
> >>of your solution comparing with current QEMU/VFIO and why current model can not fit
> >>your requirement? So that we can better understand your scenario?
> >
> >The scenario I am describing here is mediated passthru case, but what you are
> >describing here (more or less) is VFIO direct assigned case. It is different in several
> >areas, but major difference related to this topic here is:
> >
> >1) In VFIO direct assigned case, the device (and its resource) is completely owned by the VM
> >therefore its mmio region can be mapped directly into the VM during the VFIO mmap() call as
> >there is no resource sharing among VMs and there is no mediated device driver on
> >the host to manage such resource, so it is completely owned by the guest.
>
> I understand this difference, However, as you told to me that the MMIO region allocated for the
> VM is continuous, so i assume the portion of physical MMIO region is completely owned by guest.
> The only difference i can see is mediated device driver need to allocate that region.

It is physically contiguous, but the allocation is done at runtime; physically contiguous
doesn't mean a static partition at boot time. Only at runtime is the proper HW resource
requested, and only then is the right portion of the MMIO region granted by the mediated
device driver on the host.

Also, physically contiguous doesn't mean the guest and host mmio are always
1:1. You can have an 8GB host physical mmio region while the guest only has
256MB.

>
> >
> >2) In mediated passthru case, multiple VMs are sharing the same physical device, so how
> >the HW resource gets allocated is completely decided by the guest and host device driver of
> >the virtualized DMA device, here is the GPU, same as the MMIO pages used to access those Hw resource.
>
> I can not see what guest's affair is here, look at your code, you cooked the fault handler like
> this:

You shouldn't expect to see it in that code, as that depends on how each device
gets para-virtualized by its own implementation.

>
> + ret = parent->ops->validate_map_request(mdev, virtaddr,
> + &pgoff, &req_size,
> + &pg_prot);
>
> Please tell me what information is got from guest? All these info can be found at the time of
> mmap().

The virtaddr is the guest mmio address that triggers this fault, which will be
used by the mediated device driver to locate the resource that it has previously allocated.

Then the req_size and pgoff will both come from the mediated device driver, based on its
internal bookkeeping of the hw resource allocation, which is only available at runtime. Such
bookkeeping can be built as part of the para-virtualization scheme between the guest and host
device drivers.

None of this information is available at VFIO mmap() time. For example, suppose
several VMs share the same physical device through mediated access. Every VM will
call VFIO mmap() on its virtual BAR as part of the QEMU vfio/pci initialization
process. At that moment, we definitely can't blindly mmap the entire physical MMIO
into all of the VMs, for obvious reasons.

And the pgoff will be different for different VMs, as they must not have access
to each other's hw resources, for the same reason.
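
To make the point above concrete, here is a minimal userspace sketch of that kind
of per-VM bookkeeping. It is not the patch code: struct grant, struct mdev_state,
mdev_grant() and validate_map_request_sketch() are hypothetical names, and the
assumption that pgoff is returned as a host page frame number is mine. It only
shows why the answer depends on runtime state and differs per VM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL
#define MAX_GRANTS 4

/* One runtime grant: a window of the guest's virtual BAR backed by host MMIO. */
struct grant {
    uint64_t bar_off;   /* offset into the guest-visible BAR, page aligned */
    uint64_t host_pfn;  /* first host page frame backing the window */
    uint64_t size;      /* window size in bytes, page aligned */
};

/* Per-VM state kept by the (hypothetical) mediated device driver. */
struct mdev_state {
    struct grant grants[MAX_GRANTS];
    size_t nr_grants;
};

/* Record a grant once guest and host drivers have agreed on a resource. */
static int mdev_grant(struct mdev_state *m, uint64_t bar_off,
                      uint64_t host_pfn, uint64_t size)
{
    if (size == 0 || ((bar_off | size) & (SKETCH_PAGE_SIZE - 1)))
        return -1;                      /* reject empty or unaligned windows */
    if (m->nr_grants == MAX_GRANTS)
        return -1;                      /* table full */
    m->grants[m->nr_grants++] = (struct grant){ bar_off, host_pfn, size };
    return 0;
}

/* Fault-time lookup: which host pages, if any, back this BAR offset? */
static int validate_map_request_sketch(const struct mdev_state *m,
                                       uint64_t fault_off,
                                       uint64_t *pgoff, uint64_t *req_size)
{
    assert(pgoff && req_size);
    for (size_t i = 0; i < m->nr_grants; i++) {
        const struct grant *g = &m->grants[i];
        if (fault_off >= g->bar_off && fault_off - g->bar_off < g->size) {
            *pgoff = g->host_pfn;
            *req_size = g->size;
            return 0;
        }
    }
    return -1;                          /* nothing granted here (yet) */
}
```

Before any grant exists (the situation at mmap() time), the lookup fails for every
offset. After grants are recorded, two VMs faulting at the same BAR offset get
different pgoff values, because each VM has its own table.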

Thanks,
Neo