Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

From: Neo Jia
Date: Wed Jul 06 2016 - 00:08:28 EST


On Wed, Jul 06, 2016 at 10:22:59AM +0800, Xiao Guangrong wrote:
>
>
> On 07/05/2016 11:07 PM, Neo Jia wrote:
> >This is kept there in case the validate_map_request() is not provided by vendor
> >driver then by default assume 1:1 mapping. So if validate_map_request() is not
> >provided, fault handler should not fail.
>
> THESE are the parameters you passed to validate_map_request(), and these info is
> available in mmap(), it really does not matter if you move validate_map_request()
> to mmap(). That's what i want to say.

Let me answer this at the end of my response.

>
> >
> >>
> >>>None of such information is available at VFIO mmap() time. For example, several VMs
> >>>are sharing the same physical device to provide mediated access. All VMs will
> >>>call the VFIO mmap() on their virtual BAR as part of QEMU vfio/pci initialization
> >>>process, at that moment, we definitely can't mmap the entire physical MMIO
> >>>into both VM blindly for obvious reason.
> >>>
> >>
> >>mmap() carries @length information, so you only need to allocate the specified size
> >>(corresponding to @length) of memory for them.
> >
> >Again, you still look at this as a static partition at QEMU configuration time
> >where the guest mmio will be mapped as a whole at some offset of the physical
> >mmio region. (You still can do that like I said above by not providing
> >validate_map_request() in your vendor driver.)
> >
>
> Then you can move validate_map_request() to here to achieve custom allocation-policy.
>
> >But this is not the framework we are defining here.
> >
> >The framework we have here is to provide the driver vendor flexibility to decide
> >the guest mmio and physical mmio mapping on page basis, and such information is
> >available during runtime.
> >
> >How such information gets communicated between guest and host driver is up to
> >driver vendor.
>
> The problems is the sequence of the way "provide the driver vendor
> flexibility to decide the guest mmio and physical mmio mapping on page basis"
> and mmap().
>
> We should provide such allocation info first then do mmap(). You current design,
> do mmap() -> communication telling such info -> use such info when fault happens,
> is really BAD, because you can not control the time when memory fault will happen.
> The guest may access this memory before the communication you mentioned above,
> and another reason is that KVM MMU can prefetch memory at any time.

As I have said before, if your implementation doesn't need such flexibility, you
can still do a static mapping at VFIO mmap() time. In that case your mediated
driver doesn't have to provide validate_map_request(), and the fault handler
will not be called.
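
To make the two modes concrete, here is a minimal userspace sketch. It is only
a model, not the kernel code from the patch series: struct mdev_model,
model_mmap(), model_needs_fault_handler(), example_validate() and the callback
signature are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; these are NOT the kernel interfaces. */
struct mdev_model;

/* Assumed callback shape: translate a faulting guest page offset into a
 * host page offset and a size, or return nonzero to reject the request. */
typedef int (*validate_map_request_fn)(struct mdev_model *dev,
                                       uint64_t guest_pgoff,
                                       uint64_t *host_pgoff,
                                       uint64_t *req_size);

struct mdev_model {
    uint64_t bar_pages;     /* size of the virtual BAR, in pages */
    uint64_t mapped_pages;  /* pages mapped eagerly at mmap() time */
    validate_map_request_fn validate_map_request; /* NULL if not provided */
};

/* Models mmap(): returns the number of pages mapped eagerly. */
uint64_t model_mmap(struct mdev_model *dev)
{
    assert(dev != NULL);
    if (dev->validate_map_request == NULL)
        dev->mapped_pages = dev->bar_pages; /* static mapping of the whole BAR */
    else
        dev->mapped_pages = 0;              /* populate later, on fault */
    return dev->mapped_pages;
}

/* The fault handler is only involved when the vendor driver opted in. */
int model_needs_fault_handler(const struct mdev_model *dev)
{
    assert(dev != NULL);
    return dev->validate_map_request != NULL;
}

/* Illustrative callback: identity translation, one page at a time. */
int example_validate(struct mdev_model *dev, uint64_t guest_pgoff,
                     uint64_t *host_pgoff, uint64_t *req_size)
{
    if (guest_pgoff >= dev->bar_pages)
        return -1;
    *host_pgoff = guest_pgoff;
    *req_size = 1;
    return 0;
}
```

With validate_map_request left NULL, model_mmap() maps the whole BAR up front
and no fault is ever taken; with the callback set, nothing is mapped at mmap()
time and every mapping decision is deferred to the fault path.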

Let me address your questions below.

1. Information available at VFIO mmap() time?

So you are saying that @req_size and @pgoff are both available at the time
VFIO mmap() is called, when the guest OS is not even running, right?

The answer is no. The only things available at VFIO mmap() time are the following:

1) guest MMIO size

2) host physical MMIO size

3) guest MMIO starting address

4) host MMIO starting address

But none of the above are the @req_size and @pgoff that we are talking about at
validate_map_request() time.

Our host MMIO represents the means to access GPU HW resources. Those GPU HW
resources are allocated dynamically at runtime, so at VFIO mmap() time we have
no visibility of the @pgoff and @req_size that cover a specific type of GPU HW
resource. We don't even know whether such a resource will be required for a
particular VM.

For example, VM1 needs to launch a lot more graphics workload than VM2. The end
result is that VM1 gets a lot more of resource A allocated than VM2 to support
its graphics workload. To access resource A, a host mmio region will be
allocated as well, say [pfn_a, size_a] for VM1 and [pfn_b, size_b] for VM2.

Clearly, such a region can be destroyed and reallocated throughout the mediated
driver's lifetime. This is why we need a fault handler there to map the proper
pages into the guest after validation at runtime.
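
As a rough illustration of why the lookup has to happen at fault time, here is
a small userspace model. The region table and the names struct vm_model,
region_alloc(), region_free() and model_fault() are hypothetical and are not
part of the patch series; how a real vendor driver tracks its allocations is up
to that driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 8

/* One dynamically allocated window: guest pages
 * [guest_pgoff, guest_pgoff + npages) are backed by host PFNs
 * starting at host_pfn. Hypothetical model, not a kernel structure. */
struct region {
    int in_use;
    uint64_t guest_pgoff;
    uint64_t host_pfn;
    uint64_t npages;
};

struct vm_model {
    struct region regions[MAX_REGIONS];
};

/* Runtime allocation: returns the slot index, or -1 if no slot is free. */
int region_alloc(struct vm_model *vm, uint64_t guest_pgoff,
                 uint64_t host_pfn, uint64_t npages)
{
    assert(vm != NULL);
    if (npages == 0)
        return -1;
    for (size_t i = 0; i < MAX_REGIONS; i++) {
        if (!vm->regions[i].in_use) {
            vm->regions[i] = (struct region){ 1, guest_pgoff, host_pfn, npages };
            return (int)i;
        }
    }
    return -1;
}

/* Runtime teardown: the window disappears and later faults are rejected. */
void region_free(struct vm_model *vm, int slot)
{
    assert(vm != NULL && slot >= 0 && slot < MAX_REGIONS);
    vm->regions[slot].in_use = 0;
}

/* Fault-time validation: 0 and *host_pfn on success, -1 if the faulting
 * page is not covered by any currently allocated region. */
int model_fault(const struct vm_model *vm, uint64_t guest_pgoff,
                uint64_t *host_pfn)
{
    assert(vm != NULL && host_pfn != NULL);
    for (size_t i = 0; i < MAX_REGIONS; i++) {
        const struct region *r = &vm->regions[i];
        if (r->in_use && guest_pgoff >= r->guest_pgoff &&
            guest_pgoff - r->guest_pgoff < r->npages) {
            *host_pfn = r->host_pfn + (guest_pgoff - r->guest_pgoff);
            return 0;
        }
    }
    return -1; /* nothing allocated here: deny the access */
}
```

In this model the same guest page offset can be valid, invalid, and valid again
with a different host PFN over the lifetime of the VM, which is exactly the
information that does not exist yet when mmap() is called.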

I hope the above answers your question of why we can't provide such allocation
info at VFIO mmap() time.

2. Guest might access mmio region at any time ...

A guest with a mediated GPU can definitely access its BARs at any time. If the
guest accesses part of a BAR region that has not been allocated beforehand, the
access will be denied, and with the current scheme the VM will crash to prevent
malicious access from the guest. This is another reason we chose to keep the
guest MMIO mediated.

3. KVM MMU can prefetch memory at any time.

You are talking about the KVM MMU prefetching a guest mmio region that is marked
as prefetchable, right?

On bare metal, the prefetch is basically a cache line fill: the range needs to
be marked as cacheable for the CPU, which then issues a read to anywhere in the
cache line.

Is KVM MMU prefetch the same as on bare metal? If so, it is at least not "at any
time", right?

And the prefetch will only happen when there is a valid guest CPU mapping to the
guest mmio region. That takes us back to issue (2): if the CPU mapping is set up
via the proper driver and gets validated by the mediated device driver, such a
prefetch will work as expected. If not, the prefetch is no different from a
malicious or unsupported access from the guest, and the VM will crash.

Thanks,
Neo