Re: [RFC PATCH 05/10] vfio: Create a vfio_device from vma lookup

From: Jason Gunthorpe
Date: Thu Mar 04 2021 - 19:36:48 EST


On Thu, Mar 04, 2021 at 05:07:31PM -0700, Alex Williamson wrote:
> On Thu, 4 Mar 2021 19:16:33 -0400
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> > On Thu, Mar 04, 2021 at 02:37:57PM -0700, Alex Williamson wrote:
> >
> > > Therefore unless a bus driver opts-out by replacing vm_private_data, we
> > > can identify participating vmas by the vm_ops and have flags indicating
> > > if the vma maps device memory such that vfio_get_device_from_vma()
> > > should produce a device reference. The vfio IOMMU backends would also
> > > consume this, ie. if they get a valid vfio_device from the vma, use the
> > > pfn_base field directly. vfio_vm_ops would wrap the bus driver
> > > callbacks and provide reference counting on open/close to release this
> > > object.
> >
> > > I'm not thrilled with a vfio_device_ops callback plumbed through
> > > vfio-core to do vma-to-pfn translation, so I thought this might be a
> > > better alternative. Thanks,
> >
> > Maybe you could explain why, because I'm looking at this idea and
> > thinking it looks very complicated compared to a simple driver op
> > callback?
>
> vfio-core needs to export a vfio_vma_to_pfn() which I think assumes the
> caller has already used vfio_device_get_from_vma(), but should still
> validate the vma is one from a vfio device before calling this new
> vfio_device_ops callback.

Huh? Validate? Why?

Something like this in the IOMMU stuff:

struct vfio_device *vfio = vfio_device_get_from_vma(vma)

if (!vfio->vma_to_pfn)
return -EINVAL;
vfio->ops->vma_to_pfn(vfio, vma, offset_from_vma_start)

Is fine, why do we need to over complicate?

I don't need to check that the vma belongs to the vfio because it is
the API contract that the caller will guarantee that.

This is the kernel, I can (and do!) check these sorts of things by
code inspection when working on stuff - I can look at every
implementation and every call site to prove these things.

IMHO doing an expensive check like that is a style of defensive
programming the kernel community frowns upon.

> vfio-pci needs to validate the vm_pgoff value falls within a BAR
> region, mask off the index and get the pci_resource_start() for the
> BAR index.

It needs to do the same thing fault() already does, which is currently
one line..

> Then we need a solution for how vfio_device_get_from_vma() determines
> whether to grant a device reference for a given vma, where that vma may
> map something other than device memory. Are you imagining that we hand
> out device references independently and vfio_vma_to_pfn() would return
> an errno for vm_pgoff values that don't map device memory and the IOMMU
> driver would release the reference?

That seems a reasonable place to start

> prevent using unmmap_mapping_range(). The IOMMU backend, once it has a
> vfio_device via vfio_device_get_from_vma() can know the format of
> vm_private_data, cast it as a vfio_vma_private_data and directly use
> base_pfn, accomplishing the big point. They're all operating in the
> agreed upon vm_private_data format. Thanks,

If we force all drivers into a mandatory (!) vm_private_data format
then every driver has to be audited and updated before the new pfn
code can be done. If any driver in the future makes a mistake here
(and omitting the unique vm_private_data magics is a very easy mistake
to make) then it will cause a kernel crash in an obscure scenario.

It is the "design the API to be hard to use wrong" philosophy.

Jason