RE: [EXT] Re: vfio-pci: protect remap_pfn_range() from simultaneous calls

From: Bharat Bhushan
Date: Mon Mar 08 2021 - 02:03:56 EST




> -----Original Message-----
> From: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
> Sent: Monday, March 8, 2021 12:29 PM
> To: Bharat Bhushan <bbhushan2@xxxxxxxxxxx>
> Cc: alex.williamson@xxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Sunil Kovvuri
> Goutham <sgoutham@xxxxxxxxxxx>; terminus@xxxxxxxxx
> Subject: Re: [EXT] Re: vfio-pci: protect remap_pfn_range() from simultaneous
> calls
>
> On 2021-03-02 4:47 a.m., Bharat Bhushan wrote:
> > Hi Ankur,
> >
> >> -----Original Message-----
> >> From: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
> >> Sent: Friday, February 26, 2021 6:24 AM
> >> To: Bharat Bhushan <bbhushan2@xxxxxxxxxxx>
> >> Cc: alex.williamson@xxxxxxxxxx; ankur.a.arora@xxxxxxxxxx; linux-
> >> kernel@xxxxxxxxxxxxxxx; Sunil Kovvuri Goutham <sgoutham@xxxxxxxxxxx>;
> >> terminus@xxxxxxxxx
> >> Subject: [EXT] Re: vfio-pci: protect remap_pfn_range() from
> >> simultaneous calls
> >>
> >> External Email
> >>
> >> ---------------------------------------------------------------------
> >> -
> >> Hi Bharat,
> >>
> >> Can you test the patch below to see if it works for you?
> >
> > Sorry for late reply, I actually missed this email.
> >
> > Reproducibility of the issue was low in my test scenario, one out of ~15 runs. I
> run it multiple times, overnight and observed no issues.
>
> Awesome. Would you mind giving me your Tested-by for the patch?

Sure, please point if this is already sent for review.

>
> >> Also could you add some more detail to your earlier description of
> >> the bug?
> >
> > Our test case is running ODP multi-threaded application, where parent process
> maps (yes it uses MAP_DMA) the region and then child processes access same.
> As a workaround we tried accessing the region once by parent process before
> creating other accessor threads and it worked as expected.
>
> Thanks for the detail. So if the child processes start early -- they might fault while
> the VFIO_IOMMU_MAP_DMA was going on. And, since they only acquire
> mmap_lock in RO mode, both paths would end up calling io_remap_pfn_range()
> via the fault handler.

Yes, that's correct.

Thanks
-Bharat

>
> Thanks
> Ankur
>
> >
> > Thanks
> > -Bharat
> >
> >> In particular, AFAICS you are using ODP (-DPDK?) with multiple
> >> threads touching this region. From your stack, it looks like the
> >> fault was user-space generated, and I'm guessing you were not using
> >> the VFIO_IOMMU_MAP_DMA.
> >>
> >> Ankur
> >>
> >> -- >8 --
> >>
> >> Subject: [PATCH] vfio-pci: protect io_remap_pfn_range() from
> >> simultaneous calls
> >>
> >> vfio_pci_mmap_fault() maps the complete VMA on fault. With concurrent
> >> faults, this would result in multiple calls to io_remap_pfn_range(),
> >> where it would hit a BUG_ON(!pte_none(*pte)) in remap_pte_range().
> >> (It would also link the same VMA multiple times in vdev->vma_list but
> >> given the BUG_ON that is less serious.)
> >>
> >> Normally, however, this won't happen -- at least with
> >> vfio_iommu_type1 -- the VFIO_IOMMU_MAP_DMA path is protected by
> iommu->lock.
> >>
> >> If, however, we are using some kind of parallelization mechanism like
> >> this one with ktask under discussion [1], we would hit this.
> >> Even if we were doing this serially, given that vfio-pci remaps a
> >> larger extent than strictly necessary it should internally enforce
> >> coherence of its data structures.
> >>
> >> Handle this by using the VMA's presence in the vdev->vma_list as
> >> indicative of a fully mapped VMA and returning success early to all
> >> but the first VMA fault. Note that this is clearly optimstic given
> >> that the mapping is ongoing, and might mean that the caller sees more
> >> faults until the remap is done.
> >>
> >> [1]
> >> https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_
> >> linux-
> >> 2Dmm_20181105145141.6f9937f6-
> >>
> 40w520.home_&d=DwIDAg&c=nKjWec2b6R0mOyPaz7xtfQ&r=PAAlWswPe7d8gHl
> >>
> GbCLmy2YezyK7O3Hv_t2heGnouBw&m=3ZDXqnn9xNUCjgXwN9mHIKT7oyXu55P
> >> U7yV2j0b-5hw&s=hiICkNtrcH4AbAWRrbkvMUylp7Bv0YHFCjxNGC6CGOk&e=
> >>
> >> Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
> >> ---
> >> drivers/vfio/pci/vfio_pci.c | 25 ++++++++++++++++++++++++-
> >> 1 file changed, 24 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci.c
> >> b/drivers/vfio/pci/vfio_pci.c index 65e7e6b44578..b9f509863db1 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct
> >> vfio_pci_device *vdev,
> >> {
> >> struct vfio_pci_mmap_vma *mmap_vma;
> >>
> >> + list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> >> + if (mmap_vma->vma == vma)
> >> + return 1;
> >> + }
> >> +
> >> mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> >> if (!mmap_vma)
> >> return -ENOMEM;
> >> @@ -1613,6 +1618,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct
> >> vm_fault
> >> *vmf)
> >> struct vm_area_struct *vma = vmf->vma;
> >> struct vfio_pci_device *vdev = vma->vm_private_data;
> >> vm_fault_t ret = VM_FAULT_NOPAGE;
> >> + int vma_present;
> >>
> >> mutex_lock(&vdev->vma_lock);
> >> down_read(&vdev->memory_lock);
> >> @@ -1623,7 +1629,21 @@ static vm_fault_t vfio_pci_mmap_fault(struct
> >> vm_fault *vmf)
> >> goto up_out;
> >> }
> >>
> >> - if (__vfio_pci_add_vma(vdev, vma)) {
> >> + /*
> >> + * __vfio_pci_add_vma() either adds the vma to the vdev->vma_list
> >> + * (vma_present == 0), or indicates that the vma is already present
> >> + * on the list (vma_present == 1).
> >> + *
> >> + * Overload the meaning of this flag to also imply that the vma is
> >> + * fully mapped. This allows us to serialize the mapping -- ensuring
> >> + * that simultaneous faults will not both try to call
> >> + * io_remap_pfn_range().
> >> + *
> >> + * However, this might mean that callers to which we returned success
> >> + * optimistically will see more faults until the remap is complete.
> >> + */
> >> + vma_present = __vfio_pci_add_vma(vdev, vma);
> >> + if (vma_present < 0) {
> >> ret = VM_FAULT_OOM;
> >> mutex_unlock(&vdev->vma_lock);
> >> goto up_out;
> >> @@ -1631,6 +1651,9 @@ static vm_fault_t vfio_pci_mmap_fault(struct
> >> vm_fault
> >> *vmf)
> >>
> >> mutex_unlock(&vdev->vma_lock);
> >>
> >> + if (vma_present)
> >> + goto up_out;
> >> +
> >> if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> >> vma->vm_end - vma->vm_start, vma-
> >>> vm_page_prot))
> >> ret = VM_FAULT_SIGBUS;
> >> --
> >> 2.29.2
> >