Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant

From: Zengtao (B)
Date: Mon Mar 08 2021 - 22:50:08 EST


Hi guys,

Thanks for the helpful comments. After rethinking the issue, I propose
the following change:
1. Use follow_pte instead of follow_pfn.
2. Loop over vmf_insert_pfn instead of a single io_remap_pfn_range call.
3. Properly undo the earlier steps when a later call fails.
4. Keep the larger lock range to avoid unnecessary pte installs.

Please take a look and share your comments, thanks.
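
For reference, the BUG_ON we hit is the "!pte_none(*pte)" check in
remap_pte_range() (mm/memory.c:2177 in the oops quoted below). The
5.11-era loop looks roughly like this (trimmed):

	do {
		BUG_ON(!pte_none(*pte));	/* fires when a pte is already installed */
		if (!pfn_modify_allowed(pfn, prot)) {
			err = -EACCES;
			break;
		}
		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
		pfn++;
	} while (pte++, addr += PAGE_SIZE, addr != end);

vmf_insert_pfn() takes the pte lock and, per my reading of insert_pfn(),
simply returns VM_FAULT_NOPAGE when the pte is already present, so a
concurrent fault no longer trips the BUG_ON. The updated handler: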

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_device *vdev = vma->vm_private_data;
	vm_fault_t ret = VM_FAULT_NOPAGE;
	unsigned long vaddr, pfn;
	pte_t *ptep;
	spinlock_t *ptl;

	mutex_lock(&vdev->vma_lock);
	down_read(&vdev->memory_lock);

	if (!__vfio_pci_memory_enabled(vdev)) {
		ret = VM_FAULT_SIGBUS;
		goto up_out;
	}

	/*
	 * If follow_pte() succeeds, a concurrent fault has already
	 * populated this vma; release the pte lock it returned and
	 * report VM_FAULT_NOPAGE.
	 */
	if (!follow_pte(vma->vm_mm, vma->vm_start, &ptep, &ptl)) {
		pte_unmap_unlock(ptep, ptl);
		goto up_out;
	}

	/* vmf_insert_pfn() returns VM_FAULT_NOPAGE on success. */
	for (vaddr = vma->vm_start, pfn = vma->vm_pgoff;
	     vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
		ret = vmf_insert_pfn(vma, vaddr, pfn);
		if (ret != VM_FAULT_NOPAGE)
			goto zap_vma;
	}

	if (__vfio_pci_add_vma(vdev, vma)) {
		ret = VM_FAULT_OOM;
		goto zap_vma;
	}

	mutex_unlock(&vdev->vma_lock);
	up_read(&vdev->memory_lock);
	return ret;

zap_vma:
	/* Unwind any ptes installed before the failure. */
	zap_vma_ptes(vma, vma->vm_start, vaddr - vma->vm_start);
up_out:
	mutex_unlock(&vdev->vma_lock);
	up_read(&vdev->memory_lock);
	return ret;
}
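
For context, __vfio_pci_add_vma() is the existing vma tracking helper;
in the 5.11 tree it looks roughly like this (the caller is expected to
hold vdev->vma_lock):

	struct vfio_pci_mmap_vma {
		struct vm_area_struct	*vma;
		struct list_head	vma_next;
	};

	/* Caller holds vdev->vma_lock; track @vma so it can be zapped later. */
	static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
				      struct vm_area_struct *vma)
	{
		struct vfio_pci_mmap_vma *mmap_vma;

		mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
		if (!mmap_vma)
			return -ENOMEM;

		mmap_vma->vma = vma;
		list_add(&mmap_vma->vma_next, &vdev->vma_list);

		return 0;
	}

Calling it only after the vmf_insert_pfn() loop succeeds means a failure
in the loop never leaves a stale entry on vma_list.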

> -----Original Message-----
> From: Peter Xu [mailto:peterx@xxxxxxxxxx]
> Sent: March 9, 2021 6:56
> To: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Cc: Zeng Tao <prime.zeng@xxxxxxxxxxxxx>; linuxarm@xxxxxxxxxx; Cornelia
> Huck <cohuck@xxxxxxxxxx>; Kevin Tian <kevin.tian@xxxxxxxxx>; Andrew
> Morton <akpm@xxxxxxxxxxxxxxxxxxxx>; Giovanni Cabiddu
> <giovanni.cabiddu@xxxxxxxxx>; Michel Lespinasse <walken@xxxxxxxxxx>; Jann
> Horn <jannh@xxxxxxxxxx>; Max Gurtovoy <mgurtovoy@xxxxxxxxxx>;
> kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Jason Gunthorpe
> <jgg@xxxxxxxxxx>
> Subject: Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
>
> On Mon, Mar 08, 2021 at 01:21:06PM -0700, Alex Williamson wrote:
> > On Mon, 8 Mar 2021 19:11:26 +0800
> > Zeng Tao <prime.zeng@xxxxxxxxxxxxx> wrote:
> >
> > > We have met the following error when testing with DPDK testpmd:
> > > [ 1591.733256] kernel BUG at mm/memory.c:2177!
> > > [ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> > > [ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio pv680_mii(O)
> > > [ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1
> > > [ 1591.770735] Hardware name: , BIOS HixxxxFPGA 1P B600 V121-1
> > > [ 1591.778872] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
> > > [ 1591.786134] pc : remap_pfn_range+0x214/0x340
> > > [ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
> > > [ 1591.799117] sp : ffff80001068bbd0
> > > [ 1591.803476] x29: ffff80001068bbd0 x28: 0000042eff6f0000
> > > [ 1591.810404] x27: 0000001100910000 x26: 0000001300910000
> > > [ 1591.817457] x25: 0068000000000fd3 x24: ffffa92f1338e358
> > > [ 1591.825144] x23: 0000001140000000 x22: 0000000000000041
> > > [ 1591.832506] x21: 0000001300910000 x20: ffffa92f141a4000
> > > [ 1591.839520] x19: 0000001100a00000 x18: 0000000000000000
> > > [ 1591.846108] x17: 0000000000000000 x16: ffffa92f11844540
> > > [ 1591.853570] x15: 0000000000000000 x14: 0000000000000000
> > > [ 1591.860768] x13: fffffc0000000000 x12: 0000000000000880
> > > [ 1591.868053] x11: ffff0821bf3d01d0 x10: ffff5ef2abd89000
> > > [ 1591.875932] x9 : ffffa92f12ab0064 x8 : ffffa92f136471c0
> > > [ 1591.883208] x7 : 0000001140910000 x6 : 0000000200000000
> > > [ 1591.890177] x5 : 0000000000000001 x4 : 0000000000000001
> > > [ 1591.896656] x3 : 0000000000000000 x2 : 0168044000000fd3
> > > [ 1591.903215] x1 : ffff082126261880 x0 : fffffc2084989868
> > > [ 1591.910234] Call trace:
> > > [ 1591.914837]  remap_pfn_range+0x214/0x340
> > > [ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
> > > [ 1591.931200]  __do_fault+0x44/0x12c
> > > [ 1591.937031]  handle_mm_fault+0xcc8/0x1230
> > > [ 1591.942475]  do_page_fault+0x16c/0x484
> > > [ 1591.948635]  do_translation_fault+0xbc/0xd8
> > > [ 1591.954171]  do_mem_abort+0x4c/0xc0
> > > [ 1591.960316]  el0_da+0x40/0x80
> > > [ 1591.965585]  el0_sync_handler+0x168/0x1b0
> > > [ 1591.971608]  el0_sync+0x174/0x180
> > > [ 1591.978312] Code: eb1b027f 540000c0 f9400022 b4fffe02 (d4210000)
> > >
> > > The cause is that the vfio_pci_mmap_fault function is not reentrant:
> > > if multiple threads fault on the same address at the same time, we
> > > hit the above error.
> > >
> > > Fix the issue by making vfio_pci_mmap_fault reentrant. There is a
> > > second issue: when io_remap_pfn_range fails, we need to undo the
> > > __vfio_pci_add_vma; fix that by moving the __vfio_pci_add_vma call
> > > down after the io_remap_pfn_range.
> > >
> > > Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
> > > Signed-off-by: Zeng Tao <prime.zeng@xxxxxxxxxxxxx>
> > > ---
> > > drivers/vfio/pci/vfio_pci.c | 14 ++++++++++----
> > > 1 file changed, 10 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > > index 65e7e6b..6928c37 100644
> > > --- a/drivers/vfio/pci/vfio_pci.c
> > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > @@ -1613,6 +1613,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > >  	struct vm_area_struct *vma = vmf->vma;
> > >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> > >  	vm_fault_t ret = VM_FAULT_NOPAGE;
> > > +	unsigned long pfn;
> > >
> > >  	mutex_lock(&vdev->vma_lock);
> > >  	down_read(&vdev->memory_lock);
> > > @@ -1623,18 +1624,23 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > >  		goto up_out;
> > >  	}
> > >
> > > -	if (__vfio_pci_add_vma(vdev, vma)) {
> > > -		ret = VM_FAULT_OOM;
> > > +	if (!follow_pfn(vma, vma->vm_start, &pfn)) {
> > >  		mutex_unlock(&vdev->vma_lock);
> > >  		goto up_out;
> > >  	}
> > >
> > > -	mutex_unlock(&vdev->vma_lock);
> >
> >
> > If I understand correctly, I think you're using (perhaps slightly
> > abusing) the vma_lock to extend the serialization of the vma_list
> > manipulation to include io_remap_pfn_range() such that you can test
> > whether the pte has already been populated using follow_pfn(). In
> > that case we return VM_FAULT_NOPAGE without trying to repopulate the
> > page and therefore avoid the BUG_ON in remap_pte_range() triggered by
> > trying to overwrite an existing pte, and less importantly, a duplicate
> > vma in our list. I wonder if use of follow_pfn() is still strongly
> > discouraged for this use case.
> >
> > I'm surprised that it's left to the fault handler to provide this
> > serialization - is this because we're filling the entire vma rather
> > than only the faulting page?
>
> There's definitely some kind of serialization in the process via the pgtable
> locks, which gives me the feeling that the BUG_ON() on "!pte_none(*pte)" in
> remap_pte_range() seems too strong - it could return -EEXIST instead.
>
> However, there would still be the issue of a duplicated vma in vma_list - that
> seems to be a sign that it's still better to fix it from the vfio layer.
>
> >
> > As we move to unmap_mapping_range()[1] we remove all of the complexity
> > of managing a list of vmas to zap based on whether device memory is
> > enabled, including the vma_lock. Are we going to need to replace that
> > with another lock here, or is there a better approach to handling
> > concurrency of this fault handler? Jason/Peter? Thanks,
>
> I haven't looked into the new unmap_mapping_range() series yet. But for the
> current code base: instead of follow_pte(), maybe we could simply do the
> ordering by searching the vma list before inserting into it? Because if the
> vma already exists there, it means the pte installation is done, or at least
> in progress. Then we could return VM_FAULT_RETRY, hoping that it'll be done
> soon.
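
For illustration, the lookup Peter describes could look roughly like the
sketch below (__vfio_pci_vma_tracked is a hypothetical name, reusing the
vma_list/vma_next fields that __vfio_pci_add_vma already maintains):

	/* Hypothetical helper; caller holds vdev->vma_lock. */
	static bool __vfio_pci_vma_tracked(struct vfio_pci_device *vdev,
					   struct vm_area_struct *vma)
	{
		struct vfio_pci_mmap_vma *mmap_vma;

		list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next)
			if (mmap_vma->vma == vma)
				return true;
		return false;
	}

The fault handler would then return VM_FAULT_RETRY when the vma is
already tracked, rather than installing its ptes a second time.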
>
> Then maybe it would also make some sense to have vma_lock protect the whole
> io_remap_pfn_range() too? It'll not be for the ordering, but to guarantee
> that once we're done with the vma_lock, the current vma has all its ptes
> installed, so the next memory access is guaranteed to succeed. That seems
> more efficient than looping on VM_FAULT_RETRY page faults until it's done.
>
> Thanks,
>
> --
> Peter Xu