Re: [PATCH 1/1] vfio/type1: Respect IOMMU reserved regions in vfio_test_domain_fgsp()

From: Niklas Schnelle
Date: Wed Jan 04 2023 - 04:53:13 EST


On Tue, 2023-01-03 at 19:39 -0400, Jason Gunthorpe wrote:
> On Mon, Jan 02, 2023 at 10:34:52AM +0100, Niklas Schnelle wrote:
> > Since commit cbf7827bc5dc ("iommu/s390: Fix potential s390_domain
> > aperture shrinking") the s390 IOMMU driver uses a reserved region
> > instead of an artificially shrunk aperture to restrict IOMMU use based
> > on the system provided DMA ranges of devices. In particular on current
> > machines this prevents use of DMA addresses below 2^32 for all devices.
> > Usually just IOMMU mapping below these addresses is
> > harmless. However our virtual ISM PCI device looks at new mappings on
> > IOTLB flush and immediately goes into the error state if such a mapping
> > violates its allowed DMA ranges. This then breaks pass-through of the
> > ISM device to a KVM guest.
> >
> > Analysing this we found that vfio_test_domain_fgsp() maps 2 pages at DMA
> > address 0 irrespective of the IOMMUs reserved regions. Even if usually
> > harmless this seems wrong in the general case so instead go through the
> > freshly updated IOVA list and try to find a range that isn't reserved
> > and fits 2 pages and use that for testing for fine grained super pages.
>
> Why does it matter? The s390 driver will not set fgsp=true, so if it
> fails because map fails or does a proper detection it shouldn't make a
> difference.
>
> IOW how does this actually manifest into a failure?

Oh, yeah, I agree. That's what I meant by saying that just mapping should
usually be harmless. This is indeed the case for all normal PCI devices
on s390; there it doesn't matter.

The problem manifests only with ISM devices which are a special s390
virtual PCI device that is implemented in the machine hypervisor. This
device is used for high speed cross-LPAR (Logical Partition)
communication, basically it allows two LPARs that previously exchanged
an authentication token to memcpy between their partitioned memory
using the virtual device. For copying, a receiving LPAR will IOMMU map a
region of memory for the ISM device that it will allow DMAing into
(memcpy by the hypervisor). All other regions remain unmapped and thus
inaccessible. In preparation, the device emulation in the machine
hypervisor intercepts the IOTLB flush and looks at the IOMMU
translation tables, performing e.g. size and alignment checks, I presume.
One of these checks is against the start/end DMA boundaries. With the
test mapping at DMA address 0 this check fails, which leads to the
virtual ISM device being put into an error state. Being in an error
state, it then fails to be initialized by the guest driver later on.

>
> > - if (!ret) {
> > - size_t unmapped = iommu_unmap(domain->domain, 0, PAGE_SIZE);
> > + list_for_each_entry(region, regions, list) {
> > + if (region->end - region->start < PAGE_SIZE * 2)
> > + continue;
> >
> > - if (unmapped == PAGE_SIZE)
> > - iommu_unmap(domain->domain, PAGE_SIZE, PAGE_SIZE);
> > - else
> > - domain->fgsp = true;
> > + ret = iommu_map(domain->domain, region->start, page_to_phys(pages), PAGE_SIZE * 2,
> > + IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE);
>
> The region also needs to have 'region->start % (PAGE_SIZE*2) == 0' for the
> test to work
>
> Jason

Ah okay, makes sense. I guess that check could easily be added.
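
To illustrate, here is a rough user-space sketch of the region selection
with both the size and the alignment check. It is not the actual patch:
struct iova_range and find_fgsp_test_iova() are hypothetical names made
up for this example, the range end is assumed to be inclusive, and the
page size is passed in instead of using PAGE_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one non-reserved IOVA list entry. */
struct iova_range {
	uint64_t start;
	uint64_t end;	/* assumed inclusive */
};

/*
 * Pick the first range whose start is aligned to 2 * page_size and
 * which has room for two pages. On success the start address is
 * stored in *iova and true is returned.
 */
static bool find_fgsp_test_iova(const struct iova_range *ranges, size_t n,
				uint64_t page_size, uint64_t *iova)
{
	uint64_t span = 2 * page_size;

	/* The sketch assumes a non-zero power-of-two page size. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	for (size_t i = 0; i < n; i++) {
		uint64_t start = ranges[i].start;
		uint64_t end = ranges[i].end;

		/* Skip empty or malformed ranges. */
		if (end < start)
			continue;
		/* Too small for two pages (end is inclusive). */
		if (end - start < span - 1)
			continue;
		/* The test mapping must start on a two-page boundary. */
		if (start % span != 0)
			continue;

		*iova = start;
		return true;
	}
	return false;
}
```

A variant could round an unaligned start up to the next two-page
boundary instead of skipping the range; the sketch only does the simple
check suggested above.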