Re: [PATCH 1/8] iommu: Decouple iommu_present() from bus ops

From: Robin Murphy
Date: Mon Jan 30 2023 - 08:49:45 EST


On 2023-01-28 08:49, Baolu Lu wrote:
> On 2023/1/27 23:43, Jason Gunthorpe wrote:
>> On Fri, Jan 27, 2023 at 03:19:55PM +0000, Robin Murphy wrote:
>>
>>> The current implementation of device_iommu_mapped() just dates back to when
>>> dev->iommu_group was the only per-device thing we had, so in principle I
>>> don't have any conceptual objection to redefining it in terms of "device has
>>> ops" rather than "device has a group", but as things stand you'd still have
>>> to do something about PPC first (I know Jason had been pushing on that, but
>>> I've not kept track of where it got to).
>> PPC hasn't moved at all, AFAICT. In a few more months I'm going to
>> suggest we delete the special VFIO support due to it being broken,
>> distros already having turned it off, and nobody caring enough to fix
>> it.
>>
>> What does device_iommu_mapped() even really mean?
>>
>> Looking at usages:
>>
>> These are fixing SoC HW bugs/issues - the desire seems to be "is the SoC's
>> IOMMU enabled?":
>>
>> drivers/char/agp/intel-gtt.c: device_iommu_mapped(&intel_private.pcidev->dev));
>> drivers/dma/sh/rcar-dmac.c:     if (device_iommu_mapped(&pdev->dev))
>> drivers/gpu/drm/i915/i915_utils.c:      if (device_iommu_mapped(i915->drm.dev))
>> ?
>> drivers/usb/dwc3/dwc3-xilinx.c: if (of_dma_is_coherent(dev->of_node) || device_iommu_mapped(dev)) {
>> drivers/usb/host/xhci.c:        if (!(xhci->quirks & XHCI_ZERO_64B_REGS) || !device_iommu_mapped(dev))
>> drivers/crypto/qat/qat_common/adf_sriov.c:      if (!device_iommu_mapped(&pdev->dev))
>> ?

>> These seem to be trying to decide if iommu_domains can be used (and
>> they can't be on power):
>>
>> drivers/gpu/drm/msm/msm_drv.c:  if (device_iommu_mapped(mdp_dev))
>> drivers/gpu/drm/msm/msm_drv.c:          device_iommu_mapped(dev->dev) ||
>> drivers/gpu/drm/msm/msm_drv.c: device_iommu_mapped(dev->dev->parent);
>> drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c:     if (device_iommu_mapped(dev)) {
>> drivers/gpu/drm/rockchip/rockchip_drm_drv.c:    if (!device_iommu_mapped(dev))
>> drivers/gpu/drm/tegra/uapi.c:   if (device_iommu_mapped(client->base.dev) && client->ops->can_use_memory_ctx) {
>> drivers/gpu/host1x/context.c:           if (!fwspec || !device_iommu_mapped(&ctx->dev)) {
>> drivers/infiniband/hw/usnic/usnic_ib_main.c:    if (!device_iommu_mapped(&pdev->dev)) {

>> Yikes, trying to map DMA addresses programmed into devices back to CPU addresses:
>>
>> drivers/misc/habanalabs/common/debugfs.c: if (!user_address || device_iommu_mapped(&hdev->pdev->dev)) {
>> drivers/misc/habanalabs/gaudi2/gaudi2.c:                if (!device_iommu_mapped(&hdev->pdev->dev))

>> And then sequencing the call to iommu_probe_device(), which doesn't
>> apply to power:
>>
>> drivers/acpi/scan.c:    if (!err && dev->bus && !device_iommu_mapped(dev))
>> drivers/iommu/of_iommu.c:       if (!err && dev->bus && !device_iommu_mapped(dev))

>> Leaving these:
>>
>> arch/powerpc/kernel/eeh.c:      if (device_iommu_mapped(dev)) {
>>
>> This is only used to support eeh_iommu_group_to_pe(), which is only
>> called by vfio_iommu_spapr_tce.c. Since power VFIO doesn't work right
>> now this is uncallable, and when power is fixed this will work
>> properly.

Oh wow, I should have looked at more context... Even better, this one is already just an elaborate "if (true)" - it has been impossible for dev_has_iommu_table() to return 0 since at least 2015 :D
>> arch/powerpc/kernel/iommu.c:    if (device_iommu_mapped(dev)) {
>> arch/powerpc/kernel/iommu.c:    if (!device_iommu_mapped(dev)) {
>>
>> These should both be replaced with some kind of "device has iommu group"
>> check, since it is really driving ppc-unique group logic.

And in fact those appear to mostly serve for printing debug messages; if we made iommu_group_add_device() return -EBUSY for duplicate calls (not necessarily a bad idea anyway vs. relying on the noisy failure of sysfs_create_link()), they could arguably just go.

All in all, it's only actually the habanalabs ones that I'm slightly wary of, since they're effectively (mis)using device_iommu_mapped() to infer the DMA ops implementation, which could potentially go wrong (or at least *more* wrong) on POWER with this change. I guess the saving grace is that although they are available on PCIe-interfaced modules, the userspace driver stack seems to be implicitly x86_64-only - as far as I could tell from a quick poke around their site and documentation, which doesn't appear to acknowledge the concept of CPU architectures at all - so the chances of anyone actually trying to use the kernel drivers in anger on POWER seem minimal.

Thanks,
Robin.

>> So, I'd say Baolu's approach is the right thing, just replace the
>> above two in ppc with something else.
>
> Thank you both. I will follow up a series later.
>
> Best regards,
> baolu