Re: Kernel 6.7 regression doesn't boot if using AMD eGPU

From: Jason Gunthorpe
Date: Tue Apr 16 2024 - 07:39:48 EST


On Tue, Apr 16, 2024 at 04:23:38PM +0530, Vasant Hegde wrote:
> > Yes, if the driver wants to force a domain type it should be
> > forced. Driver knows best. Eg AMD forces IDENTITY when the HW/driver
> > is incapable of supporting otherwise.
>
>
> @Jason,
>
> Looks like before commit 59ddce4418da483, core IOMMU layer was enforcing
> 'IOMMU_DOMAIN_DMA' for untrusted device. If its trusted device then it was
> letting HW IOMMU driver to pick best domain type.

If the driver wants to force identity because paging doesn't work then
yes it needs to figure something out..

Really the drivers should not be designed to do this, they need to
accommodate paging domains in all cases if things are going to work
correctly. The def_domain callback should be a last resort.

> > diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> > index d35c1b8c8e65ce..f3da6a5b6cb1cb 100644
> > --- a/drivers/iommu/amd/iommu.c
> > +++ b/drivers/iommu/amd/iommu.c
> > @@ -2758,11 +2758,16 @@ static int amd_iommu_def_domain_type(struct device *dev)
> >          *    and require remapping.
> >          *  - SNP is enabled, because it prohibits DTE[Mode]=0.
> >          */
> > -       if (pdev_pasid_supported(dev_data) &&
> > -           !cc_platform_has(CC_ATTR_MEM_ENCRYPT) &&
> > -           !amd_iommu_snp_en) {
> > +       if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT) && !amd_iommu_snp_en)
> > +               return IOMMU_DOMAIN_IDENTITY;
> > +
> > +       /*
> > +        * For now driver limitations prevent enabling PASID as a paging domain
> > +        * on the RID together.
> > +        */
> > +       if (dev_is_pci(dev) && !to_pci_dev(dev)->untrusted &&
> > +           pdev_pasid_supported(dev_data))
> >                 return IOMMU_DOMAIN_IDENTITY;
> > -       }
> >
> >         return 0;
> >  }
> >
> >
> > As it booted ok with Robin's patch above, these changes to
> > drivers/iommu/amd/iommu.c didn't seem to make a difference for me. I was
> > testing with amd iommu v2 off in the kernel config and I also have TSME
> > enabled in the BIOS if that matters.

There must be a mistake in the above then, it would be good to sort it
out because something like that is the right fix.

Jason