Re: [PATCH RESEND 1/1] PCI: Add ATS-disable quirk for AMD Radeon R7 GPUs

From: jroedel@xxxxxxx
Date: Thu Apr 11 2019 - 08:37:02 EST


On Wed, Apr 10, 2019 at 03:59:57PM +0000, Deucher, Alexander wrote:
> > + a few AMD people
> >
> > Seeing this bug makes it more clear. I don't think this is a problem with the
> > GPU. I think it's a problem with either the sbios or iommu. I think the original
> > quirk added for stoney (0x98e4) is probably wrong as well. I suspect we
> > need a quirk for a particular laptop or sbios versions. We validated ATS
> > extensively with Carrizo based systems (the system in the bug report above
> > is Carrizo based) since it is the basis of our ROCm support on APUs. We have
> > also been involved in tons of Linux OEM preloads with both Carrizo and
> > Stoney based APUs in combination with TOPAZ dGPUs (0x6900) and haven't
> > seen this issue in those programs. We also have TOPAZ dGPUs used in OEM
> > programs with Intel chipsets and haven't seen the issue. I suspect since
> > windows does not use the IOMMU by default, the sbios settings may not be
> > well validated on certain windows only skus. I'd rather make these DMI
> > matches or something like that for the platform or at the very least match
> > the SSIDs as well.
>
> Reading through these bugs again it seems to be an issue with Stoney
> APUs, not the dGPU specifically. I think it would be better to
> disable ATS in general if a stoney based platform was detected rather
> than adding ATS quirks for devices that someone may put in a Stoney
> based platform. It also seems to be related to runtime pm on the
> dGPU. Disabling runtime pm also seems to fix the issue. On these
> systems runtime pm for the dGPU is controlled via ACPI (either ATPX or
> _PR3 depending on the platform). Maybe something doesn't get restored
> properly on runtime resume which causes the ATS issues?

This all seems quite possible, but we lack the ability to debug this
further on our side. So until we have a real root cause and a more
specific quirk that only targets systems with a broken sbios or
whatever, we need the catch-all approach.

We can remove these quirks again when AMD sends more specific quirks
upstream.


Regards,

Joerg