Re: [PATCH] iomm/arm-smmu: Add stall implementation hook

From: Will Deacon
Date: Thu May 07 2020 - 08:54:05 EST


On Thu, May 07, 2020 at 11:55:54AM +0100, Robin Murphy wrote:
> On 2020-05-07 11:14 am, Sai Prakash Ranjan wrote:
> > On 2020-04-22 01:50, Sai Prakash Ranjan wrote:
> > > Add stall implementation hook to enable stalling
> > > faults on QCOM platforms which supports it without
> > > causing any kind of hardware mishaps. Without this
> > > on QCOM platforms, GPU faults can cause unrelated
> > > GPU memory accesses to return zeroes. This has the
> > > unfortunate result of command-stream reads from CP
> > > getting invalid data, causing a cascade of fail.
>
> I think this came up before, but something about this rationale doesn't add
> up - we're not *using* stalls at all, we're still terminating faulting
> transactions unconditionally; we're just using CFCFG to terminate them with
> a slight delay, rather than immediately. It's really not clear how or why
> that makes a difference. Is it a GPU bug? Or an SMMU bug? Is this reliable
> (or even a documented workaround for something), or might things start
> blowing up again if any other behaviour subtly changes? I'm not dead set
> against adding this, but I'd *really* like to have a lot more confidence in
> it.

Rob mentioned something about the "bus returning zeroes" before, but I agree
that we need more information so that we can reason about this and maintain
the code as the driver continues to change. That needs to be a comment in
the driver, and I don't think "but android seems to work" is a good enough
justification. There was some interaction with HUPCF as well.

As a template, I'd suggest:

> > > diff --git a/drivers/iommu/arm-smmu.h b/drivers/iommu/arm-smmu.h
> > > index 8d1cd54d82a6..d5134e0d5cce 100644
> > > --- a/drivers/iommu/arm-smmu.h
> > > +++ b/drivers/iommu/arm-smmu.h
> > > @@ -386,6 +386,7 @@ struct arm_smmu_impl {
> > >      int (*init_context)(struct arm_smmu_domain *smmu_domain);
> > >      void (*tlb_sync)(struct arm_smmu_device *smmu, int page, int sync,
> > >               int status);

/*
* Stall transactions on a context fault, where they will be terminated
* in response to the resulting IRQ rather than immediately. This should
* pretty much always be set to "false" as stalling can introduce the
* potential for deadlock in most SoCs, however it is needed on Qualcomm
* XXXX because YYYY.
*/

> > > +    bool stall;

Hmm, the more I think about this, the more I think this is an erratum
workaround in disguise, in which case this could be better named...

Will