Re: [PATCH] coresight: tmc-etr: Don't enable ETR when it's not ready

From: Yabin Cui
Date: Tue Jan 24 2023 - 15:09:50 EST


Ping for review. And I still can't reproduce it, even if I reduced the
timeout to 2us and tried different workloads. Any suggestions for how
to reproduce it?

Thanks,
Yabin

On Tue, Jan 10, 2023 at 1:06 PM Yabin Cui <yabinc@xxxxxxxxxx> wrote:
>
> > Do you have a reproducer for this or some more info?
> > For example is it a regression or has it always been there? And on which
> > platform.
>
> It happens on Pixel 6 and 7. We collect ETM data periodically from some
> internal dogfood devices. The problem has happened several times on
> dogfood devices. But I am still trying to reproduce it locally.
>
> We use the scatter-gather mode of ETR, and allocate a 4M buffer. In userspace,
> we use simpleperf in Android to collect system wide ETM data. What is special
> is, simpleperf disables and reenables perf events every 100ms to flush ETM
> data to perf aux buffer.
>
> Pixel 6 and 7 have hardware monitoring AXI traffic. The hardware finds ETR is
> trying to read from or write to a low invalid address (like 0x2E0000). The
> problem always happens right after the "tmc_etr: timeout while waiting for TMC
> to be Ready" message. And in almost all cases, I can find a "timeout while
> waiting for completion of Manual Flush" message from the previous session.
>
> One log history is below:
> [11484.610008][ C0] coresight tmc_etr0: timeout while waiting for
> completion of Manual Flush
> [11484.610177][ C0] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> [11484.615367][ C0] coresight tmc_etr0: timeout while waiting for
> completion of Manual Flush
> [11484.615534][ C0] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> [12089.486044][ C0] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> AXI error report reading from invalid address
>
> Another log history is below:
> [76709.382650][ C5] coresight tmc_etf1: timeout while waiting for
> TMC to be Ready
> [76709.382852][ C7] coresight tmc_etr0: timeout while waiting for
> completion of Manual Flush
> [76709.382995][ C7] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> [76709.384510][ C7] coresight tmc_etr0: timeout while waiting for
> completion of Manual Flush
> [76709.384649][ C7] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> [76709.384838][ C0] coresight tmc_etr0: timeout while waiting for
> TMC to be Ready
> AIX error report writing to invalid address
>
> It seems if the previous manual flush doesn't finish gracefully, ETR may not be
> ready for the next enable (even after 10min as in the first log). And if we
> continue to enable ETR, an invalid AXI IO may happen.
>
> Thanks,
> Yabin
>
> On Tue, Jan 10, 2023 at 10:04 AM Suzuki K Poulose
> <suzuki.poulose@xxxxxxx> wrote:
> >
> > On 10/01/2023 17:48, Mike Leach wrote:
> > > Hi,
> > >
> > > On Tue, 10 Jan 2023 at 09:30, James Clark <james.clark@xxxxxxx> wrote:
> > >>
> > >>
> > >>
> > >> On 09/01/2023 23:43, Yabin Cui wrote:
> > >>> Otherwise, it may cause error in AXI bus and result in a kernel panic.
> > >>
> > >> Hi Yabin,
> > >>
> > >> Thanks for the fix. Do you have a reproducer for this or some more info?
> > >> For example is it a regression or has it always been there? And on which
> > >> platform.
> > >>
> > >> Thanks
> > >> James
> > >>
> > >>>
> > >>> Signed-off-by: Yabin Cui <yabinc@xxxxxxxxxx>
> > >>> ---
> > >>> .../hwtracing/coresight/coresight-tmc-core.c | 4 +++-
> > >>> .../hwtracing/coresight/coresight-tmc-etr.c | 18 +++++++++++++++---
> > >>> drivers/hwtracing/coresight/coresight-tmc.h | 2 +-
> > >>> 3 files changed, 19 insertions(+), 5 deletions(-)
> > >>>
> > >>> diff --git a/drivers/hwtracing/coresight/coresight-tmc-core.c b/drivers/hwtracing/coresight/coresight-tmc-core.c
> > >>> index 07abf28ad725..c106d142e632 100644
> > >>> --- a/drivers/hwtracing/coresight/coresight-tmc-core.c
> > >>> +++ b/drivers/hwtracing/coresight/coresight-tmc-core.c
> > >>> @@ -31,7 +31,7 @@ DEFINE_CORESIGHT_DEVLIST(etb_devs, "tmc_etb");
> > >>> DEFINE_CORESIGHT_DEVLIST(etf_devs, "tmc_etf");
> > >>> DEFINE_CORESIGHT_DEVLIST(etr_devs, "tmc_etr");
> > >>>
> > >>> -void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
> > >>> +int tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
> > >>> {
> > >>> struct coresight_device *csdev = drvdata->csdev;
> > >>> struct csdev_access *csa = &csdev->access;
> > >>> @@ -40,7 +40,9 @@ void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
> > >>> if (coresight_timeout(csa, TMC_STS, TMC_STS_TMCREADY_BIT, 1)) {
> > >>> dev_err(&csdev->dev,
> > >>> "timeout while waiting for TMC to be Ready\n");
> > >>> + return -EBUSY;
> > >>> }
> > >>> + return 0;
> > >>> }
> > >>>
> > >>> void tmc_flush_and_stop(struct tmc_drvdata *drvdata)
> > >>> diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
> > >>> index 867ad8bb9b0c..2da99dd41ed6 100644
> > >>> --- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
> > >>> +++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
> > >>> @@ -983,15 +983,21 @@ static void tmc_sync_etr_buf(struct tmc_drvdata *drvdata)
> > >>> etr_buf->ops->sync(etr_buf, rrp, rwp);
> > >>> }
> > >>>
> > >>> -static void __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
> > >>> +static int __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
> > >>> {
> > >>> u32 axictl, sts;
> > >>> struct etr_buf *etr_buf = drvdata->etr_buf;
> > >>> + int rc = 0;
> > >>>
> > >>> CS_UNLOCK(drvdata->base);
> > >>>
> > >>> /* Wait for TMCSReady bit to be set */
> > >>> - tmc_wait_for_tmcready(drvdata);
> > >>> + rc = tmc_wait_for_tmcready(drvdata);
> > >>> + if (rc) {
> > >>> + dev_err(&drvdata->csdev->dev, "not ready ETR isn't enabled\n");
> > >>> + CS_LOCK(drvdata->base);
> > >>> + return rc;
> > >>> + }
> > >>>
> > >>> writel_relaxed(etr_buf->size / 4, drvdata->base + TMC_RSZ);
> > >>> writel_relaxed(TMC_MODE_CIRCULAR_BUFFER, drvdata->base + TMC_MODE);
> > >>> @@ -1032,6 +1038,7 @@ static void __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
> > >>> tmc_enable_hw(drvdata);
> > >>>
> > >>> CS_LOCK(drvdata->base);
> > >>> + return rc;
> > >>> }
> > >>>
> > >>> static int tmc_etr_enable_hw(struct tmc_drvdata *drvdata,
> > >>> @@ -1060,7 +1067,12 @@ static int tmc_etr_enable_hw(struct tmc_drvdata *drvdata,
> > >>> rc = coresight_claim_device(drvdata->csdev);
> > >>> if (!rc) {
> > >>> drvdata->etr_buf = etr_buf;
> > >>> - __tmc_etr_enable_hw(drvdata);
> > >>> + rc = __tmc_etr_enable_hw(drvdata);
> > >>> + if (rc) {
> > >>> + drvdata->etr_buf = NULL;
> > >>> + coresight_disclaim_device(drvdata->csdev);
> > >>> + tmc_etr_disable_catu(drvdata);
> > >>> + }
> > >>> }
> > >>>
> > >>> return rc;
> > >>> diff --git a/drivers/hwtracing/coresight/coresight-tmc.h b/drivers/hwtracing/coresight/coresight-tmc.h
> > >>> index 66959557cf39..01c0382a29c0 100644
> > >>> --- a/drivers/hwtracing/coresight/coresight-tmc.h
> > >>> +++ b/drivers/hwtracing/coresight/coresight-tmc.h
> > >>> @@ -255,7 +255,7 @@ struct tmc_sg_table {
> > >>> };
> > >>>
> > >>> /* Generic functions */
> > >>> -void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata);
> > >>> +int tmc_wait_for_tmcready(struct tmc_drvdata *drvdata);
> > >>> void tmc_flush_and_stop(struct tmc_drvdata *drvdata);
> > >>> void tmc_enable_hw(struct tmc_drvdata *drvdata);
> > >>> void tmc_disable_hw(struct tmc_drvdata *drvdata);
> > >
> > > There is no point in waiting for a timeout, and then carrying on even
> > > when it is exceeded. As such this patch seems reasonable.
> > > We should also apply the same principle to the ETF and ETB devices
> > > which use the same tmc_wait_for_tmcready() function.
> >
> > +1
> >
> > I am fine with pushing this change, as it is doing the right thing.
> >
> > >
> > > However - the concern is that this appears to be happening on starting
> > > the ETR - there should be no outstanding AXI operations that cause the
> > > system to not be ready - as we will either be using this the first
> > > time after reset, or we should have successfully stopped and flushed
> > > the ETR from the previous operation. This warrants further
> > > investigation - should we be extending the timeout - which is already
> > > at a rather generous 100uS, and do we also need to check the MemErr
> > > bit in the status register?
> >
> > It would be good to dump the value of TMC_STATUS to see what is going
> > on.
> >
> > >
> > > As James says, we need details of when and how the problem occurs -
> > > as far as I know it has not been seen on any systems we currently use
> > > (though could have been missed given the current code)
> >
> > +1
> >
> > Kind regards
> > Suzuki
> >
> >
> > >
> > > Regards
> > >
> > > Mike
> > >
> > >
> > > --
> > > Mike Leach
> > > Principal Engineer, ARM Ltd.
> > > Manchester Design Centre. UK
> >