Re: [PATCH] tpm: don't destroy chip device prematurely

From: Jason Gunthorpe
Date: Wed Oct 05 2016 - 22:08:07 EST


On Thu, Oct 06, 2016 at 12:43:13AM +0000, Winkler, Tomas wrote:

> > You keep asserting that, but it just isn't true at all.
>
> Okay, let's rephrase, that calling device_del before tpm_transmit is not sane when using runtime_pm.

Maybe, but I haven't heard an explanation from you why that is the case,
and I haven't found one on my own..

> > Sure, but that relationship only maters if something does a PM call on the
> > chip->dev, and AFAIK, nothing does that.
> >
> > Do you know differently?
>
> You've pasted that code in in the previous mail, parent is involved on device remove.

I pasted the code to show it didn't seem possible to hit it because
irq_safe should not be true for chip->dev. You never explained how
this code can run.

> > You pointed at something that isn't even run and said it is the source of the
> > problem.. You really need to set up here and explain exactly what is
> > happening.
>
> Sorry, lost you here.

You haven't done a good job explaining the problem in detail beyond some general
blame toward PM.

> > > > Even if that pm_runtime_put is happening, why doesn't the
> > > >
> > > > + pm_runtime_get_sync(chip->dev.parent);
> > > >
> > > > The runtime_pm patch adds to tpm_transmit take care of bringing the
> > > > device back?
> > >
> > > Unfortunately not because, because it gets out of sync and what is
> > > actually happening is that idle callback is called and device is put
> > > to idle between send and receive.
> >
> > What? As far as I understand this PM stuff, I would call that a very serious bug.
>
> Maybe, but then you have to find what a bug, currently it looks
> like wrong usage of the device.

Yes, you actually have to find and explain the bug to fix it.

You still haven't explained at all why device_del on the child causes
pm_runtime_get_sync() on the parent to malfunction. There is certainly
seems to be nothing intrinsic about the PM core that would cause that.

*That* is the really critical bit of explanation that is missing.

Until you can provide it there is no reason to continue discussing.

> > If a PM transition during transmit_cmd causes the TPM to abort/fail command
> > execution then it *MUST* be prevented. Period.
>
> Or, we can call device_del after tpm2_shutdown.

What? You haven't even established root cause, how do you know this
bug won't manifest in other cases? It could very well be some kind of
generic race-bug triggered by the proximity of device_del and
pm_runtime_get_sync. Or a HW bug of some kind..

> I will send you the actual trace, anyhow I've respin the original
> version with go_idle and cmd_ready handlers, this is contra
> productive, the time is just not right.

I'm deeply skeptical about all your patches if you can't root-cause
identify why the existing implementation isn't working.

But, you can sort out with Jarkko what to do with crb.

As long at the rest of the drivers and the core subsystem are not
broken by the device_del change. Jarkko - can you confirm you will
drop that patch?

Jason