RE: [PATCH] tpm: don't destroy chip device prematurely

From: Winkler, Tomas
Date: Fri Oct 07 2016 - 10:26:30 EST



> On Thu, Oct 06, 2016 at 12:43:13AM +0000, Winkler, Tomas wrote:
>
> > > You keep asserting that, but it just isn't true at all.
> >
> > Okay, let's rephrase, that calling device_del before tpm_transmit is not sane
> when using runtime_pm.
>
> Maybe, but I haven't heard an explanation from you why that is the case, and I
> haven't found one on my own..
>
> > > Sure, but that relationship only maters if something does a PM call
> > > on the
> > > chip->dev, and AFAIK, nothing does that.
> > >
> > > Do you know differently?
> >
> > You've pasted that code in in the previous mail, parent is involved on device
> remove.
>
> I pasted the code to show it didn't seem possible to hit it because irq_safe
> should not be true for chip->dev. You never explained how this code can run.
>
> > > You pointed at something that isn't even run and said it is the
> > > source of the problem.. You really need to set up here and explain
> > > exactly what is happening.
> >
> > Sorry, lost you here.
>
> You haven't done a good job explaining the problem in detail beyond some
> general blame toward PM.
>
> > > > > Even if that pm_runtime_put is happening, why doesn't the
> > > > >
> > > > > + pm_runtime_get_sync(chip->dev.parent);
> > > > >
> > > > > The runtime_pm patch adds to tpm_transmit take care of bringing
> > > > > the device back?
> > > >
> > > > Unfortunately not because, because it gets out of sync and what is
> > > > actually happening is that idle callback is called and device is
> > > > put to idle between send and receive.
> > >
> > > What? As far as I understand this PM stuff, I would call that a very serious
> bug.
> >
> > Maybe, but then you have to find what a bug, currently it looks like
> > wrong usage of the device.
>
> Yes, you actually have to find and explain the bug to fix it.
>
> You still haven't explained at all why device_del on the child causes
> pm_runtime_get_sync() on the parent to malfunction. There is certainly seems
> to be nothing intrinsic about the PM core that would cause that.
>
> *That* is the really critical bit of explanation that is missing.
>
> Until you can provide it there is no reason to continue discussing.
>
> > > If a PM transition during transmit_cmd causes the TPM to abort/fail
> > > command execution then it *MUST* be prevented. Period.
> >
> > Or, we can call device_del after tpm2_shutdown.
>
> What? You haven't even established root cause, how do you know this bug
> won't manifest in other cases? It could very well be some kind of generic race-
> bug triggered by the proximity of device_del and pm_runtime_get_sync. Or a
> HW bug of some kind..
>
> > I will send you the actual trace, anyhow I've respin the original
> > version with go_idle and cmd_ready handlers, this is contra
> > productive, the time is just not right.
>
> I'm deeply skeptical about all your patches if you can't root-cause identify why
> the existing implementation isn't working.
>
> But, you can sort out with Jarkko what to do with crb.
>
> As long at the rest of the drivers and the core subsystem are not broken by the
> device_del change. Jarkko - can you confirm you will drop that patch?

So here I'm to say I'm sorry for misleading this, after all the doubts I got back to debugging and traces.
One thing for a reason moving the device_del, had really made the problem go away, but the real problem was unbalance runtime_pm PUT/GET from the tpm_crb probe function.
I will post the fixed patch, of course, this one should be dropped.
In any case, and this is not just to keep my ego up, that calling to the tom stack with unutilized dev is not healthy and we should look for that.
Thanks
Tomas