Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

From: Dmitry Osipenko
Date: Mon Apr 20 2020 - 20:32:27 EST


21.04.2020 01:11, Dmitry Osipenko ÐÐÑÐÑ:
> Hello Jon,
>
> 20.04.2020 22:53, Jon Hunter ÐÐÑÐÑ:
>> Hi Dmitry,
>>
>> On 24/03/2020 19:12, Dmitry Osipenko wrote:
>>> Boot CPU0 always handle I2C interrupt and under some rare circumstances
>>> (like running KASAN + NFS root) it may stuck in uninterruptible state for
>>> a significant time. In this case we will get timeout if I2C transfer is
>>> running on a sibling CPU, despite of IRQ being raised. In order to handle
>>> this rare condition, the IRQ status needs to be checked after completion
>>> timeout.
>>>
>>> Signed-off-by: Dmitry Osipenko <digetx@xxxxxxxxx>
>>> ---
>>> drivers/i2c/busses/i2c-tegra.c | 27 +++++++++++++++------------
>>> 1 file changed, 15 insertions(+), 12 deletions(-)
>>
>>
>> I have noticed a regression on tegra30-cardhu-a04 when testing system
>> suspend. Git bisect is pointing to this commit and reverting it fixes
>> the problem. In the below console log I2C fails to resume ...
>>
> ...
>> [ 60.690035] PM: Device 3000.pcie failed to resume noirq: error -16
>
> ...
>> [ 60.696859] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: EMEM address decode error (SMMU translation error [--S])
>>
>> [ 60.708876] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: Page fault (SMMU translation error [--S])
>
> This looks very wrong, the error tells that 3d hardware is active and
> doing something odd. Are you running some 3d tests?
>
> ...
>> Have you seen this?
>
> No, I haven't seen that. I'm not using PCIE and it looks like it's the
> problem.
>
> Looking at the PCIE driver code, seems it's not syncing the RPM state on
> suspend/resume.
>
> Please try this change:
>
> --- >8 ---
> diff --git a/drivers/pci/controller/pci-tegra.c
> b/drivers/pci/controller/pci-tegra.c
> index 3e64ba6a36a8..b1fcbae4109c 100644
> --- a/drivers/pci/controller/pci-tegra.c
> +++ b/drivers/pci/controller/pci-tegra.c
> @@ -2870,8 +2870,8 @@ static int __maybe_unused
> tegra_pcie_pm_resume(struct device *dev)
>
> static const struct dev_pm_ops tegra_pcie_pm_ops = {
> SET_RUNTIME_PM_OPS(tegra_pcie_pm_suspend, tegra_pcie_pm_resume, NULL)
> - SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(tegra_pcie_pm_suspend,
> - tegra_pcie_pm_resume)
> + SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
> + pm_runtime_force_resume)
> };
>
>
> static struct platform_driver tegra_pcie_driver = {
> --- >8 ---
>
> Secondly, I2C driver suspends on NOIRQ level, while APBDMA driver
> suspends on default level. This is also wrong, please try to apply this
> hunk as well:
>
> --- >8 ---
> diff --git a/drivers/dma/tegra20-apb-dma.c b/drivers/dma/tegra20-apb-dma.c
> index f6a2f42ffc51..e682ac86bd27 100644
> --- a/drivers/dma/tegra20-apb-dma.c
> +++ b/drivers/dma/tegra20-apb-dma.c
> @@ -1653,7 +1653,7 @@ static int __maybe_unused
> tegra_dma_dev_resume(struct device *dev)
> static const struct dev_pm_ops tegra_dma_dev_pm_ops = {
> SET_RUNTIME_PM_OPS(tegra_dma_runtime_suspend, tegra_dma_runtime_resume,
> NULL)
> - SET_SYSTEM_SLEEP_PM_OPS(tegra_dma_dev_suspend, tegra_dma_dev_resume)
> + SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(tegra_dma_dev_suspend, tegra_dma_dev_resume)
> };
>
> static const struct of_device_id tegra_dma_of_match[] = {
> --- >8 ---
>

Although, I'm now having a second though about the APBDMA change... I'm
recalling that there are some complications in regards to PCIE driver
suspending, requiring it to be at NOIRQ level, but this should be wrong
because PCIE driver uses voltage regulator driver at NOIRQ level, while
regulator drivers suspend on default level. The current behavior of the
PCIE driver should be wrong, I think it needs to be moved to the default
suspend-resume level somehow.