Re: [PATCH v9 09/17] arm: tegra20: cpuidle: Handle case where secondary CPU hangs on entering LP2

From: Dmitry Osipenko
Date: Fri Feb 21 2020 - 15:42:25 EST


21.02.2020 23:21, Dmitry Osipenko wrote:
> 21.02.2020 23:02, Daniel Lezcano wrote:
>> On 21/02/2020 19:19, Dmitry Osipenko wrote:
>>> 21.02.2020 20:36, Daniel Lezcano wrote:
>>>> On Fri, Feb 21, 2020 at 07:56:51PM +0300, Dmitry Osipenko wrote:
>>>>> Hello Daniel,
>>>>>
>>>>> 21.02.2020 18:43, Daniel Lezcano wrote:
>>>>>> On Thu, Feb 13, 2020 at 02:51:26AM +0300, Dmitry Osipenko wrote:
>>>>>>> It is possible that something may go wrong with the secondary CPU; in that
>>>>>>> case it is much nicer to get a dump of the flow-controller state before
>>>>>>> hanging the machine.
>>>>>>>
>>>>>>> Acked-by: Peter De Schrijver <pdeschrijver@xxxxxxxxxx>
>>>>>>> Tested-by: Peter Geis <pgwipeout@xxxxxxxxx>
>>>>>>> Tested-by: Jasper Korten <jja2000@xxxxxxxxx>
>>>>>>> Tested-by: David Heidelberg <david@xxxxxxx>
>>>>>>> Signed-off-by: Dmitry Osipenko <digetx@xxxxxxxxx>
>>>>>>> ---
>>>>
>>>> [ ... ]
>>>>
>>>>>>> +static int tegra20_wait_for_secondary_cpu_parking(void)
>>>>>>> +{
>>>>>>> + unsigned int retries = 3;
>>>>>>> +
>>>>>>> + while (retries--) {
>>>>>>> + ktime_t timeout = ktime_add_ms(ktime_get(), 500);
>>>>>>
>>>>>> Oops, I missed this one. Do not use ktime_get() in this code path; use jiffies.
>>>>>
>>>>> Could you please explain what benefits jiffies have over ktime_get()?
>>>>
>>>> ktime_get() is very slow; jiffies is updated every tick.
>>>
>>> But how are jiffies supposed to be updated if interrupts are disabled?
>>
>> Yeah, for that the other CPUs must not be idle.
>
> Okay, then jiffies can't be used here because this function is used for
> the coupled / power-gated state only. All CPUs are idling in this state.
>
>>> Aren't jiffies actually slower than ktime_get(), because jiffies are
>>> updated only every 10 or 1 ms (depending on CONFIG_HZ)?
>>
>> They are not slower; they have a lower resolution, which is 10ms or 4ms.
>>
>> Given the 500ms timeout, it is fine.
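A quick worked example of the resolution argument, as a userspace C sketch. `tick_period_us()` and `ms_to_ticks()` are hypothetical helpers, and the round-up only approximates the kernel's msecs_to_jiffies(). With CONFIG_HZ of 100, 250 or 1000, one jiffy lasts 10 ms, 4 ms or 1 ms. A 500 ms timeout therefore spans 50, 125 or 500 ticks, so tick granularity is at most 2% of the timeout. This does not address whether jiffies advance at all while every CPU is idle, which is the objection raised in the thread.

```c
#include <assert.h>

/* Length of one tick in microseconds for a given CONFIG_HZ value. */
static unsigned int tick_period_us(unsigned int hz)
{
	assert(hz > 0);
	return 1000000u / hz;
}

/*
 * Ticks needed to cover timeout_ms, rounded up so the wait is never
 * shorter than requested (similar in spirit to msecs_to_jiffies(),
 * but not its exact implementation).
 */
static unsigned long ms_to_ticks(unsigned int timeout_ms, unsigned int hz)
{
	assert(hz > 0);
	return ((unsigned long)timeout_ms * hz + 999u) / 1000u;
}
```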
>>
>>> We're kinda interested here in getting into the deep-idle state as quickly
>>> as possible. I measured how long the busy-loop below takes, and it
>>> is ~40-150us on average, which is good enough.
>>
>> ktime_get() takes a seqlock, and it is very slow.
>
> Since all CPUs are idling here, the locking isn't a problem.
>
> The wait_for_secondary_cpu_parking() function is called on CPU0; it
> waits for the secondary CPUs to enter a safe state before CPU0 can
> power-gate the whole CPU cluster.
>
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * The primary CPU0 core shall wait for the secondaries'
>>>>>>> + * shutdown in order to power off the CPU cluster safely.
>>>>>>> + * The waiting time depends on the current CPU frequency;
>>>>>>> + * it takes about 40-150us on average and over 1000us in
>>>>>>> + * the worst-case scenario.
>>>>>>> + */
>>>>>>> + do {
>>>>>>> + if (tegra_cpu_rail_off_ready())
>>>>>>> + return 0;
>>>>>>> +
>>>>>>> + } while (ktime_before(ktime_get(), timeout));
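The structure of the patch's wait loop can be approximated in userspace C as below. This is a sketch under stated assumptions, not the patch itself. `clock_gettime(CLOCK_MONOTONIC)` stands in for ktime_get(), and the hypothetical `fake_rail_ready()` stands in for tegra_cpu_rail_off_ready(). Whatever the real function does between retries is elided in the quote and is omitted here. Each retry opens a fresh time window and polls the readiness check back-to-back until it succeeds or the window closes. That is why the check can run thousands of times per window.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Readiness callback; stands in for tegra_cpu_rail_off_ready(). */
typedef bool (*ready_fn)(void *ctx);

/* Monotonic time in nanoseconds, playing the role of ktime_get(). */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Poll ready() back-to-back for up to window_ms per attempt, for at most
 * `retries` attempts. Returns 0 once ready() succeeds, else -ETIMEDOUT.
 */
static int wait_for_parking(ready_fn ready, void *ctx,
			    unsigned int retries, unsigned int window_ms)
{
	assert(ready != NULL);

	while (retries--) {
		int64_t timeout = now_ns() + (int64_t)window_ms * 1000000LL;

		do {
			if (ready(ctx))
				return 0;
		} while (now_ns() < timeout);
	}
	return -ETIMEDOUT;
}

/* Hypothetical fake rail: ready on its ready_after-th poll; 0 = never. */
struct fake_rail {
	unsigned int polls;
	unsigned int ready_after;
};

static bool fake_rail_ready(void *ctx)
{
	struct fake_rail *r = ctx;

	r->polls++;
	return r->ready_after != 0 && r->polls >= r->ready_after;
}
```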
>>>>>>
>>>>>> So this loop will aggressively call tegra_cpu_rail_off_ready() and retry 3
>>>>>> times. The tegra_cpu_rail_off_ready() function can be called thousands of times
>>>>>> here, and in the worst case the function will hang for 1.5s :/
>>>>>>
>>>>>> I suggest something like:
>>>>>>
>>>>>> while (retries-- && !tegra_cpu_rail_off_ready())
>>>>>> udelay(100);
>>>>>>
>>>>>> So at most <retries> calls to tegra_cpu_rail_off_ready() and at most
>>>>>> 100us x <retries> of impact.
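Daniel's suggestion bounds the number of readiness checks rather than the elapsed time. A userspace C sketch of that shape follows. `busy_udelay()` is a hypothetical stand-in that spins on the monotonic clock, modelling the thread's point that udelay() busy-loops on these cores. The hypothetical `fake_rail_ready()` again replaces tegra_cpu_rail_off_ready().

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Readiness callback; stands in for tegra_cpu_rail_off_ready(). */
typedef bool (*ready_fn)(void *ctx);

static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical udelay() stand-in: spin on the clock for `us` microseconds. */
static void busy_udelay(unsigned int us)
{
	int64_t end = now_ns() + (int64_t)us * 1000;

	while (now_ns() < end)
		; /* pure busy-wait, no sleeping */
}

/*
 * At most `retries` calls to ready(), spinning delay_us after each
 * failed call. Returns 0 once ready() succeeds, else -ETIMEDOUT.
 */
static int wait_for_parking_udelay(ready_fn ready, void *ctx,
				   unsigned int retries, unsigned int delay_us)
{
	assert(ready != NULL);

	while (retries--) {
		if (ready(ctx))
			return 0;
		busy_udelay(delay_us);
	}
	return -ETIMEDOUT;
}

/* Hypothetical fake rail: ready on its ready_after-th poll; 0 = never. */
struct fake_rail {
	unsigned int polls;
	unsigned int ready_after;
};

static bool fake_rail_ready(void *ctx)
{
	struct fake_rail *r = ctx;

	r->polls++;
	return r->ready_after != 0 && r->polls >= r->ready_after;
}
```

This version makes at most `retries` checks and spins for roughly `retries * delay_us` in total. The cost is up to `delay_us` of extra latency after the secondaries have actually parked, which is the trade-off Dmitry raises in reply.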
>>>>> But udelay() also results in the CPU spinning in a busy-loop, so
>>>>> what's the difference?
>>>>
>>>> Busy looping instead of register reads, with all the hardware machinery involved behind them.
>>>
>>> Please note that this code runs only on older Cortex-A9/A15 cores, which
>>> don't support WFE-based delays, and thus the CPU always busy-loops
>>> inside udelay().
>>>
>>> What if I add cpu_relax() to the loop? Do you think it could
>>> have any positive effect?
>>
>> I think udelay() has a call to cpu_relax().
>
> Yes, my point is that udelay() doesn't bring much benefit for us here
> because:
>
> 1. we want to enter the power-gated state as quickly as possible, and
> udelay() just adds an unnecessary delay
>
> 2. udelay() spins in a busy-loop until the delay expires, just like we're
> already doing in this function

I'll try the udelay()-loop over the weekend and will see if it makes any
real difference, maybe I'm missing something.

If it doesn't make any difference, I'll leave this patch as-is, okay?