Re: [REGRESSION] CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and slow-stuck reboots

From: Alexey Klimov
Date: Thu Mar 16 2023 - 22:11:48 EST


On Wed, 15 Mar 2023 at 11:16, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
>
> (could you wrap your email please)

Ouch. Sorry.

> On Tue, Mar 14, 2023 at 11:00:04PM +0000, Alexey Klimov wrote:
> > #regzbot introduced: 0c5ffc3d7b15
> > #regzbot title: CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and
> > slow-stuck reboots
> >
> > The upstream changes are being merged into android-mainline repo and
> > at some point we started to observe kernel panics on reboot or long
> > reboot times.
>
> On what hardware? I find it somewhat hard to follow this DT code :/

Pixel 6.

> > Looks like adding CPUIDLE_FLAG_RCU_IDLE flag to idle driver caused
> > this behaviour. The minimal change that is required for this system
> > to avoid the regression would be one liner that removes the flag
> > (below).
> >
> > But if it is a real regression, then other idle drivers, if used, will
> > likely cause this regression too with the same ufshcd driver. There is
> > also a suspicion that CPUIDLE_FLAG_RCU_IDLE just revealed or uncovered
> > some other problem.
> >
> > Any thoughts on this?
>
> So ARM has a weird 'rule' in that idle state 0 (wfi) should not have
> RCU_IDLE set, while others should have.
>
> Of the dt_init_idle_driver() users:
>
> - cpuidle-arm: arm_enter_idle_state()
> - cpuidle-big_little: bl_enter_powerdown() does ct_cpuidle_{enter,exit}()
> - cpuidle-psci: psci_enter_idle_state() uses CPU_PM_CPU_IDLE_ENTER_PARAM_RCU()
> - cpuidle-qcom-spm: spm_enter_idle_state() uses CPU_PM_CPU_IDLE_ENTER_PARAM()
> - cpuidle-riscv-sbi: sbi_cpuidle_enter_state() uses CPU_PM_CPU_IDLE_ENTER_*_PARAM()
>
> All of them start on index 1 and hence should have RCU_IDLE set, but at
> least the arm, qcom-spm and riscv-sbi don't actually appear to abide by
> the rules.
>
> Fixing that gives me the below; does that help?

I double-checked and, unfortunately, the patch doesn't change the
behaviour at all.

The first problematic driver is ufshcd, which slows down the reboot the
most. The other one is the wlan bcm driver, whose callback is called
from blocking_notifier_call_chain(...). Backtraces from it, when it is
stuck or slow, involve the pci and net subsystems, but I haven't yet
narrowed it down to an exact function or specific flow.

The patch from Bart helps with the ufshcd driver, but reboot times are
still 10-20 seconds. Removing the CPUIDLE_FLAG_RCU_IDLE flag helps with
both drivers.

Is there any debug data I can collect to help with this, or are there
any other patches I could test?

Thanks,
Alexey