Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate

From: Stephen Boyd
Date: Mon Apr 11 2022 - 15:07:24 EST


Quoting Alex Elder (2022-04-11 08:59:07)
> On 4/5/22 6:00 PM, Stephen Boyd wrote:
> > Quoting Georgi Djakov (2021-11-25 09:47:51)
> >> From: Mike Tipton <mdtipton@xxxxxxxxxxxxxx>
> >>
> >> We're only adding BCMs to the commit list in aggregate(), but there are
> >> cases where pre_aggregate() is called without subsequently calling
> >> aggregate(). In particular, in icc_sync_state() when a node with initial
> >> BW has zero requests. Since BCMs aren't added to the commit list in
> >> these cases, we don't actually send the zero BW request to HW. So the
> >> resources remain on unnecessarily.
> >>
> >> Add BCMs to the commit list in pre_aggregate() instead, which is always
> >> called even when there are no requests.
> >>
> >> Signed-off-by: Mike Tipton <mdtipton@xxxxxxxxxxxxxx>
> >> [georgi: remove icc_sync_state for platforms with incomplete support]
> >> Signed-off-by: Georgi Djakov <djakov@xxxxxxxxxx>
>
> I'm back from vacation and am finally giving proper attention to
> this. I want to make sure I understand the problem, because there
> are (at least) two parts to it.
>
> - The first problem you observe is that you are not seeing XO
> shutdown on suspend on a Lazor device.
> - You didn't say this directly but I think you are seeing this
> on Linux v5.15.y (the 5.15 LTS branch), or perhaps on something
> derived from that branch.

Yes.

> - You find that if you back-port (or cherry-pick?) the commit
> that landed upstream as b95b668eaaa2 ("interconnect: qcom:
> icc-rpmh: Add BCMs to commit list in pre_aggregate
> "), you
> *do* see XO shutdown on suspend, as desired.

Correct.

>
> Here's what I understand that commit to do:
> - In some cases, the bus clock managers (BCMs) are configured
> by the boot loader so that some interconnects have non-zero
> initial bandwidth.
> - There is no sense in keeping an interconnect active if Linux
> has nothing that requires its use. So we would like Linux to
> ensure the configured bandwidth for an *unused* interconnect
> is zero.
> - Prior to that commit, BCM-managed hardware was only queued
> to update its configuration when the ->aggregate interconnect
> provider function was called. After that commit, updates were
> queued by the ->pre_aggregate provider function.

Also before that commit interconnects are maxed out, which doesn't
really matter for XO shutdown but it means that we're running faster
than what the bootloader configures if boot is slower.

> - Unlike the ->aggregate callback, the ->pre_aggregate provider
> function queues updates to the hardware configuration whether
> or not they have active users.
> - The result of this commit is that the hardware configuration
> for all defined BCM-managed interconnects is updated, and in
> particular, the configured bandwidth for unused interconnects
> is set to zero.

Yep.

>
> When unused interconnects are configured for zero bandwidth, they
> do not require an active main XO clock, and so with this commit
> it becomes possible for the XO clock to be shut down.
>
> And that's why this commit addresses your XO shutdown problem on
> the Linux 5.15 LTS branch.
>
> Is the above an accurate description?

Yeah pretty much. Without the interconnect patch I can't get XO
shutdown.

>
> Looking at that branch, I see this commit: f753067494c27
> ("Revert "interconnect: qcom: icc-rpmh: Add BCMs to commit
> list in pre_aggregate"
> "). Which shows that an attempt was made
> to include this commit in the 5.15 LTS branch, but it caused
> some *other* regressions. That suggests this might not be
> easy to fix.

Indeed. The commit was reverted because it broke reboot for me. I see it
was reintroduced a few months later though when I was eating
Thanksgiving dinner and I didn't notice until now. Interestingly the
reboot issue is gone. Here's the crash from back then.

SError Interrupt on CPU6, code 0xbe000411 -- SError
CPU: 6 PID: 8772 Comm: reboot Not tainted 5.14.0-rc5-next-20210810+ #1
Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
pstate: 004000c9 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : el1_interrupt+0x20/0x60
lr : el1h_64_irq_handler+0x18/0x24
sp : ffffffc0114139c0
x29: ffffffc0114139c0 x28: ffffff808a8b2240 x27: 0000000000000000
x26: ffffff80817ec018 x25: ffffffd79e8f0000 x24: ffffffd79e957000
x23: 0000000000400009 x22: ffffffd79dac527c x21: ffffffc011413b40
x20: ffffffd79d6100f8 x19: ffffffc0114139f0 x18: 0000000000022a07
x17: 0000000000000000 x16: ffffffd79dac52e4 x15: ffffff80d291fe80
x14: 0000000000000580 x13: 000000000000300c x12: ffffff80b4f7ed10
x11: 0000000000000003 x10: 00000000c0000000 x9 : 0000000000000003
x8 : 00000000000000c0 x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000001
x5 : 0000000000170006 x4 : ffffff80d291bcc0 x3 : ffffffd79e429b41
x2 : 0000000000000002 x1 : ffffffd79d6100f8 x0 : ffffffc0114139f0
Kernel panic - not syncing: Asynchronous SError Interrupt
CPU: 6 PID: 8772 Comm: reboot Not tainted 5.14.0-rc5-next-20210810+ #1
Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
Call trace:
dump_backtrace+0x0/0x1d4
show_stack+0x24/0x30
dump_stack_lvl+0x64/0x7c
dump_stack+0x18/0x38
panic+0x150/0x38c
nmi_panic+0x88/0xa0
arm64_serror_panic+0x74/0x80
is_valid_bugaddr+0x0/0x1c
el1h_64_error_handler+0x30/0x48
el1h_64_error+0x78/0x7c
el1_interrupt+0x20/0x60
el1h_64_irq_handler+0x18/0x24
el1h_64_irq+0x78/0x7c
refcount_dec_not_one+0x48/0xb0
refcount_dec_and_mutex_lock+0x1c/0xb4
ipa_clock_put+0x34/0x74 [ipa]
ipa_uc_deconfig+0x4c/0x5c [ipa]
ipa_deconfig+0x30/0x90 [ipa]
ipa_remove+0xbc/0x11c [ipa]
platform_shutdown+0x30/0x3c
device_shutdown+0x150/0x208
kernel_restart_prepare+0x44/0x50

>
> ---
>
> The second problem you have is exhibited by the IPA driver if
> the "fix" commit (upstream b95b668eaaa2) is back-ported to the
> Linux 5.10.y LTS branch (along with some other prerequisite
> commits). We can conclude that applying the above commit
> makes the bandwidth for an unused interconnect (or perhaps
> the rate for the IPA core clock) get set to zero. And in that
> case, an attempt to access IPA hardware leads to the crash you
> observed.
>
> The IPA driver does not implement runtime power management
> until Linux v5.15. You later said you thought enabling that
> might ensure the clock and interconnects were active when
> needed by the IPA driver, and I concur (but there could be a
> little more to it).

Is the runtime PM patch series necessary to enable the IPA clk and
interconnects? Things don't look good on 5.10.y and I'm not sure it will
be workable. Commit b1d681d8d324 ("interconnect: Add sync state
support") was introduced in v5.10 and that seems to be the commit that
broke suspend on Lazor.

>
> In any case, based on the time stamp in your log, it seems
> this problem is likely occurring upon the first access to IPA
> hardware.
>
> I have a hunch about what might be happening here. There is
> some synchronization that must occur between the AP and modem
> when IPA is starting up. Until that synchronization step has
> completed, we can't allow the IPA network device to be opened.

Is there a commit that implements this? Or how is the synchronization
done? I can debug more and see if that synchronization is happening.

> In later kernels I think this is precluded, but perhaps in
> Linux v5.10 it isn't. Until I look a little more closely I'm
> not sure what would happen, but it *could* be this.
>
> I'm going to look a little how the particular access that
> caused the crash is prevented in newer kernels. It could
> be that back-porting that (or re-implementing it for the
> older kernel) will address the crash you're seeing.
>