Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate

From: Stephen Boyd
Date: Mon Apr 11 2022 - 16:20:33 EST


Quoting Stephen Boyd (2022-04-11 12:06:19)
> Quoting Alex Elder (2022-04-11 08:59:07)
> > The second problem you have is exhibited by the IPA driver if
> > the "fix" commit (upstream b95b668eaaa2) is back-ported to the
> > Linux 5.10.y LTS branch (along with some other prerequisite
> > commits). We can conclude that applying the above commit
> > makes the bandwidth for an unused interconnect (or perhaps
> > the rate for the IPA core clock) get set to zero. And in that
> > case, an attempt to access IPA hardware leads to the crash you
> > observed.
> >
> > The IPA driver does not implement runtime power management
> > until Linux v5.15. You later said you thought enabling that
> > might ensure the clock and interconnects were active when
> > needed by the IPA driver, and I concur (but there could be a
> > little more to it).
>
> Is the runtime PM patch series necessary to enable the IPA clk and
> interconnects? Things don't look good on 5.10.y and I'm not sure it will
> be workable. Commit b1d681d8d324 ("interconnect: Add sync state
> support") was introduced in v5.10 and that seems to be the commit that
> broke suspend on Lazor.
>
> >
> > In any case, based on the time stamp in your log, it seems
> > this problem is likely occurring upon the first access to IPA
> > hardware.
> >
> > I have a hunch about what might be happening here. There is
> > some synchronization that must occur between the AP and modem
> > when IPA is starting up. Until that synchronization step has
> > completed, we can't allow the IPA network device to be opened.
>
> Is there a commit that implements this? Or how is the synchronization
> done? I can debug more and see if that synchronization is happening.
>
> > In later kernels I think this is precluded, but perhaps in
> > Linux v5.10 it isn't. Until I look a little more closely I'm
> > not sure what would happen, but it *could* be this.
> >
> > I'm going to look a little how the particular access that
> > caused the crash is prevented in newer kernels. It could
> > be that back-porting that (or re-implementing it for the
> > older kernel) will address the crash you're seeing.
> >

I think it's some IPA unclocked access because it's an SError async
abort. I looked some at the IPA clk implementation and it is a little
concerning. I see two problems. The first problem is that clk-rpmh is
only voting on the active state for the IPA clk. Maybe RPMh will move
the sleep/wake state over from the active state automatically as long as
we never make a request in either of those states? I don't know, but it
looks concerning and either needs to be fixed or a big comment
indicating that the copy happens. This patch makes it set a frequency in
each state bucket.

----8<----
diff --git a/drivers/clk/qcom/clk-rpmh.c b/drivers/clk/qcom/clk-rpmh.c
index 74e57c84f60a..03c8c0e7db7a 100644
--- a/drivers/clk/qcom/clk-rpmh.c
+++ b/drivers/clk/qcom/clk-rpmh.c
@@ -260,6 +260,7 @@ static int clk_rpmh_bcm_send_cmd(struct clk_rpmh
*c, bool enable)
struct tcs_cmd cmd = { 0 };
u32 cmd_state;
int ret = 0;
+ enum rpmh_state state;

mutex_lock(&rpmh_clk_lock);
if (enable) {
@@ -274,15 +275,19 @@ static int clk_rpmh_bcm_send_cmd(struct clk_rpmh
*c, bool enable)
cmd.addr = c->res_addr;
cmd.data = BCM_TCS_CMD(1, enable, 0, cmd_state);

- ret = clk_rpmh_send(c, RPMH_ACTIVE_ONLY_STATE, &cmd, enable);
- if (ret) {
- dev_err(c->dev, "set active state of %s failed: (%d)\n",
- c->res_name, ret);
- } else {
- c->last_sent_aggr_state = cmd_state;
+ for (state = RPMH_SLEEP_STATE; state <= RPMH_ACTIVE_ONLY_STATE; state++) {
+ ret = clk_rpmh_send(c, state, &cmd, enable);
+ if (ret) {
+ dev_err(c->dev, "set %s state of %s failed: (%d)\n",
+ !state ? "sleep" :
+ state == RPMH_WAKE_ONLY_STATE ?
+ "wake" : "active", c->res_name, ret);
+ goto out;
+ }
}
+ c->last_sent_aggr_state = cmd_state;
}
-
+out:
mutex_unlock(&rpmh_clk_lock);

return ret;

---8<-----

The second problem I see is that the IPA clk resource is "IP0" but it is
still defined in the interconnect/qcom/sc7180.c file.

$ git grep \"IP0\" -- drivers/{interconnect,clk}/
drivers/clk/qcom/clk-rpmh.c:DEFINE_CLK_RPMH_BCM(sdm845, ipa, "IP0");
drivers/clk/qcom/clk-rpmh.c:DEFINE_CLK_RPMH_BCM(sdx55, ipa, "IP0");
drivers/interconnect/qcom/sc7180.c:DEFINE_QBCM(bcm_ip0, "IP0", false,
&ipa_core_slave);
drivers/interconnect/qcom/sc8180x.c:DEFINE_QBCM(bcm_ip0, "IP0", false,
&slv_ipa_core_slave);
drivers/interconnect/qcom/sdx55.c:DEFINE_QBCM(bcm_ip0, "IP0", false,
&ipa_core_slave);
drivers/interconnect/qcom/sm8150.c:DEFINE_QBCM(bcm_ip0, "IP0", false,
&ipa_core_slave);
drivers/interconnect/qcom/sm8250.c:DEFINE_QBCM(bcm_ip0, "IP0", false,
&ipa_core_slave);

That sounds pretty bad because it means both interconnect and clk frameworks
are trying to control the same RPMh resource, trampling over each other.
Combine that with unused clks and sync_state support and I don't know
what the state of the IP0 resource really is anymore.

When the clk-rpmh driver got an IPA clk a while ago I didn't want it to
manage this BCM[1]. That's because it doesn't use anything in the clk
framework besides the clk_set_rate() interface. No clk tree management,
or recalc_rate interface, etc. It seems like it kept getting added to
the interconnect framework after sdm845 though. Confusing!

Putting the discussion about how it should be represented in the kernel
aside, I applied this patch and rebooted 30 times and didn't see a
crash. I think this fixed it. I will back out the clk-rpmh patch now and
see if it is still fixed. It would be great if someone from qcom can
tell me what's going on.

diff --git a/drivers/interconnect/qcom/sc7180.c
b/drivers/interconnect/qcom/sc7180.c
index 12d59c36df53..5f7c0f85fa8e 100644
--- a/drivers/interconnect/qcom/sc7180.c
+++ b/drivers/interconnect/qcom/sc7180.c
@@ -47,7 +47,6 @@ DEFINE_QNODE(qnm_mnoc_sf,
SC7180_MASTER_MNOC_SF_MEM_NOC, 1, 32, SC7180_SLAVE_GEM
DEFINE_QNODE(qnm_snoc_gc, SC7180_MASTER_SNOC_GC_MEM_NOC, 1, 8,
SC7180_SLAVE_LLCC);
DEFINE_QNODE(qnm_snoc_sf, SC7180_MASTER_SNOC_SF_MEM_NOC, 1, 16,
SC7180_SLAVE_LLCC);
DEFINE_QNODE(qxm_gpu, SC7180_MASTER_GFX3D, 2, 32,
SC7180_SLAVE_GEM_NOC_SNOC, SC7180_SLAVE_LLCC);
-DEFINE_QNODE(ipa_core_master, SC7180_MASTER_IPA_CORE, 1, 8,
SC7180_SLAVE_IPA_CORE);
DEFINE_QNODE(llcc_mc, SC7180_MASTER_LLCC, 2, 4, SC7180_SLAVE_EBI1);
DEFINE_QNODE(qhm_mnoc_cfg, SC7180_MASTER_CNOC_MNOC_CFG, 1, 4,
SC7180_SLAVE_SERVICE_MNOC);
DEFINE_QNODE(qxm_camnoc_hf0, SC7180_MASTER_CAMNOC_HF0, 2, 32,
SC7180_SLAVE_MNOC_HF_MEM_NOC);
@@ -129,7 +128,6 @@ DEFINE_QNODE(qhs_mdsp_ms_mpu_cfg,
SC7180_SLAVE_MSS_PROC_MS_MPU_CFG, 1, 4);
DEFINE_QNODE(qns_gem_noc_snoc, SC7180_SLAVE_GEM_NOC_SNOC, 1, 8,
SC7180_MASTER_GEM_NOC_SNOC);
DEFINE_QNODE(qns_llcc, SC7180_SLAVE_LLCC, 1, 16, SC7180_MASTER_LLCC);
DEFINE_QNODE(srvc_gemnoc, SC7180_SLAVE_SERVICE_GEM_NOC, 1, 4);
-DEFINE_QNODE(ipa_core_slave, SC7180_SLAVE_IPA_CORE, 1, 8);
DEFINE_QNODE(ebi, SC7180_SLAVE_EBI1, 2, 4);
DEFINE_QNODE(qns_mem_noc_hf, SC7180_SLAVE_MNOC_HF_MEM_NOC, 1, 32,
SC7180_MASTER_MNOC_HF_MEM_NOC);
DEFINE_QNODE(qns_mem_noc_sf, SC7180_SLAVE_MNOC_SF_MEM_NOC, 1, 32,
SC7180_MASTER_MNOC_SF_MEM_NOC);
@@ -160,7 +158,6 @@ DEFINE_QBCM(bcm_mc0, "MC0", true, &ebi);
DEFINE_QBCM(bcm_sh0, "SH0", true, &qns_llcc);
DEFINE_QBCM(bcm_mm0, "MM0", false, &qns_mem_noc_hf);
DEFINE_QBCM(bcm_ce0, "CE0", false, &qxm_crypto);
-DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave);
DEFINE_QBCM(bcm_cn0, "CN0", true, &qnm_snoc, &xm_qdss_dap,
&qhs_a1_noc_cfg, &qhs_a2_noc_cfg, &qhs_ahb2phy0, &qhs_aop, &qhs_aoss,
&qhs_boot_rom, &qhs_camera_cfg, &qhs_camera_nrt_throttle_cfg,
&qhs_camera_rt_throttle_cfg, &qhs_clk_ctl, &qhs_cpr_cx, &qhs_cpr_mx,
&qhs_crypto0_cfg, &qhs_dcc_cfg, &qhs_ddrss_cfg, &qhs_display_cfg,
&qhs_display_rt_throttle_cfg, &qhs_display_throttle_cfg, &qhs_glm,
&qhs_gpuss_cfg, &qhs_imem_cfg, &qhs_ipa, &qhs_mnoc_cfg, &qhs_mss_cfg,
&qhs_npu_cfg, &qhs_npu_dma_throttle_cfg, &qhs_npu_dsp_throttle_cfg,
&qhs_pimem_cfg, &qhs_prng, &qhs_qdss_cfg, &qhs_qm_cfg,
&qhs_qm_mpu_cfg, &qhs_qup0, &qhs_qup1, &qhs_security, &qhs_snoc_cfg,
&qhs_tcsr, &qhs_tlmm_1, &qhs_tlmm_2, &qhs_tlmm_3, &qhs_ufs_mem_cfg,
&qhs_usb3, &qhs_venus_cfg, &qhs_venus_throttle_cfg,
&qhs_vsense_ctrl_cfg, &srvc_cnoc);
DEFINE_QBCM(bcm_mm1, "MM1", false, &qxm_camnoc_hf0_uncomp,
&qxm_camnoc_hf1_uncomp, &qxm_camnoc_sf_uncomp, &qhm_mnoc_cfg,
&qxm_mdp0, &qxm_rot, &qxm_venus0, &qxm_venus_arm9);
DEFINE_QBCM(bcm_sh2, "SH2", false, &acm_sys_tcu);
@@ -372,22 +369,6 @@ static struct qcom_icc_desc sc7180_gem_noc = {
.num_bcms = ARRAY_SIZE(gem_noc_bcms),
};

-static struct qcom_icc_bcm *ipa_virt_bcms[] = {
- &bcm_ip0,
-};
-
-static struct qcom_icc_node *ipa_virt_nodes[] = {
- [MASTER_IPA_CORE] = &ipa_core_master,
- [SLAVE_IPA_CORE] = &ipa_core_slave,
-};
-
-static struct qcom_icc_desc sc7180_ipa_virt = {
- .nodes = ipa_virt_nodes,
- .num_nodes = ARRAY_SIZE(ipa_virt_nodes),
- .bcms = ipa_virt_bcms,
- .num_bcms = ARRAY_SIZE(ipa_virt_bcms),
-};
-
static struct qcom_icc_bcm *mc_virt_bcms[] = {
&bcm_acv,
&bcm_mc0,
@@ -519,8 +500,6 @@ static const struct of_device_id qnoc_of_match[] = {
.data = &sc7180_dc_noc},
{ .compatible = "qcom,sc7180-gem-noc",
.data = &sc7180_gem_noc},
- { .compatible = "qcom,sc7180-ipa-virt",
- .data = &sc7180_ipa_virt},
{ .compatible = "qcom,sc7180-mc-virt",
.data = &sc7180_mc_virt},
{ .compatible = "qcom,sc7180-mmss-noc",

----8<----

[1] https://lore.kernel.org/r/154411934006.88331.2149751521264448532@xxxxxxxxxxxxxxxxxxxxxxxxxx