Re: [PATCH V2] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

From: Lukasz Majczak
Date: Wed Aug 03 2022 - 08:05:21 EST


On Fri, Jul 29, 2022 at 16:36 Vidya Sagar <vidyas@xxxxxxxxxx> wrote:
>
> Hi Lukasz,
> Thanks for sharing your observations.
>
> Could you please also share the output of 'sudo lspci -vvvv' before and
> after a suspend/resume cycle with the latest linux-next?
> Do we still see the L1SS capability disappearing after resume?
>
> Thanks,
> Vidya Sagar
>
> On 7/29/2022 3:09 PM, Lukasz Majczak wrote:
> > On Tue, Jul 26, 2022 at 09:20 Lukasz Majczak <lma@xxxxxxxxxxxx> wrote:
> >>
> >> On Tue, Jul 26, 2022 at 00:51 Rajat Jain <rajatja@xxxxxxxxxx> wrote:
> >>>
> >>> Hello,
> >>>
> >>> On Sat, Jul 23, 2022 at 10:03 AM Vidya Sagar <vidyas@xxxxxxxxxx> wrote:
> >>>>
> >>>> Agree with Bjorn's observations.
> >>>> The fact that the L1SS capability registers themselves disappeared from
> >>>> the root port after resume suggests that something is wrong with the
> >>>> BIOS itself.
> >>>> Could you please check from that perspective?
> >>>
> >>> ChromeOS Intel platforms use S0ix (suspend-to-idle) for suspend. This
> >>> is a shallower sleep state that preserves more state than, e.g., S3
> >>> (suspend-to-RAM). When we use S0ix, the BIOS does not come into the
> >>> picture at all, i.e. after the kernel runs its suspend routines, it just
> >>> puts the CPU into the S0ix state. So I do not think there is a BIOS angle
> >>> to this.
> >>>
> >>>
> >>>>
> >>>> Thanks,
> >>>> Vidya Sagar
> >>>>
> >>>>
> >>>> On 7/22/2022 11:12 PM, Bjorn Helgaas wrote:
> >>>>> On Fri, Jul 22, 2022 at 11:41:14AM +0200, Lukasz Majczak wrote:
> >>>>>> On Fri, Jul 22, 2022 at 09:31 Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx> wrote:
> >>>>>>> On Fri, Jul 15, 2022 at 6:38 PM Ben Chuang <benchuanggli@xxxxxxxxx> wrote:
> >>>>>>>> On Tue, Jul 5, 2022 at 2:00 PM Vidya Sagar <vidyas@xxxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>> Previously ASPM L1 Substates control registers (CTL1 and CTL2) weren't
> >>>>>>>>> saved and restored during suspend/resume leading to L1 Substates
> >>>>>>>>> configuration being lost post-resume.
> >>>>>>>>>
> >>>>>>>>> Save the L1 Substates control registers so that the configuration is
> >>>>>>>>> retained post-resume.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Vidya Sagar <vidyas@xxxxxxxxxx>
> >>>>>>>>> Tested-by: Abhishek Sahu <abhsahu@xxxxxxxxxx>
> >>>>>>>>
> >>>>>>>> Hi Vidya,
> >>>>>>>>
> >>>>>>>> I tested this patch on kernel v5.19-rc6.
> >>>>>>>> The test device is GL9755 card reader controller on Intel i5-10210U RVP.
> >>>>>>>> This patch can restore L1SS after suspend/resume.
> >>>>>>>>
> >>>>>>>> The test results are as follows:
> >>>>>>>>
> >>>>>>>> After Boot:
> >>>>>>>> #lspci -d 17a0:9755 -vvv | grep -A5 "L1 PM Substates"
> >>>>>>>> Capabilities: [110 v1] L1 PM Substates
> >>>>>>>> L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+
> >>>>>>>> ASPM_L1.1+ L1_PM_Substates+
> >>>>>>>> PortCommonModeRestoreTime=255us
> >>>>>>>> PortTPowerOnTime=3100us
> >>>>>>>> L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> >>>>>>>> T_CommonMode=0us LTR1.2_Threshold=3145728ns
> >>>>>>>> L1SubCtl2: T_PwrOn=3100us
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> After suspend/resume without this patch.
> >>>>>>>> #lspci -d 17a0:9755 -vvv | grep -A5 "L1 PM Substates"
> >>>>>>>> Capabilities: [110 v1] L1 PM Substates
> >>>>>>>> L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+
> >>>>>>>> ASPM_L1.1+ L1_PM_Substates+
> >>>>>>>> PortCommonModeRestoreTime=255us
> >>>>>>>> PortTPowerOnTime=3100us
> >>>>>>>> L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
> >>>>>>>> T_CommonMode=0us LTR1.2_Threshold=0ns
> >>>>>>>> L1SubCtl2: T_PwrOn=10us
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> After suspend/resume with this patch.
> >>>>>>>> #lspci -d 17a0:9755 -vvv | grep -A5 "L1 PM Substates"
> >>>>>>>> Capabilities: [110 v1] L1 PM Substates
> >>>>>>>> L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+
> >>>>>>>> ASPM_L1.1+ L1_PM_Substates+
> >>>>>>>> PortCommonModeRestoreTime=255us
> >>>>>>>> PortTPowerOnTime=3100us
> >>>>>>>> L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> >>>>>>>> T_CommonMode=0us LTR1.2_Threshold=3145728ns
> >>>>>>>> L1SubCtl2: T_PwrOn=3100us
> >>>>>>>>
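
For reference, the T_PwrOn figures that lspci prints come from the scale and value fields of L1SubCtl2. Below is a minimal user-space sketch of that decoding. The two masks are meant to mirror the PCI_L1SS_CTL2_T_PWR_ON_* definitions in pci_regs.h; the function name is made up for illustration, and the raw register values mentioned afterwards are assumed encodings, not values taken from the logs.

```c
#include <stdint.h>

/* Field masks intended to mirror PCI_L1SS_CTL2_T_PWR_ON_* in pci_regs.h. */
#define L1SS_CTL2_T_PWR_ON_SCALE 0x00000003u
#define L1SS_CTL2_T_PWR_ON_VALUE 0x000000f8u

/*
 * Hypothetical helper: return T_POWER_ON in microseconds for a raw
 * L1SubCtl2 value, or -1 if the scale field holds the reserved encoding.
 */
static int decode_t_pwr_on_us(uint32_t ctl2)
{
	/* Scale encodings 0, 1, 2 stand for units of 2us, 10us, 100us. */
	static const int scale_us[] = { 2, 10, 100 };
	uint32_t scale = ctl2 & L1SS_CTL2_T_PWR_ON_SCALE;
	uint32_t value = (ctl2 & L1SS_CTL2_T_PWR_ON_VALUE) >> 3;

	if (scale > 2)
		return -1;

	return (int)value * scale_us[scale];
}
```

Under these assumptions a raw value of 0xfa (scale 100us, value 31) decodes to 3100us, the figure shown after boot, while 0x09 (scale 10us, value 1) decodes to the 10us seen after resume without the patch.
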
> >>>>>>>>
> >>>>>>>> Tested-by: Ben Chuang <benchuanggli@xxxxxxxxx>
> >>>>>>>
> >>>>>>> Forgot to add mine:
> >>>>>>> Tested-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Best regards,
> >>>>>>>> Ben Chuang
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> ---
> >>>>>>>>> Hi,
> >>>>>>>>> Kenneth R. Crudup <kenny@xxxxxxxxx>, Could you please verify this patch
> >>>>>>>>> on your laptop (Dell XPS 13) one last time?
> >>>>>>>>> IMHO, the regression observed on your laptop with an old version of the patch
> >>>>>>>>> could be due to a buggy old BIOS version in the laptop.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Vidya Sagar
> >>>>>>>>>
> >>>>>>>>> drivers/pci/pci.c | 7 +++++++
> >>>>>>>>> drivers/pci/pci.h | 4 ++++
> >>>>>>>>> drivers/pci/pcie/aspm.c | 44 +++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>> 3 files changed, 55 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >>>>>>>>> index cfaf40a540a8..aca05880aaa3 100644
> >>>>>>>>> --- a/drivers/pci/pci.c
> >>>>>>>>> +++ b/drivers/pci/pci.c
> >>>>>>>>> @@ -1667,6 +1667,7 @@ int pci_save_state(struct pci_dev *dev)
> >>>>>>>>> return i;
> >>>>>>>>>
> >>>>>>>>> pci_save_ltr_state(dev);
> >>>>>>>>> + pci_save_aspm_l1ss_state(dev);
> >>>>>>>>> pci_save_dpc_state(dev);
> >>>>>>>>> pci_save_aer_state(dev);
> >>>>>>>>> pci_save_ptm_state(dev);
> >>>>>>>>> @@ -1773,6 +1774,7 @@ void pci_restore_state(struct pci_dev *dev)
> >>>>>>>>> * LTR itself (in the PCIe capability).
> >>>>>>>>> */
> >>>>>>>>> pci_restore_ltr_state(dev);
> >>>>>>>>> + pci_restore_aspm_l1ss_state(dev);
> >>>>>>>>>
> >>>>>>>>> pci_restore_pcie_state(dev);
> >>>>>>>>> pci_restore_pasid_state(dev);
> >>>>>>>>> @@ -3489,6 +3491,11 @@ void pci_allocate_cap_save_buffers(struct pci_dev *dev)
> >>>>>>>>> if (error)
> >>>>>>>>> pci_err(dev, "unable to allocate suspend buffer for LTR\n");
> >>>>>>>>>
> >>>>>>>>> + error = pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_L1SS,
> >>>>>>>>> + 2 * sizeof(u32));
> >>>>>>>>> + if (error)
> >>>>>>>>> + pci_err(dev, "unable to allocate suspend buffer for ASPM-L1SS\n");
> >>>>>>>>> +
> >>>>>>>>> pci_allocate_vc_save_buffers(dev);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> >>>>>>>>> index e10cdec6c56e..92d8c92662a4 100644
> >>>>>>>>> --- a/drivers/pci/pci.h
> >>>>>>>>> +++ b/drivers/pci/pci.h
> >>>>>>>>> @@ -562,11 +562,15 @@ void pcie_aspm_init_link_state(struct pci_dev *pdev);
> >>>>>>>>> void pcie_aspm_exit_link_state(struct pci_dev *pdev);
> >>>>>>>>> void pcie_aspm_pm_state_change(struct pci_dev *pdev);
> >>>>>>>>> void pcie_aspm_powersave_config_link(struct pci_dev *pdev);
> >>>>>>>>> +void pci_save_aspm_l1ss_state(struct pci_dev *dev);
> >>>>>>>>> +void pci_restore_aspm_l1ss_state(struct pci_dev *dev);
> >>>>>>>>> #else
> >>>>>>>>> static inline void pcie_aspm_init_link_state(struct pci_dev *pdev) { }
> >>>>>>>>> static inline void pcie_aspm_exit_link_state(struct pci_dev *pdev) { }
> >>>>>>>>> static inline void pcie_aspm_pm_state_change(struct pci_dev *pdev) { }
> >>>>>>>>> static inline void pcie_aspm_powersave_config_link(struct pci_dev *pdev) { }
> >>>>>>>>> +static inline void pci_save_aspm_l1ss_state(struct pci_dev *dev) { }
> >>>>>>>>> +static inline void pci_restore_aspm_l1ss_state(struct pci_dev *dev) { }
> >>>>>>>>> #endif
> >>>>>>>>>
> >>>>>>>>> #ifdef CONFIG_PCIE_ECRC
> >>>>>>>>> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> >>>>>>>>> index a96b7424c9bc..2c29fdd20059 100644
> >>>>>>>>> --- a/drivers/pci/pcie/aspm.c
> >>>>>>>>> +++ b/drivers/pci/pcie/aspm.c
> >>>>>>>>> @@ -726,6 +726,50 @@ static void pcie_config_aspm_l1ss(struct pcie_link_state *link, u32 state)
> >>>>>>>>> PCI_L1SS_CTL1_L1SS_MASK, val);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> +void pci_save_aspm_l1ss_state(struct pci_dev *dev)
> >>>>>>>>> +{
> >>>>>>>>> +	int aspm_l1ss;
> >>>>>>>>> +	struct pci_cap_saved_state *save_state;
> >>>>>>>>> +	u32 *cap;
> >>>>>>>>> +
> >>>>>>>>> +	if (!pci_is_pcie(dev))
> >>>>>>>>> +		return;
> >>>>>>>>> +
> >>>>>>>>> +	aspm_l1ss = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_L1SS);
> >>>>>>>>> +	if (!aspm_l1ss)
> >>>>>>>>> +		return;
> >>>>>>>>> +
> >>>>>>>>> +	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_L1SS);
> >>>>>>>>> +	if (!save_state)
> >>>>>>>>> +		return;
> >>>>>>>>> +
> >>>>>>>>> +	cap = (u32 *)&save_state->cap.data[0];
> >>>>>>>>> +	pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, cap++);
> >>>>>>>>> +	pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, cap++);
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>> +void pci_restore_aspm_l1ss_state(struct pci_dev *dev)
> >>>>>>>>> +{
> >>>>>>>>> +	int aspm_l1ss;
> >>>>>>>>> +	struct pci_cap_saved_state *save_state;
> >>>>>>>>> +	u32 *cap;
> >>>>>>>>> +
> >>>>>>>>> +	if (!pci_is_pcie(dev))
> >>>>>>>>> +		return;
> >>>>>>>>> +
> >>>>>>>>> +	aspm_l1ss = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_L1SS);
> >>>>>>>>> +	if (!aspm_l1ss)
> >>>>>>>>> +		return;
> >>>>>>>>> +
> >>>>>>>>> +	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_L1SS);
> >>>>>>>>> +	if (!save_state)
> >>>>>>>>> +		return;
> >>>>>>>>> +
> >>>>>>>>> +	cap = (u32 *)&save_state->cap.data[0];
> >>>>>>>>> +	pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, *cap++);
> >>>>>>>>> +	pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, *cap++);
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>> static void pcie_config_aspm_dev(struct pci_dev *pdev, u32 val)
> >>>>>>>>> {
> >>>>>>>>> pcie_capability_clear_and_set_word(pdev, PCI_EXP_LNKCTL,
> >>>>>>>>> --
> >>>>>>>>> 2.17.1
> >>>>>>>>>
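
To make the save/restore flow in the patch above easier to follow outside the kernel, here is a small user-space model of it. This is only a sketch: struct fake_dev, cfg_read32(), cfg_write32(), save_l1ss() and restore_l1ss() are invented names, and a plain byte array stands in for config space. The CTL1/CTL2 offsets are the ones I believe pci_regs.h uses (0x08 and 0x0c).

```c
#include <stdint.h>
#include <string.h>

/* Register offsets inside the L1SS capability (assumed to match pci_regs.h). */
#define L1SS_CTL1 0x08
#define L1SS_CTL2 0x0c

/* Toy device: a byte array stands in for config space (illustration only). */
struct fake_dev {
	uint8_t  cfg[4096];
	uint16_t l1ss_off;	/* offset of the L1SS capability, 0 if absent */
	uint32_t saved[2];	/* plays the role of the 2 * sizeof(u32) buffer */
};

static uint32_t cfg_read32(const struct fake_dev *d, uint16_t off)
{
	uint32_t v;

	memcpy(&v, &d->cfg[off], sizeof(v));
	return v;
}

static void cfg_write32(struct fake_dev *d, uint16_t off, uint32_t v)
{
	memcpy(&d->cfg[off], &v, sizeof(v));
}

/* Save CTL2 first, then CTL1, the same order the patch uses. */
static void save_l1ss(struct fake_dev *d)
{
	if (!d->l1ss_off)
		return;

	d->saved[0] = cfg_read32(d, d->l1ss_off + L1SS_CTL2);
	d->saved[1] = cfg_read32(d, d->l1ss_off + L1SS_CTL1);
}

/* Restore in the same order, so CTL2 is written before CTL1. */
static void restore_l1ss(struct fake_dev *d)
{
	if (!d->l1ss_off)
		return;

	cfg_write32(d, d->l1ss_off + L1SS_CTL2, d->saved[0]);
	cfg_write32(d, d->l1ss_off + L1SS_CTL1, d->saved[1]);
}
```

A caller would set l1ss_off (for example 0x110, where the GL9755 reports the capability), call save_l1ss() before clobbering the two registers, and restore_l1ss() afterwards. The model does not capture any hardware side effects of the CTL1 write, which is where the real device seems to misbehave.
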
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> With this patch (and the one mentioned at
> >>>>>> https://lore.kernel.org/all/20220509073639.2048236-1-kai.heng.feng@xxxxxxxxxxxxx/)
> >>>>>> applied on 5.10 (chromeos-5.10), I am observing problems with my WiFi
> >>>>>> card after suspend/resume: it looks like all communication via PCI
> >>>>>> fails. Logs (dmesg, lspci -vvv before and after suspend/resume) are at
> >>>>>> https://gist.github.com/semihalf-majczak-lukasz/fb36dfa2eff22911109dfb91ab0fc0e3
> >>>>>>
> >>>>>> I played a little bit with this code and it looks like the
> >>>>>> pci_write_config_dword() to the PCI_L1SS_CTL1 breaks it (don't know
> >>>>>> why, not a PCI expert).
> >>>>>
> >>>>> Thanks a lot for testing this! I'm not quite sure what to make of the
> >>>>> results since v5.10 is fairly old (Dec 2020) and I don't know what
> >>>>> other changes are in chromeos-5.10.
> >>>
> >>> Lukasz: I assume you are running this on Atlas and are seeing this bug
> >>> when upreving it to the 5.10 kernel. Can you please try it on a newer
> >>> Intel platform that already has the latest upstream kernel running
> >>> and see if this can be reproduced there too?
> >>> Note that the wifi PCI device is different on newer Intel platforms,
> >>> but the platform design is similar enough that I suspect we should see
> >>> a similar bug on those too. The other option is to try the latest
> >>> upstream kernel on Atlas. Perhaps if we just care about wifi (and
> >>> ignore bringing up the graphics stack and GUI), it may come up
> >>> far enough to try this patch?
> >>>
> >>> Thanks,
> >>>
> >>> Rajat
> >>>
> >>>
> >>>>>
> >>>>> Random observations, no analysis below. This from your dmesg
> >>>>> certainly looks like PCI reads failing and returning ~0:
> >>>>>
> >>>>> Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)
> >>>>> iwlwifi 0000:01:00.0: 00000000: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
> >>>>> iwlwifi 0000:01:00.0: Device gone - attempting removal
> >>>>> Hardware became unavailable upon resume. This could be a software issue prior to suspend or a hardware issue.
> >>>>>
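
As a side note on the "returning ~0" reading of the dmesg lines above: a read from a device that no longer responds typically comes back as all ones, so a register dump made up entirely of 0xffffffff is a strong hint that the device dropped off the bus. A minimal sketch of that check follows; both helper names are hypothetical and nothing here is taken from the iwlwifi driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value a 32-bit PCI read typically returns when the device is gone. */
#define ALL_ONES_U32 0xffffffffu

/* Hypothetical helper: true if a single read looks like a failed PCI read. */
static bool read_looks_failed(uint32_t val)
{
	return val == ALL_ONES_U32;
}

/* Hypothetical helper: true if every dword of a non-empty dump is all ones. */
static bool dump_looks_dead(const uint32_t *regs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!read_looks_failed(regs[i]))
			return false;
	}

	return n > 0;
}
```

Applied to the eight ffffffff dwords in the iwlwifi dump, dump_looks_dead() would return true. A single all-ones register on its own is weaker evidence, since 0xffffffff can also be a legitimate register value.
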
> >>>>> And then we re-enumerate 01:00.0 and it looks like it may have been
> >>>>> reset (BAR is 0):
> >>>>>
> >>>>> pci 0000:01:00.0: [8086:095a] type 00 class 0x028000
> >>>>> pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00001fff 64bit]
> >>>>>
> >>>>> lspci diffs from before/after suspend:
> >>>>>
> >>>>> 00:14.0 PCI bridge: Intel Corporation Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port B #1 (rev fb) (prog-if 00 [Normal decode])
> >>>>> Bus: primary=00, secondary=01, subordinate=01, sec-latency=64
> >>>>> - DevSta: CorrErr- NonFatalErr+ FatalErr- UnsupReq+ AuxPwr+ TransPend-
> >>>>> + DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> >>>>> - LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> >>>>> + LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> >>>>> - LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
> >>>>> + LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
> >>>>> - Capabilities: [150 v0] Null
> >>>>> - Capabilities: [200 v1] L1 PM Substates
> >>>>> - L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
> >>>>> - PortCommonModeRestoreTime=40us PortTPowerOnTime=10us
> >>>>> - L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> >>>>> - T_CommonMode=40us LTR1.2_Threshold=98304ns
> >>>>> - L1SubCtl2: T_PwrOn=60us
> >>>>>
> >>>>> The DevSta differences might be BIOS bugs, probably not relevant.
> >>>>> Interesting that ASPM is disabled, maybe didn't get enabled after
> >>>>> re-enumerating 01:00.0? Strange that the L1 PM Substates capability
> >>>>> disappeared.
> >>>>>
> >>>>> 01:00.0 Network controller: Intel Corporation Wireless 7265 (rev 59)
> >>>>> LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> >>>>> - ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
> >>>>> + ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> >>>>> Capabilities: [154 v1] L1 PM Substates
> >>>>> L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
> >>>>> PortCommonModeRestoreTime=30us PortTPowerOnTime=60us
> >>>>> - L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> >>>>> - T_CommonMode=0us LTR1.2_Threshold=98304ns
> >>>>> + L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
> >>>>> + T_CommonMode=0us LTR1.2_Threshold=0ns
> >>>>>
> >>>>> Dmesg claimed we reconfigured common clock config. Maybe ASPM didn't
> >>>>> get reinitialized after re-enumeration? Looks like we didn't restore
> >>>>> L1SubCtl1.
> >>>>>
> >>>>> Bjorn
> >>>>>
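
To relate the L1SubCtl1 lines in the lspci diff above to the raw register, here is a minimal user-space decoder. The masks are meant to mirror the PCI_L1SS_CTL1_* definitions in pci_regs.h; struct l1ss_ctl1 and decode_l1ss_ctl1() are made-up names, and the example raw value given afterwards is just one encoding that produces the printed numbers, not something read from the device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field masks intended to mirror PCI_L1SS_CTL1_* in pci_regs.h. */
#define L1SS_CTL1_PCIPM_L1_2		0x00000001u
#define L1SS_CTL1_PCIPM_L1_1		0x00000002u
#define L1SS_CTL1_ASPM_L1_2		0x00000004u
#define L1SS_CTL1_ASPM_L1_1		0x00000008u
#define L1SS_CTL1_CM_RESTORE_TIME	0x0000ff00u
#define L1SS_CTL1_LTR_L12_TH_VALUE	0x03ff0000u
#define L1SS_CTL1_LTR_L12_TH_SCALE	0xe0000000u

/* Hypothetical container for the decoded fields. */
struct l1ss_ctl1 {
	bool pcipm_l12, pcipm_l11, aspm_l12, aspm_l11;
	unsigned int cm_restore_us;
	uint64_t ltr_l12_threshold_ns;
};

static struct l1ss_ctl1 decode_l1ss_ctl1(uint32_t v)
{
	struct l1ss_ctl1 d;
	uint32_t value = (v & L1SS_CTL1_LTR_L12_TH_VALUE) >> 16;
	uint32_t scale = (v & L1SS_CTL1_LTR_L12_TH_SCALE) >> 29;

	d.pcipm_l12 = v & L1SS_CTL1_PCIPM_L1_2;
	d.pcipm_l11 = v & L1SS_CTL1_PCIPM_L1_1;
	d.aspm_l12 = v & L1SS_CTL1_ASPM_L1_2;
	d.aspm_l11 = v & L1SS_CTL1_ASPM_L1_1;
	d.cm_restore_us = (v & L1SS_CTL1_CM_RESTORE_TIME) >> 8;

	/*
	 * Scale n means units of 32^n ns (1, 32, 1024, ...); encodings
	 * above 5 are treated as invalid here and reported as 0.
	 */
	d.ltr_l12_threshold_ns = scale <= 5 ? (uint64_t)value << (5 * scale) : 0;

	return d;
}
```

For example, 0x6003280f decodes to all four enable bits set, T_CommonMode=40us and LTR1.2_Threshold=98304ns, which is what the root port showed before suspend. A value of 0 decodes to everything disabled and a 0ns threshold, matching the endpoint after resume.
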
> >>
> >> Hi,
> >>
> >> Thank you all for the responses and input! As Rajat mentioned, I'm using
> >> a Chromebook, but not Atlas (Amberlake); in this case it is Babymega
> >> (Apollolake). I will try to load the most recent kernel and give it
> >> another try.
> >>
> >> Best regards,
> >> Lukasz
> >
> > Hi,
> >
> > I have applied this patch on top of v5.19-rc7 (chromeos) and I'm
> > still getting the same results:
> > https://gist.github.com/semihalf-majczak-lukasz/4b716704c21a3758d6711b2030ea34b9
> >
> > Best regards,
> > Lukasz
> >
Hi Vidya,

Sorry for the long delay. I have retested your patch on top of
linux-next/master (next-20220802) and the results for my device remain
the same.
Here are the logs (lspci -vvv before suspend, lspci -vvv after resume, and dmesg):
https://gist.github.com/semihalf-majczak-lukasz/c7bfd811359f23278034056a8002b3ef
Let me know if you need any more logs and/or tests.

Best regards,
Lukasz