Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

From: Kai-Heng Feng
Date: Tue Apr 12 2022 - 20:19:48 EST


On Wed, Apr 13, 2022 at 6:50 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> [+cc Ricky for rtsx_pci ASPM behavior, Rajat, Prasad for L1 SS stuff,
> Victor for interest in disabling ASPM during save/restore]
>
> On Wed, Feb 16, 2022 at 06:41:39PM +0530, Vidya Sagar wrote:
> > On 2/16/2022 11:30 AM, Kenneth R. Crudup wrote:
> > > On Wed, 16 Feb 2022, Vidya Sagar wrote:
> > >
> > > > I see that the ASPM-L1 state of Realtek NIC which was in
> > > > disabled state before hibernate got enabled after hibernate.
> > >
> > > That's actually my SD-Card reader; there's a good chance the BIOS
> > > does "something" to it at boot time, as it's possible to boot from
> > > SD-Card on my laptop.
> > >
> > > > This patch doesn't do anything to LnkCtl register which has
> > > > control for ASPM L1 state.
> > >
> > > > Could you please check why ASPM L1 got enabled post hibernation?
> > >
> > > I wouldn't know how to do that; if you're still interested in that
> > > let me know what to do to determine that.
>
> > I would like Bjorn to take a call on it.
> > At this point, there are contradictions in observations.
>
> Remind me what contradictions you see? I know Kenny saw NVMe errors
> on a kernel that included 4257f7e008ea ("PCI/ASPM: Save/restore L1SS
> Capability for suspend/resume") in December 2020 [1], and that he did
> *not* see those errors on 4257f7e008ea in February 2022 [2]. Is that
> what you mean?
>
> > Just to summarize,
> > - The root ports in your laptop don't have support for L1SS
> > - With the same old code base with which the errors were observed plus my
> > patch on top of it, I see that ASPM-L1 state getting enabled for one of the
> > endpoints (Realtek SD-Card reader) after system comes out of hibernation
> > even though ASPM-L1 was disabled before the system enter into hibernation.
> > No errors are reported now.
>
> I assume you refer to [2], where on 4257f7e008ea ("PCI/ASPM:
> Save/restore L1SS Capability for suspend/resume"), Kenny saw ASPM L1
> disabled before hibernate and enabled afterwards:
>
> --- pre-hibernate
> +++ post-hibernate
> 00:1d.7 PCI bridge [0604]: Intel [8086:34b7]
> Bus: primary=00, secondary=58, subordinate=58
> LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> 58:00.0 RTS525A PCI Express Card Reader [10ec:525a]
> - LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
> - ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> + LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk-
> + ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>
> Per PCIe r6.0, sec 7.5.3.7, "ASPM L1 must be enabled by software in
> the Upstream component on a Link prior to enabling ASPM L1 in the
> Downstream component on that Link," so this definitely seems broken,
> but wouldn't explain the NVMe issue.
>
> The PCI core (pcie_config_aspm_link()) always enables L1 in the
> upstream component before the downstream one, but 58:00.0 uses the
> rtsx_pci driver, which does a lot of its own ASPM fiddling, so my
> guess is that it's doing something wrong here.
>
> > - With the linux-next top of the tree plus my patch, no change in the ASPM
> > states and no errors also reported.
>
> I don't know which report this refers to.
>
> > This points to BIOS being buggy (both old and new with new one being less
> > problematic)
>
> I agree that a BIOS change between [1] and [2] seems plausible, but I
> don't think we can prove that yet. I'm slightly queasy because while
> Kenny may have updated his BIOS, most people will not have.
>
> I think we should try this patch again with some changes and maybe
> some debug logging:
>
> - I wonder if we should integrate the LTR, L1 SS, and Link Control
> ASPM restore instead of having them spread around through
> pci_restore_ltr_state(), pci_restore_aspm_l1ss_state(), and
> pci_restore_pcie_state(). Maybe a new pci_restore_aspm() that
> would be called from pci_restore_pcie_state()?
>
> - For L1 PM Substates configuration, sec 5.5.4 says that both ports
> must be configured while ASPM L1 is disabled, but I don't think we
> currently guarantee this: we restore all the upstream component
> state first, and we don't know the ASPM state of the downstream
> one. Maybe we need to:
>
> * When restoring upstream component,
> + disable its ASPM
>
> * When restoring downstream component,
> + disable its ASPM
> + restore upstream component's LTR, L1SS
> + restore downstream component's LTR, L1SS
> + restore upstream component's ASPM
> + restore downstream component's ASPM

Right now L1SS isn't reprogrammed after S3, and that causes WD NVMe
starts to spew lots of AER errors.
So yes please restore L1SS upon resume. Meanwhile I am asking vendor
_why_ restoring L1SS is crucial for it to work.

I also wonder what's the purpose of pcie_aspm_pm_state_change()? Can't
we just restore ASPM bits like other configs?

Kai-Heng

>
> This seems pretty messy, but seems like what the spec requires.
>
> - Add some pci_dbg() logging of all these save/restore values to
> help debug any issues.
>
> Bjorn
>
> [1] https://lore.kernel.org/r/20201228040513.GA611645@bjorn-Precision-5520
> [2] https://lore.kernel.org/r/3ca14a7-b726-8430-fe61-a3ac183a1088@xxxxxxxxx