Re: [PATCH v2] mmc: sdhci-pci-gli: GL975x: Mask rootport's replay timer timeout during suspend

From: Kai-Heng Feng
Date: Thu Mar 21 2024 - 06:06:09 EST


Hi Bjorn,

Sorry for the belated response.

On Sat, Jan 20, 2024 at 6:41 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Thu, Jan 18, 2024 at 02:40:50PM +0800, Kai-Heng Feng wrote:
> > On Sat, Jan 13, 2024 at 1:37 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Fri, Jan 12, 2024 at 01:14:42PM +0800, Kai-Heng Feng wrote:
> > > > On Sat, Jan 6, 2024 at 5:19 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > > > On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote:
> > > > > > Spamming `lspci -vv` can still observe the replay timer timeout error
> > > > > > even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the
> > > > > > replay timer timeout of AER"), albeit with a lower reproduce rate.
> > > > >
> > > > > I'm not sure what this is telling me. By "spamming `lspci -vv`, do
> > > > > you mean that if you run lspci continually, you still see Replay Timer
> > > > > Timeout logged, e.g.,
> > > > >
> > > > > CESta: ... Timeout+
> > > >
> > > > Yes it's logged and the AER IRQ is raised.
> > >
> > > IIUC the AER IRQ is the important thing.
> > >
> > > Neither 015c9cbcf0ad nor this patch affects logging in
> > > PCI_ERR_COR_STATUS, so the lspci output won't change and mentioning it
> > > here doesn't add useful information.
> >
> > You are right. That's just a way to access config space to reproduce
> > the issue.
>
> Oh, I think I completely misunderstood you! I thought you were saying
> that suspending the device caused the PCI_ERR_COR_REP_TIMER error, and
> you happened to see that it was logged when you ran lspci.

The error can be observed both when running lspci and when suspending
the device, because both access the config space.

>
> But I guess you mean that running lspci actually *causes* the error?
> I.e., lspci does a config access while we're suspending the device
> causes the error, and the config access itself causes the error, which
> causes the ERR_COR message and ultimately the AER interrupt, and that
> interrupt prevents the system suspend.

My point was that any kind of PCI config access can cause the error.
Using lspci just makes the error easier to reproduce.

>
> If that's the case, I wonder if this is a generic problem that could
> happen with *any* device, not just GL975x.

For now, it's just GL975x.

>
> What power state do we put the GL975x in during system suspend?
> D3hot? D3cold? Is there anything that prevents config access while
> we suspend it?

The target device state is D3hot.
However, the issue happens while the device is still in D0, when the
PCI core is saving the device's config space.

So I think the issue isn't related to the device state.
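
To illustrate why masking helps regardless of the device state, here is
a simplified userspace model, not the actual patch. The struct and
function names (aer_cor_regs, aer_cor_report, mask_replay_timer) are
made up for this sketch; only the PCI_ERR_COR_REP_TIMER value is taken
from include/uapi/linux/pci_regs.h. It assumes the usual AER behaviour
that a masked correctable error is still logged in the status register
but does not generate ERR_COR:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit value as defined in include/uapi/linux/pci_regs.h. */
#define PCI_ERR_COR_REP_TIMER 0x00001000u /* Replay Timer Timeout */

/* Hypothetical stand-in for the root port's AER correctable registers. */
struct aer_cor_regs {
	uint32_t status; /* models PCI_ERR_COR_STATUS */
	uint32_t mask;   /* models PCI_ERR_COR_MASK */
};

/*
 * Record a correctable error. Returns true if an ERR_COR message (and
 * therefore an AER interrupt) would be generated for it.
 */
static bool aer_cor_report(struct aer_cor_regs *r, uint32_t err)
{
	r->status |= err;             /* logged even when masked */
	return (err & ~r->mask) != 0; /* only unmasked bits are signalled */
}

/* Set or clear the Replay Timer Timeout bit in the mask. */
static void mask_replay_timer(struct aer_cor_regs *r, bool masked)
{
	if (masked)
		r->mask |= PCI_ERR_COR_REP_TIMER;
	else
		r->mask &= ~PCI_ERR_COR_REP_TIMER;
}
```

In this model, setting the mask before suspend and clearing it on resume
keeps the timeout visible in the status register (so lspci still shows
it) while suppressing the interrupt that would otherwise abort suspend.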

>
> We do have dev->block_cfg_access, and there's a comment that says
> "we're required to prevent config accesses during D-state
> transitions," but I don't see it being used during D-state
> transitions.

Yes, no D-state change happens here.

Kai-Heng

>
> Also, it doesn't seem suitable for preventing config accesses during
> suspend because pci_wait_cfg() just busy-waits and never returns an
> error.
>
> Bjorn