Re: [PATCH v2] PCI: vmd: Enable PCI PM's L1 substates of remapped PCIe Root Port and NVMe

From: David E. Box
Date: Wed Feb 07 2024 - 14:51:53 EST


On Tue, 2024-02-06 at 17:30 -0600, Bjorn Helgaas wrote:
> On Tue, Feb 06, 2024 at 01:25:29PM -0800, David E. Box wrote:
> > On Mon, 2024-02-05 at 15:05 -0800, David E. Box wrote:
> > > On Mon, 2024-02-05 at 16:42 -0600, Bjorn Helgaas wrote:
> > > > On Mon, Feb 05, 2024 at 11:37:16AM -0800, David E. Box wrote:
> > > > > On Fri, 2024-02-02 at 18:05 -0600, Bjorn Helgaas wrote:
> > > > > > On Fri, Feb 02, 2024 at 03:11:12PM +0800, Jian-Hong Pan wrote:
> > > > > ...
> > > >
> > > > > > > @@ -775,6 +773,14 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
> > > > > > >         pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, ltr_reg);
> > > > > > >         pci_info(pdev, "VMD: Default LTR value set by driver\n");
> > > > > >
> > > > > > You're not changing this part, and I don't understand exactly
> > > > > > how LTR works, but it makes me a little bit queasy to read "set
> > > > > > the LTR value to the maximum required to allow the deepest power
> > > > > > management savings" and then we set the max snoop values to a
> > > > > > fixed constant.
> > > > > >
> > > > > > I don't think the goal is to "allow the deepest power savings"; I
> > > > > > think it's to enable L1.2 *when the device has enough buffering to
> > > > > > absorb L1.2 entry/exit latencies*.
> > > > > >
> > > > > > The spec (PCIe r6.0, sec 7.8.2.2) says "Software should set this to
> > > > > > the platform's maximum supported latency or less," so it seems like
> > > > > > that value must be platform-dependent, not fixed.
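
(For illustration: PCIe r6.0, sec 7.8.2.2, encodes each latency as a
10-bit value in bits 9:0 plus a 3-bit scale in bits 12:10, where each
scale step multiplies the value by 32.  A minimal sketch of that
encoding follows; ltr_ns_to_reg() is an illustrative helper, not an
existing kernel API:

#include <linux/kernel.h>       /* DIV_ROUND_UP() */
#include <linux/pci.h>          /* PCI_LTR_VALUE_MASK, PCI_LTR_SCALE_SHIFT */

/* Illustrative only: encode a nanosecond latency as LTR value + scale */
static u16 ltr_ns_to_reg(u64 ns)
{
        u16 scale = 0;

        /* Round up and bump the x32 scale until the value fits 10 bits */
        while (ns > PCI_LTR_VALUE_MASK && scale < 5) {
                ns = DIV_ROUND_UP(ns, 32);
                scale++;
        }
        if (ns > PCI_LTR_VALUE_MASK)
                ns = PCI_LTR_VALUE_MASK;        /* saturate at the maximum */
        return (scale << PCI_LTR_SCALE_SHIFT) | (ns & PCI_LTR_VALUE_MASK);
}

A "fixed constant" bakes in one such value/scale pair regardless of the
platform.)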
> > > > > >
> > > > > > And I assume the "_DSM for Latency Tolerance Reporting" is part
> > > > > > of the way to get those platform-dependent values, but Linux
> > > > > > doesn't actually use that yet.
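
(A minimal sketch of what evaluating that _DSM could look like,
assuming Function Index 6 under the standard PCI _DSM GUID returning a
package of four integers (max snoop scale, max snoop value, max
no-snoop scale, max no-snoop value) per PCI Firmware r3.3, sec 4.6.6;
the helper name and the Revision ID are assumptions:

#include <linux/acpi.h>
#include <linux/pci.h>
#include <linux/pci-acpi.h>     /* pci_acpi_dsm_guid */

#define DSM_PCI_LTR_MAX_LATENCY 6       /* assumed Function Index */

static int pci_acpi_get_ltr(struct pci_dev *pdev, u16 *snoop, u16 *nosnoop)
{
        acpi_handle handle = ACPI_HANDLE(&pdev->dev);
        union acpi_object *obj, *elem;

        if (!handle)
                return -ENODEV;

        obj = acpi_evaluate_dsm(handle, &pci_acpi_dsm_guid,
                                2 /* assumed Revision ID */,
                                DSM_PCI_LTR_MAX_LATENCY, NULL);
        if (!obj)
                return -ENODEV;

        if (obj->type != ACPI_TYPE_PACKAGE || obj->package.count < 4) {
                ACPI_FREE(obj);
                return -EINVAL;
        }

        /* Pack the scale/value pairs into the LTR register layout */
        elem = obj->package.elements;
        *snoop = (elem[0].integer.value << PCI_LTR_SCALE_SHIFT) |
                 (elem[1].integer.value & PCI_LTR_VALUE_MASK);
        *nosnoop = (elem[2].integer.value << PCI_LTR_SCALE_SHIFT) |
                   (elem[3].integer.value & PCI_LTR_VALUE_MASK);
        ACPI_FREE(obj);
        return 0;
}

The _DSM lives under the platform's embedded Downstream Ports, so in
practice this would presumably be evaluated there rather than on the
endpoint.)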
> > > > >
> > > > > This may indeed be the best way, but we need to double-check with
> > > > > our BIOS folks.  AFAIK BIOS writes the LTR values directly, so
> > > > > there hasn't been a need to use this _DSM.  But under VMD the
> > > > > ports are hidden from BIOS, which is why we added it here.  I've
> > > > > raised the question internally to find out how Windows handles
> > > > > the _DSM and to get a recommendation from our firmware leads.
> > > >
> > > > We want Linux to be able to program LTR itself, don't we?  We
> > > > shouldn't have to rely on firmware to do it.  If Linux can't do
> > > > it, hot-added devices aren't going to be able to use L1.2,
> > > > right?
> > >
> > > Agreed. We just want to make sure we are not conflicting with what
> > > BIOS may be doing.
> >
> > So the feedback is to run the _DSM and just overwrite any BIOS
> > values. Looking up the _DSM I saw there was an attempt to upstream
> > this 4 years ago [1]. I'm not sure why the effort stalled but we can
> > pick up this work again.
> >
> > [1] https://patchwork.kernel.org/project/linux-pci/patch/20201015080311.7811-1-puranjay12@xxxxxxxxx/
>
> There was a PCI SIG discussion about this a few years ago that never
> really seemed to get resolved:
> https://members.pcisig.com/wg/PCIe-Protocol/mail/thread/35064
>
> Unfortunately that discussion is not public, but the summary is:
>
>   Q: How is the LTR_L1.2_THRESHOLD value determined?
>
>      PCIe r5.0, sec 5.5.4, says the same value must be programmed into
>      both Ports.
>
>      A: As noted in sec 5.5.4, the value is determined primarily by
>         the amount of time it will take to re-establish the common
>         mode bias on the AC coupling caps, and it is assumed that the
>         BIOS knows this.
>
>   Q: How are the LTR Max Snoop values determined?
>
>      PCI Firmware r3.3, sec 4.6.6, says the LTR _DSM reports the max
>      values for each Downstream Port embedded in the platform, and the
>      OS should calculate latencies along the path between each
>      Downstream Port and any Upstream Port (Switch Upstream Port or
>      Endpoint).
>
>      Of course, Switches not embedded in the platform (e.g., external
>      Thunderbolt hierarchies) will not have this _DSM, but I assume
>      they should contribute to this sum?
>
>      A: The fundamental problem is that there is no practical way for
>         software to discover the AC coupling capacitor sizes and
>         common mode bias circuit impedance.
>
>         Software could compute conservative values, but they would
>         likely be 10x worse than typical, so the L1.2 exit latency
>         would be significantly longer than it actually needs to be.
>
>         The interoperability issues here were understood when
>         designing L1 Substates, but no viable solution was found.
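
(The "calculate latencies along the path" step above might look
roughly like the sketch below, if the per-hop numbers were
discoverable; ltr_switch_hop_ns() is hypothetical, which is exactly
the crux of the problem described in the answer:

#include <linux/pci.h>

/* Hypothetical: there is no architected source for this number */
static u64 ltr_switch_hop_ns(struct pci_dev *sw)
{
        return 0;       /* placeholder */
}

/* Add switch-hop costs to the _DSM value reported at the Root Port */
static u64 ltr_path_latency_ns(struct pci_dev *ep, u64 dsm_ns)
{
        struct pci_dev *bridge = pci_upstream_bridge(ep);
        u64 total = dsm_ns;

        while (bridge && pci_pcie_type(bridge) != PCI_EXP_TYPE_ROOT_PORT) {
                if (pci_pcie_type(bridge) == PCI_EXP_TYPE_UPSTREAM)
                        total += ltr_switch_hop_ns(bridge);
                bridge = pci_upstream_bridge(bridge);
        }
        return total;
}

External hierarchies add hops with no _DSM to describe them.)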
>
> So the main reason Puranjay's work got stalled is that I didn't feel
> confident enough that we understood how to do this, especially for
> external devices.
>
> It would be great if somebody *did* feel confident about interpreting
> and implementing all this.

As it is, BIOS (at least Intel BIOS) already writes the maximum allowed LTR
value on Upstream Ports that have it set to 0.  So we can't do any worse if
we write the BIOS-provided _DSM value for all Upstream Ports, including
external devices.  It sounds like the worst-case scenario is that devices
take longer than needed to exit L1.2 (I'm still asking about this detail).
But I think this is better than not programming the LTR at all, which could
prevent the platform from power gating the very resources that LTR is meant
to help manage.
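
Concretely, the direction would be something like this sketch, run for
every Upstream Port; pci_acpi_get_ltr() stands in for a yet-to-be-written
_DSM helper along the lines sketched earlier in the thread:

#include <linux/pci.h>

/* Sketch: program LTR from firmware-provided values, overwriting BIOS */
static void pci_program_ltr_from_dsm(struct pci_dev *pdev)
{
        u16 snoop, nosnoop;
        int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_LTR);

        if (!pos || pci_acpi_get_ltr(pdev, &snoop, &nosnoop))
                return;

        pci_write_config_word(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, snoop);
        pci_write_config_word(pdev, pos + PCI_LTR_MAX_NOSNOOP_LAT, nosnoop);
        pci_info(pdev, "LTR set from _DSM\n");
}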

If that reasoning is okay with you, I'll submit patches to use the _DSM.

David