Re: [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

From: Lyude Paul
Date: Fri Mar 22 2019 - 19:50:32 EST


Note: I did read your response lower down in the thread, but I wanted to make
sure I addressed one of the comments here (see below)

On Thu, 2019-03-21 at 17:48 -0500, Bjorn Helgaas wrote:
> [+cc Rafael]
>
> On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> > On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > > variant, the BIOS seems to have a very nasty habit of not
> > > > > always resetting the secondary Nvidia GPU between full reboots
> > > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > > reason for this happening is unknown, but the following steps
> > > > > and possibly a good bit of patience will reproduce the issue:
> > > > >
> > > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > > 2. Allow the nvidia GPU to runtime suspend itself after being idle
> > > > > 3. Reboot the machine, the more sudden the better (e.g sysrq-b may
> > > > > help)
> > > > > 4. If nouveau loads up properly, reboot the machine again and go
> > > > > back to
> > > > > step 2 until you reproduce the issue
> > > > >
> > > > > This results in some very strange behavior: the GPU will quite
> > > > > literally be left in exactly the same state it was in when the
> > > > > previously booted kernel started the reboot. This has all
> > > > > sorts of bad sideaffects: for starters, this completely breaks
> > > > > nouveau starting with a mysterious EVO channel failure that
> > > > > happens well before we've actually used the EVO channel for
> > > > > anything:
>
> Thanks for the hybrid tutorial (snipped from this response). IIUC,
> what you said was that in hybrid mode, the Intel GPU drives the
> built-in display and the Nvidia GPU drives any external displays and
> may be used for DRI PRIME rendering (whatever that is). But since you
> say the Nvidia device gets runtime suspended, I assume there's no
> external display here and you're not using DRI PRIME.
>
> I wonder if it's related to the fact that the Nvidia GPU has been
> runtime suspended before you do the reboot. Can you try turning of
> runtime power management for the GPU by setting the runpm module
> parameter to 0? I *think* this would be booting with
> "nouveau.runpm=0".
>
> > > > Is there a bug report for this? Bugzilla.kernel.org would be ideal,
> > > > including "lspci -vvxxx" and dmidecode for the system.
> > > >
> > > Not yet, but there has been discussion about this between nouveau
> > > developers on our IRC channel.
> >
> > I lied: yes there actually is a bug report for this, but it's
> > currently on the Red Hat bugzilla. I can get more information from
> > it if you need (with lenovo's approval of course).
>
> Can you please make a bugzilla.kernel.org entry with as much
> information (dmesg, "lspci -vvxxx", dmidecode, etc) as you can get
> approval for? You can include the Red Hat bugzilla URL in the commit
> log, too, but that's not quite as good because we have no control over
> whether it's public.
>
> > And additionally: I've been working with Lenovo on this issue for a
> > couple of months now, and we've gone through dozens of different
> > trial BIOSes with no success thus far. However, Lenovo is currently
> > working on trying to add this workaround into their BIOS but I've
> > been told that this change is going to take a decent amount of time
> > since they need to test it across multiple operating systems. I'd be
> > happy to come back and add a conditional later to turn this
> > workaround off for later BIOS versions once Lenovo has released a
> > proper fix.
>
> Sounds like Lenovo is going to a lot of trouble for this. The ideal
> thing from my point of view would be if they could figure out why this
> works on Windows but not on Linux. I doubt Windows has a quirk like
> this, so if we could figure out why it works on Windows, we could
> likely do something similar in Linux.

I did actually try this route after first finding this bug, but unfortunately
from what I understand there isn't really much more Lenovo can do other then
give us a patched BIOS or look at their own BIOS to see if it's the cause.

Anyway, went ahead and filed a bug with as much information as I could get my
hands on here (different email then the one I'm talking to you from):
https://bugzilla.kernel.org/show_bug.cgi?id=203003

>
> > > > > So to do this, we add a new pci quirk using
> > > > > DECLARE_PCI_FIXUP_CLASS_FINAL that will be invoked before the PCI
> > > > > probe
> > > > > at boot finishes. From there, we check to make sure that this is
> > > > > indeed
> > > > > the specific P50 variant of this GPU. We also make sure that the GPU
> > > > > PCI
> > > > > device is advertising NoReset- in order to prevent us from trying to
> > > > > reset the GPU when the machine is in Dedicated graphics mode (where
> > > > > the
> > > > > GPU being initialized by the BIOS is normal and expected). Finally,
> > > > > we
> > > > > try mapping the MMIO space for the GPU which should only work if the
> > > > > GPU
> > > > > is actually active in D0 mode. We can then read the magic 0x2240c
> > > > > register on the GPU, which will have bit 1 set if the GPU's firmware
> > > > > has
> > > > > already been posted during a previous boot. Once we've confirmed all
> > > > > of
> > > > > this, we reset the PCI device and re-disable it - bringing the GPU
> > > > > back
> > > > > into a healthy state.
> > > > >
> > > > > Signed-off-by: Lyude Paul <lyude@xxxxxxxxxx>
> > > > > Cc: nouveau@xxxxxxxxxxxxxxxxxxxxx
> > > > > Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
> > > > > Cc: Karol Herbst <kherbst@xxxxxxxxxx>
> > > > > Cc: Ben Skeggs <skeggsb@xxxxxxxxx>
> > > > > Cc: stable@xxxxxxxxxxxxxxx
> > > > > ---
> > > > > drivers/pci/quirks.c | 65
> > > > > ++++++++++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 65 insertions(+)
> > > > >
> > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > > > index b0a413f3f7ca..948492fda8bf 100644
> > > > > --- a/drivers/pci/quirks.c
> > > > > +++ b/drivers/pci/quirks.c
> > > > > @@ -5117,3 +5117,68 @@ SWITCHTEC_QUIRK(0x8573); /* PFXI 48XG3 */
> > > > > SWITCHTEC_QUIRK(0x8574); /* PFXI 64XG3 */
> > > > > SWITCHTEC_QUIRK(0x8575); /* PFXI 80XG3 */
> > > > > SWITCHTEC_QUIRK(0x8576); /* PFXI 96XG3 */
> > > > > +
> > > > > +/*
> > > > > + * On certain Lenovo Thinkpad P50 SKUs, specifically those with a
> > > > > Nvidia
> > > > > + * Quadro M1000M, the BIOS will occasionally make the mistake of
> > > > > not
> > > > > resetting
> > > > > + * the nvidia GPU between reboots if the system is configured to
> > > > > use
> > > > > hybrid
> > > > > + * graphics mode. This results in the GPU being left in whatever
> > > > > state
> > > > > it
> > > > > was
> > > > > + * in during the previous boot which causes spurious interrupts
> > > > > from
> > > > > the
> > > > > GPU,
> > > > > + * which in turn cause us to disable the wrong IRQs and end up
> > > > > breaking
> > > > > the
> > > > > + * touchpad. Unsurprisingly, this also completely breaks nouveau.
> > > > > + *
> > > > > + * Luckily, it seems a simple reset of the PCI device for the
> > > > > nvidia
> > > > > GPU
> > > > > + * manages to bring the GPU back into a clean state and fix all of
> > > > > these
> > > > > + * issues. Additionally since the GPU will report NoReset+ when the
> > > > > machine is
> > > > > + * configured in Dedicated display mode, we don't need to worry
> > > > > about
> > > > > + * accidentally resetting the GPU when it's supposed to already be
> > > > > + * initialized.
> > > > > + */
> > > > > +static void
> > > > > +quirk_lenovo_thinkpad_p50_nvgpu_survives_reboot(struct pci_dev
> > > > > *pdev)
> > > > > +{
> > > > > + void __iomem *map;
> > > > > + int ret;
> > > > > +
> > > > > + if (pdev->subsystem_vendor != PCI_VENDOR_ID_LENOVO ||
> > > > > + pdev->subsystem_device != 0x222e ||
> > > > > + !pdev->reset_fn)
> > > > > + return;
> > > > > +
> > > > > + /*
> > > > > + * If we can't enable the device's mmio space, it's probably
> > > > > not even
> > > > > + * initialized. This is fine, and means we can just skip the
> > > > > quirk
> > > > > + * entirely.
> > > > > + */
> > > > > + if (pci_enable_device_mem(pdev)) {
> > > > > + pci_dbg(pdev, "Can't enable device mem, no reset
> > > > > needed\n");
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + /* Taken from drivers/gpu/drm/nouveau/engine/device/base.c */
> > > > > + map = ioremap(pci_resource_start(pdev, 0), 0x102000);
> > > > > + if (!map) {
> > > > > + pci_err(pdev, "Can't map MMIO space, this is probably
> > > > > very
> > > > > bad\n");
> > > > > + goto out_disable;
> > > > > + }
> > > > > +
> > > > > + /*
> > > > > + * Be extra careful, and make sure that the GPU firmware is
> > > > > posted
> > > > > + * before trying a reset
> > > > > + */
> > > > > + if (ioread32(map + 0x2240c) & 0x2) {
> > > > > + pci_info(pdev,
> > > > > + FW_BUG "GPU left initialized by EFI,
> > > > > resetting\n");
> > > > > + ret = pci_reset_function(pdev);
> > > > > + if (ret < 0)
> > > > > + pci_err(pdev, "Failed to reset GPU: %d\n",
> > > > > ret);
> > > > > + }
> > > > > +
> > > > > + iounmap(map);
> > > > > +out_disable:
> > > > > + pci_disable_device(pdev);
> > > > > +}
> > > > > +
> > > > > +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> > > > > + PCI_CLASS_DISPLAY_VGA, 8,
> > > > > + quirk_lenovo_thinkpad_p50_nvgpu_survives
> > > > > _reboot)
> > > > > ;
> > > > > --
> > > > > 2.20.1
> > > > >
> > --
> > Cheers,
> > Lyude Paul
> >
--
Cheers,
Lyude Paul