Re: [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

From: Lyude Paul
Date: Wed Apr 24 2019 - 19:03:22 EST


On Wed, 2019-04-24 at 17:36 -0500, Bjorn Helgaas wrote:
> On Wed, Apr 24, 2019 at 03:16:37PM -0400, Lyude Paul wrote:
> > On Wed, 2019-04-24 at 13:59 -0500, Bjorn Helgaas wrote:
> > > Not being a scheduled work expert, I was unsure if this experiment was
> > > equivalent to what I proposed.
> > >
> > > I'm always suspicious of singleton solutions like this (using
> > > schedule_work() in runtime_resume()) because usually they seem to be
> > > solving a generic problem that should happen on many kinds of
> > > hardware. The 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime)
> > > resume") commit log says:
> > >
> > > We need to call drm_helper_hpd_irq_event() on resume to properly
> > > detect monitor connection / disconnection on some laptops, use
> > > hpd_work for this to avoid deadlocks.
> > >
> > > The situation of a monitor being connected or disconnected during
> > > suspend can happen to *any* GPU, but the commit only changes nouveau,
> > > which of course raises the question of how we deal with that in other
> > > drivers. If the Nvidia GPU has some unique behavior related to
> > > monitor connection, that would explain special-case code there, but
> > > the commit doesn't mention anything like that.
> > >
> > > It should be simple to revert 0b2fe6594fa2 and see whether it changes
> > > the behavior at all (well, simple except for the fact that this
> > > problem isn't 100% reproducible in the first place).
> >
> > It's not 100% reproducible, but it's at least 90% so it's not
> > difficult for me to test at all.
> >
> > Also, reverting this commit makes no difference either.
>
> OK, great, that makes it crystal clear. I didn't know you had
> specifically tested that revert. Thanks for doing that.
>
> > Note that while that commit only changed nouveau, scheduled_work()
> > is exactly how a number of other drivers (i915 for instance) handle
> > reprobing like this as well.
>
> OK. The GPU code would be a lot more approachable if similar things
> were done in similar ways. I spent an hour or so looking for this
> similar code in i915, but gave up.

We try
>
> > The reason being that we can't do full connector reprobing in our
> > runtime resume thread because we could deadlock if someone else is
> > holding a modesetting lock we need and waiting on us to resume at
> > the same time (there's a number of other bug fixes in nouveau for
> > other issues caused by the same deadlock scenario).
>
> You mention nouveau specifically here, but I assume this is a generic
> deadlock scenario that applies to any GPU, and they all avoid the
> deadlock in the same way. Right?

Yes-there is only one exception in the tree that I know of (amdgpu/radeon),
but fixing that is on my todo if someone else hasn't already gotten to it yet.

>
> > I'm confused here though, it sounds like you're running under the
> > assumption that PCI devices like this aren't reset into a clean
> > state during a system reboot, is that correct?
>
> No, I wasn't trying to say anything about that. My point here is
> that:
>
> - you're reporting a problem that only happens with nouveau and
> only happens during shutdown/reboot
> - the behavior is similar to a race (not 100% reproducible, seems
> to happen more if shutdown is faster)
> - shutdown involves resuming the device (see pci_device_shutdown())
> - nouveau_pmops_runtime_resume() schedules asynchronous work, which
> (to my untrained eye) looks unusual
> - asynchronous work is inherently subject to races
>
> So I think that's all somewhat suspicious. But if the same problem
> happens without the asynchronous work, obviously the issue is
> elsewhere.
>
> But you *are* right that if the device were actually reset, none of
> this should matter. It certainly seems that the BIOS neglects to
> reset it in some cases.
>
> I can sort of imagine a BIOS engineer thinking that if the device
> looks like it's in use, we shouldn't reset it, and it's still
> conceivable that some sort of Linux shutdown race could leave it
> looking like it's in use. But you've been working with Lenovo on
> this, and it seems like that would be pretty obvious to somebody with
> the BIOS source (though I just demonstrated above that even with the
> source it's easy to miss things).

Yeah, that's what I thought as well! This experience has taught me that
there's a surprising amount of things about BIOSes that BIOS engineers don't
seem to know aboutâ

>
> I'm out of ideas, so I think your quirk is the best way forward. I
> trimmed out some of the commit log backtraces and such, added the
> bugzilla, and tweaked the patch to use pci_iomap() instead of
> ioremap(). Would the patch below work for you?

Works for me, tested on the problematic P50 as well.

>
>
> commit 18dc5b3c7ddc
> Author: Lyude Paul <lyude@xxxxxxxxxx>
> Date: Tue Feb 12 17:02:30 2019 -0500
>
> PCI: Reset Lenovo ThinkPad P50 nvgpu at boot if necessary
>
> On ThinkPad P50 SKUs with an Nvidia Quadro M1000M instead of the M2000M
> variant, the BIOS does not always reset the secondary Nvidia GPU during
> reboot if the laptop is configured in Hybrid Graphics mode. The reason
> is
> unknown, but the following steps and possibly a good bit of patience
> will
> reproduce the issue:
>
> 1. Boot up the laptop normally in Hybrid Graphics mode
> 2. Make sure nouveau is loaded and that the GPU is awake
> 2. Allow the Nvidia GPU to runtime suspend itself after being idle
> 3. Reboot the machine, the more sudden the better (e.g sysrq-b may
> help)
> 4. If nouveau loads up properly, reboot the machine again and go back
> to
> step 2 until you reproduce the issue
>
> This results in some very strange behavior: the GPU will be left in
> exactly
> the same state it was in when the previously booted kernel started the
> reboot. This has all sorts of bad side effects: for starters, this
> completely breaks nouveau starting with a mysterious EVO channel failure
> that happens well before we've actually used the EVO channel for
> anything:
>
> nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400 00001000
> 00000002
>
> This causes a timeout trying to bring up the GR ctx:
>
> nouveau 0000:01:00.0: timeout
> WARNING: CPU: 0 PID: 12 at
> drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.c:1547
> gf100_grctx_generate+0x7b2/0x850 [nouveau]
> Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET82W (1.55 )
> 12/18/2018
> Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
> ...
> nouveau 0000:01:00.0: gr: wait for idle timeout (en: 1, ctxsw: 0,
> busy: 1)
> nouveau 0000:01:00.0: gr: wait for idle timeout (en: 1, ctxsw: 0,
> busy: 1)
> nouveau 0000:01:00.0: fifo: fault 01 [WRITE] at 0000000000008000
> engine 00 [GR] client 15 [HUB/SCC_NB] reason c4 [] on channel -1 [0000000000
> unknown]
>
> The GPU never manages to recover. Booting without loading nouveau
> causes
> issues as well, since the GPU starts sending spurious interrupts that
> cause
> other device's IRQs to get disabled by the kernel:
>
> irq 16: nobody cared (try booting with the "irqpoll" option)
> ...
> handlers:
> [<000000007faa9e99>] i801_isr [i2c_i801]
> Disabling IRQ #16
> ...
> serio: RMI4 PS/2 pass-through port at rmi4-00.fn03
> i801_smbus 0000:00:1f.4: Timeout waiting for interrupt!
> i801_smbus 0000:00:1f.4: Transaction timeout
> rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX
> register (-110).
> i801_smbus 0000:00:1f.4: Timeout waiting for interrupt!
> i801_smbus 0000:00:1f.4: Transaction timeout
> rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change
> enabled interrupts!
>
> This causes the touchpad and sometimes other things to get disabled.
>
> Since this happens without nouveau, we can't fix this problem from
> nouveau
> itself.
>
> Add a PCI quirk for the specific P50 variant of this GPU. Make sure the
> GPU is advertising NoReset- so we don't reset the GPU when the machine
> is
> in Dedicated graphics mode (where the GPU being initialized by the BIOS
> is
> normal and expected). Map the GPU MMIO space and read the magic 0x2240c
> register, which will have bit 1 set if the device was POSTed during a
> previous boot. Once we've confirmed all of this, reset the GPU and
> re-disable it - bringing it back to a healthy state.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=203003
> Link:
> https://lore.kernel.org/lkml/20190212220230.1568-1-lyude@xxxxxxxxxx
> Signed-off-by: Lyude Paul <lyude@xxxxxxxxxx>
> Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
> Cc: nouveau@xxxxxxxxxxxxxxxxxxxxx
> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
> Cc: Karol Herbst <kherbst@xxxxxxxxxx>
> Cc: Ben Skeggs <skeggsb@xxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index a59ad09ce911..819a595b0b1d 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5120,3 +5120,61 @@ SWITCHTEC_QUIRK(0x8573); /* PFXI 48XG3 */
> SWITCHTEC_QUIRK(0x8574); /* PFXI 64XG3 */
> SWITCHTEC_QUIRK(0x8575); /* PFXI 80XG3 */
> SWITCHTEC_QUIRK(0x8576); /* PFXI 96XG3 */
> +
> +/*
> + * On Lenovo Thinkpad P50 SKUs with a Nvidia Quadro M1000M, the BIOS does
> + * not always reset the secondary Nvidia GPU between reboots if the system
> + * is configured to use Hybrid Graphics mode. This results in the GPU
> + * being left in whatever state it was in during the *previous* boot, which
> + * causes spurious interrupts from the GPU, which in turn causes us to
> + * disable the wrong IRQ and end up breaking the touchpad. Unsurprisingly,
> + * this also completely breaks nouveau.
> + *
> + * Luckily, it seems a simple reset of the Nvidia GPU brings it back to a
> + * clean state and fixes all these issues.
> + *
> + * When the machine is configured in Dedicated display mode, the issue
> + * doesn't occur. Fortunately the GPU advertises NoReset+ when in this
> + * mode, so we can detect that and avoid resetting it.
> + */
> +static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
> +{
> + void __iomem *map;
> + int ret;
> +
> + if (pdev->subsystem_vendor != PCI_VENDOR_ID_LENOVO ||
> + pdev->subsystem_device != 0x222e ||
> + !pdev->reset_fn)
> + return;
> +
> + if (pci_enable_device_mem(pdev))
> + return;
> +
> + /*
> + * Based on nvkm_device_ctor() in
> + * drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
> + */
> + map = pci_iomap(pdev, 0, 0x23000);
> + if (!map) {
> + pci_err(pdev, "Can't map MMIO space\n");
> + goto out_disable;
> + }
> +
> + /*
> + * Make sure the GPU looks like it's been POSTed before resetting
> + * it.
> + */
> + if (ioread32(map + 0x2240c) & 0x2) {
> + pci_info(pdev, FW_BUG "GPU left initialized by EFI,
> resetting\n");
> + ret = pci_reset_function(pdev);
> + if (ret < 0)
> + pci_err(pdev, "Failed to reset GPU: %d\n", ret);
> + }
> +
> + iounmap(map);
> +out_disable:
> + pci_disable_device(pdev);
> +}
> +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> + PCI_CLASS_DISPLAY_VGA, 8,
> + quirk_reset_lenovo_thinkpad_p50_nvgpu);
--
Cheers,
Lyude Paul