Re: [Intel-gfx] [KERNEL] Regression bug in drm/i915, Wrong assumption in commit e11aa36 breaks suspend on at least lenovo x61

From: Imre Deak
Date: Thu Feb 19 2015 - 11:39:44 EST


On to, 2015-02-19 at 15:42 +0000, Dave Gordon wrote:
> On 19/02/15 11:08, Deak, Imre wrote:
> > On Thu, 2015-02-19 at 10:47 +0000, Dave Gordon wrote:
> >> On 18/02/15 16:24, Imre Deak wrote:
> >>> On ke, 2015-02-18 at 17:39 +0200, Jani Nikula wrote:
> >>>> On Tue, 17 Feb 2015, Klaus Ethgen <Klaus+lkml@xxxxxxxxx> wrote:
> >>>>> After solving the conflicts, I applied the revert (see attachment) to
> >>>>> v3.18.7. I think it should also apply to the current head. With that
> >>>>> patch, suspend is working again on that version.
> >>>>>
> >>>>> However, I have not to deep knowledge of that subsystem, so please,
> >>>>> someone who have, have a deeper look into it. I especially do not know
> >>>>> if the lines in .../intel_pm.c are correct or better leaving them as
> >>>>> they are in v3.18.7.
> >>>>>
> >>>>> I want to have it working on a version that I know is stable before
> >>>>> asking to pull it to head.
> >>>>
> >>>> Hi Klaus, we fear this patch may hide the actual cause. Would be useful
> >>>> to get a better description of what happens, along with a dmesg with
> >>>> drm.debug=14 module parameter set. This might clutter the mailing list,
> >>>> would you mind filing a bug at [1] and attach the info there?
> >>>>
> >>>> Thanks,
> >>>> Jani.
> >>>>
> >>>> [1] https://bugs.freedesktop.org/enter_bug.cgi?product=DRI&component=DRM/Intel
> >>>
> >>> In addition to the above could you also try the following patch and
> >>> provide a dmesg log on the bugzilla ticket - at this point only to try
> >>> to narrow down the issue?:
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> >>> index d358ce8..02c65f4 100644
> >>> --- a/drivers/gpu/drm/i915/i915_irq.c
> >>> +++ b/drivers/gpu/drm/i915/i915_irq.c
> >>> @@ -4466,6 +4466,14 @@ static irqreturn_t i965_irq_handler(int irq, void *arg)
> >>> I915_DISPLAY_PLANE_A_FLIP_PENDING_INTERRUPT |
> >>> I915_DISPLAY_PLANE_B_FLIP_PENDING_INTERRUPT;
> >>>
> >>> + if (!intel_irqs_enabled(dev_priv)) {
> >>> + if (printk_ratelimit())
> >>> + DRM_ERROR("spurious/shared interrupt iir %08x imr %08x ier %08x\n",
> >>> + I915_READ(IIR), I915_READ(IMR), I915_READ(IER));
> >>> +
> >>> + return IRQ_NONE;
> >>> + }
> >>> +
> >>> iir = I915_READ(IIR);
> >>>
> >>> for (;;) {
> >>> @@ -4766,7 +4774,10 @@ void intel_runtime_pm_disable_interrupts(struct drm_device *dev)
> >>> struct drm_i915_private *dev_priv = dev->dev_private;
> >>>
> >>> dev->driver->irq_uninstall(dev);
> >>> +
> >>> + spin_lock_irq(&dev_priv->irq_lock);
> >>> dev_priv->pm._irqs_disabled = true;
> >>> + spin_unlock_irq(&dev_priv->irq_lock);
> >>> }
> >>>
> >>> /* Restore interrupts so we can recover from runtime PM. */
> >>> @@ -4774,7 +4785,10 @@ void intel_runtime_pm_restore_interrupts(struct drm_device *dev)
> >>> {
> >>> struct drm_i915_private *dev_priv = dev->dev_private;
> >>>
> >>> + spin_lock_irq(&dev_priv->irq_lock);
> >>> dev_priv->pm._irqs_disabled = false;
> >>> + spin_unlock_irq(&dev_priv->irq_lock);
> >>> +
> >>> dev->driver->irq_preinstall(dev);
> >>> dev->driver->irq_postinstall(dev);
> >>> }
> >>
> >> Surely surrounding (what ought to be) an atomic assignment to a single
> >> variable cannot make a difference? Unless it's the memory barrier
> >> semantics that have some effect? It seems unlikely that the compiler has
> >> deferred the write to the variable past the pre/postinstall calls that
> >> actually enable the h/w interrupts, but maybe we should add "volatile"
> >> just in case?
> >
> > spinlocks also serve as a memory barrier. volatile would guarantee only
> > that the compiler doesn't reorder the write, so it wouldn't be enough
> > alone. Otoh, we may also need to add spinlocking to the interrupt
> > handler if the issue turns out to be that we can't access some register
> > past/before intel_runtime_pm_{disable,enable}_interrupts.
> >
> > --Imre
>
> If we were getting interrupts during the enabling sequence, there would
> be three possibilities:
> 1. compiler has delayed writeback. This would be a compiler bug
> (IMHO) but 'volatile' might fix it.
> 2. the 'restore' thread has written the value, but it isn't seen
> by the thread handling the interrupt (on a different cpu?).
> This would be a coherency issue, and a memory barrier should
> fix it. But I would have expected the variable to be in
> coherent memory anyway.
> 3. the GPU h/w sending interrupts before they're enabled :(

For enabling I want to make sure _irqs_disabled=false is stored to
memory before the interrupt handler can run. This flag is in normal
write-back memory, hence you need a memory barrier which is provided by
the spin lock. I could have also used here mb() or rely on the implicit
memory barrier of the iowrite in preinstall/postinstall hooks; I didn't
since as I said we may want to add spin locking to the irq handler too
anyway.

> But I suspect this might be during *disabling* the interrupt. Possibly a
> race condition in which the h/w has already generated an interrupt
> before it's disabled, but that interrupt has not yet been fielded; so
> that by the time the interrupt handler runs (on a different CPU from the
> 'suspend' thread), the write to the variable can have happened and then
> the new value is seen by the interrupt handler.
>
> In which case the tweak above will reduce but not necessarily eliminate
> the window; it will ensure that if the handler has got at least as far
> as taking the spinlock before the suspend thread, then the write will be
> delayed until the handler has finished. OTOH if the interrupt were
> delayed a little bit more, this code might still get to take the
> spinlock before the handler, so the unexpected value would still be seen :(
>
> Maybe you need a two-phase uninstall?

In case we add the locking to the interrupt handler too, it doesn't
matter how much the handler is delayed. We are either before taking the
lock in intel_runtime_pm_disable_interrupts() and so the handler will
see _irqs_disabled=false, or we are after releasing the lock in
intel_runtime_pm_disable_interrupts() and the handler will see
_irqs_disabled=true and just return without doing anything.

But this patch makes already sure that the handler will see
_irqs_disabled=true eventually and I think that's enough for this test.

--Imre


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/