Re: i915 lockup / extreme delay

From: Andy Lutomirski
Date: Thu Apr 01 2010 - 09:09:56 EST


Karl Vogel wrote:
> On Mon, Mar 22, 2010 at 4:34 PM, Eric Anholt <eric@xxxxxxxxxx> wrote:
>> On Mon, 22 Mar 2010 09:11:06 +0100, Karl Vogel <karl.vogel@xxxxxxxxx> wrote:
>>> On Mon, Mar 22, 2010 at 5:20 AM, Eric Anholt <eric@xxxxxxxxxx> wrote:
>>>> On Sat, 20 Mar 2010 14:41:41 +0100, Karl Vogel <karl.vogel@xxxxxxxxx> wrote:
>>>>> The 'effect' is that only the mouse pointer works in the X server. The
>>>>> CPU usage on the laptop during the sluggishness is minimal. When I
>>>>> suspend the game with winedbg, the X server slowly becomes responsive again.
>>>>>
>>>>> The output from latencytop seems to point to i915 being the culprit:
>>>> If there's some code doing glFlush()es, it's probably that code at
>>>> fault. You don't need to do that unless you're doing frontbuffer
>>>> rendering, and if you're doing frontbuffer rendering you should really
>>>> be doing backbuffer rendering. I don't see a kernel issue here.
>>> That doesn't explain why the box completely locks up on 2.6.34-rc2
>>> though, where only a cold reboot works.
>> Missed that part of the message. If there's a regression, bisect
>> please.

> Apparently the crash was caused by a hardware bug in the Intel chipset,
> which is 8086:2a40 rev 07. While doing the bisect I got an error:
>
> DRHD: handling fault status reg 2
> DMAR:[DMA Write] Request device [00:02.0] fault addr dd69a000
> DMAR:[fault reason 05] PTE Write access is not set
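For anyone else who hits this, the fields in those DMAR lines (direction, requesting device, address, reason code) can be pulled out of a saved dmesg excerpt mechanically. A minimal Python sketch, assuming the message wording is exactly as quoted above (other kernel versions may phrase these lines differently); parse_dmar_faults and its regular expressions are hypothetical helpers, not part of any kernel tool:

```python
import re
from typing import Dict, List, Optional

# Patterns for the two DMAR lines quoted above; assumed to match only this
# exact wording, not every kernel version's fault messages.
REQUEST_RE = re.compile(
    r"DMAR:\[DMA (?P<dir>Read|Write)\] Request device "
    r"\[(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] fault addr (?P<addr>[0-9a-f]+)"
)
REASON_RE = re.compile(r"DMAR:\[fault reason (?P<code>\d+)\] (?P<text>.+)")


def parse_dmar_faults(log: str) -> List[Dict[str, object]]:
    """Pair each 'Request device' line with the 'fault reason' line after it."""
    faults: List[Dict[str, object]] = []
    current: Optional[Dict[str, object]] = None
    for line in log.splitlines():
        m = REQUEST_RE.search(line)
        if m:
            current = {
                "direction": m.group("dir"),
                "device": m.group("dev"),
                "address": int(m.group("addr"), 16),
            }
            faults.append(current)
            continue
        m = REASON_RE.search(line)
        if m and current is not None:
            current["reason_code"] = int(m.group("code"))
            current["reason"] = m.group("text").strip()
            current = None  # this request is now complete
    return faults


log = """DRHD: handling fault status reg 2
DMAR:[DMA Write] Request device [00:02.0] fault addr dd69a000
DMAR:[fault reason 05] PTE Write access is not set"""

for fault in parse_dmar_faults(log):
    print(f"{fault['device']} {fault['direction']} at {fault['address']:#x}: "
          f"reason {fault['reason_code']} ({fault['reason']})")
```

On the excerpt above it reports a single write by device 00:02.0 (normally the integrated graphics on these chipsets) at 0xdd69a000 with reason 5.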

> After some googling around, I found this bugzilla entry, which explains it:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=538163#c58
>
> The issue appears to be that the graphics chip is corrupting memory:
>
> "Unfortunately, this particular chipset sometimes reads from the GTT, does the
> translation, then writes the translated address back to the _original_ GTT
> instead of to the shadow GTT. That's why you're seeing real physical addresses
> where you should have 'virtual DMA addresses', and you get the faults."
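To make the quoted explanation concrete, here is a toy Python model of it, assuming a one-entry GTT and an IOMMU reduced to a dictionary. ToyIommu, gpu_access and the DMA address 0xfff00000 are made up for illustration, and a missing mapping stands in for the real permission fault (reason 05). Only 0xdd69a000 comes from the log above, used here as the physical page the IOMMU hands back:

```python
from typing import Dict, List

PAGE_SIZE = 4096
PAGE_MASK = ~(PAGE_SIZE - 1)


class ToyIommu:
    """Toy IOMMU: a dict from DMA page address to physical page address."""

    def __init__(self, mapping: Dict[int, int]) -> None:
        self.mapping = dict(mapping)

    def translate(self, dma_addr: int) -> int:
        page, offset = dma_addr & PAGE_MASK, dma_addr & ~PAGE_MASK
        if page not in self.mapping:
            # Stand-in for a DMAR fault; the real report was a permission fault.
            raise LookupError(f"DMAR fault addr {dma_addr:x}")
        return self.mapping[page] | offset


def gpu_access(gtt: List[int], shadow: List[int], iommu: ToyIommu,
               index: int, buggy: bool) -> int:
    """Model one GTT lookup by the graphics chip (hypothetical)."""
    dma_addr = gtt[index]             # the driver stores DMA addresses here
    phys = iommu.translate(dma_addr)  # translated through the IOMMU
    if buggy:
        gtt[index] = phys             # erratum: written back to the original GTT
    else:
        shadow[index] = phys          # intended: written to the shadow GTT
    return phys


iommu = ToyIommu({0xFFF00000: 0xDD69A000})  # hypothetical DMA -> physical page

for buggy in (False, True):
    gtt, shadow = [0xFFF00000], [0]
    label = "buggy" if buggy else "correct"
    try:
        for _ in range(2):
            gpu_access(gtt, shadow, iommu, 0, buggy)
        print(label, "-> both accesses succeed")
    except LookupError as err:
        print(label, "->", err)
```

With the intended write-back both accesses succeed. With the erratum, the second lookup treats the physical address left in the GTT as a DMA address and faults on dd69a000, which mirrors (but does not prove) what the log shows.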

> Adding "intel_iommu=igfx_off" to the kernel command line resolved the issue.
> The Fedora kernel automatically disables the IOMMU for the integrated graphics
> when it detects this particular chipset revision.
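A rough userspace check for the same condition, assuming the affected device is the host bridge at PCI address 0000:00:00.0 and that sysfs exposes its vendor, device and revision attributes. This is an illustrative Python sketch, not the Fedora kernel's actual quirk, and igfx_off_requested and read_pci_id are hypothetical names. The intel_iommu= value can be a comma-separated list of options, so the sketch splits on commas:

```python
from pathlib import Path
from typing import Optional, Tuple

# Chipset from this report: 8086:2a40, revision 07.
AFFECTED = (0x8086, 0x2A40, 0x07)


def igfx_off_requested(cmdline: str) -> bool:
    """True if any intel_iommu= option on the command line includes igfx_off."""
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            options = token.split("=", 1)[1].split(",")
            if "igfx_off" in options:
                return True
    return False


def read_pci_id(bdf: str) -> Optional[Tuple[int, int, int]]:
    """Read vendor, device and revision of a PCI device from sysfs, if present."""
    base = Path("/sys/bus/pci/devices") / bdf
    try:
        # sysfs files contain values such as "0x8086\n"; int(..., 16) accepts "0x".
        vendor, device, revision = (
            int((base / name).read_text().strip(), 16)
            for name in ("vendor", "device", "revision")
        )
    except (OSError, ValueError):
        return None
    return vendor, device, revision


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    pci_id = read_pci_id("0000:00:00.0")  # assumed host-bridge address
    if pci_id == AFFECTED and not igfx_off_requested(cmdline):
        print("affected chipset; consider booting with intel_iommu=igfx_off")
    else:
        print("no action suggested by this check")
```

Keeping the command-line parsing separate from the sysfs read makes the logic testable on machines that do not have this chipset.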

> As for the freeze/slowdown right after booting, sysprof shows that more than
> 77% of the time is spent inside drm_mode_getconnector.

http://lists.freedesktop.org/archives/intel-gfx/2010-February/005922.html

I'm waiting for the encoder/connector stuff to get merged before I either pester people about that bug again or try to fix it myself.

You can try the same hack I use (comment out the initialization of all digital outputs) if you don't use them -- that completely fixes it for me.

--Andy
