Re: 5.13 i915/PAT regression on Braswell, adding nopat to the kernel commandline works around this

From: Peter Zijlstra
Date: Wed May 12 2021 - 07:16:16 EST


On Wed, May 12, 2021 at 11:57:02AM +0200, Hans de Goede wrote:
> Hi All,
>
> I'm not sure if this is an i915 bug, or caused by changes elsewhere in the kernel,
> so I thought it would be best to just send out an email and then see from there.
>
> With 5.13-rc1 gdm fails to show and dmesg contains:
>
> [ 38.504613] x86/PAT: Xwayland:683 map pfn RAM range req write-combining for [mem 0x23883000-0x23883fff], got write-back
> <repeated lots of times for different ranges>
> [ 39.484766] x86/PAT: gnome-shell:632 map pfn RAM range req write-combining for [mem 0x1c6a3000-0x1c6a3fff], got write-back
> <repeated lots of times for different ranges>
> [ 54.314858] Asynchronous wait on fence 0000:00:02.0:gnome-shell[632]:a timed out (hint:intel_cursor_plane_create [i915])
> [ 58.339769] i915 0000:00:02.0: [drm] GPU HANG: ecode 8:1:86dfdffb, in gnome-shell [632]
> [ 58.341161] i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
> [ 58.341267] i915 0000:00:02.0: [drm] gnome-shell[632] context reset due to GPU hang
>
> Because of the PAT errors I tried adding "nopat" to the kernel commandline
> and I'm happy to report that that works around this.
>
> Any hints on how to debug this further (without doing a full git bisect) would be
> appreciated.

IIRC it's because of 74ffa5a3e685 ("mm: add remap_pfn_range_notrack"),
which added a sanity check to make sure expectations were met. It turns
out they were not.

The bug is not new; the warning is. AFAIK the i915 team is aware, but
beyond that I have not followed it.
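
To see which processes trigger the warning and how often, without bisecting,
the x86/PAT lines quoted above can be tallied from a saved dmesg log. The
following is only a rough sketch: the function name and the regular expression
are assumptions modelled on the two log lines in this thread, not part of any
kernel tooling, and other kernel versions may format the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the "x86/PAT: ... map pfn RAM range req X ..., got Y"
# lines quoted in this thread (an assumption, not an official format spec).
PAT_LINE = re.compile(
    r"x86/PAT: (?P<comm>.+?):(?P<pid>\d+) map pfn RAM range "
    r"req (?P<req>[\w-]+) for "
    r"\[mem (?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)\], "
    r"got (?P<got>[\w-]+)"
)


def summarize_pat_warnings(lines):
    """Count cache-mode mismatches per (comm, pid, requested, granted)."""
    counts = Counter()
    for line in lines:
        match = PAT_LINE.search(line)
        if match:
            key = (match["comm"], int(match["pid"]), match["req"], match["got"])
            counts[key] += 1
    return counts
```

For example, after `dmesg > dmesg.txt`, calling
`summarize_pat_warnings(open("dmesg.txt"))` returns a Counter keyed by process
name, PID, requested cache mode and granted cache mode.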