Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

From: Andy Lutomirski
Date: Fri Sep 08 2017 - 17:47:27 EST


On Fri, Sep 8, 2017 at 10:16 AM, Markus Trippelsdorf
<markus@xxxxxxxxxxxxxxx> wrote:
> On 2017.09.08 at 09:12 -0700, Andy Lutomirski wrote:
>> On Fri, Sep 8, 2017 at 4:30 AM, Markus Trippelsdorf
>> <markus@xxxxxxxxxxxxxxx> wrote:
>> > On 2017.09.08 at 12:39 +0200, Markus Trippelsdorf wrote:
>> >> On 2017.09.08 at 12:35 +0200, Ingo Molnar wrote:
>> >> >
>> >> > * Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx> wrote:
>> >> >
>> >> > > On 2017.09.08 at 11:16 +0200, Borislav Petkov wrote:
>> >> > > > On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
>> >> > > > > On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
>> >> > > > > > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
>> >> > > > > >
>> >> > > > > > CC+ Borislav. He might have access to such a beast
>> >> > > > >
>> >> > > > > Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
>> >> > > > > something similar?
>> >> > > > >
>> >> > > > > Private mail's fine too.
>> >> > > >
>> >> > > > So I don't have exactly your model - mine is model 2, stepping 3 but I see
>> >> > > > something strange too, in dmesg:
>> >> > >
>> >> > > I'm pretty sure the bug is in the merged 'x86-mm-for-linus' branch:
>> >> > > Either Andy's "PCID optimized TLB flushing" (would be my guess) or
>> >> > > 'encrypted memory' support by Tom Lendacky.
>> >> > >
>> >> > > (Bisecting is hard, because sometimes I can compile stuff for over 15
>> >> > > minutes without hitting the bug. At other times the machine locks up
>> >> > > hard when starting X11 already.)
>> >> >
>> >> > Do you have the 72c0098d92ce fix?
>> >>
>> >> Yes. The bug still happens on the current git tree (which has the fix
>> >> already):
>> >
>> > The bug is definitely caused by Andy Lutomirski's "PCID optimized TLB
>> > flushing" patches. Tom is off the hook.
>>
>> I'm pretty sure it can't be PCID per se, since these CPUs are way too
>> old and are very unlikely to have PCID.
>
> Yes, the CPU doesn't support PCID (but it does support PGE).
>
>> It could plausibly be the lazy TLB flushing changes.
>
> Yes, I've narrowed it down to:
>
> commit 94b1b03b519b81c494900cb112aa00ed205cc2d9
> Author: Andy Lutomirski <luto@xxxxxxxxxx>
> Date: Thu Jun 29 08:53:17 2017 -0700
>
> x86/mm: Rework lazy TLB mode and TLB freshness tracking
>
>
> Theoretically you guys should be able to reproduce the issue by using
> the "nopcid" boot option.
>

Any chance you could test with CONFIG_DEBUG_VM=y? There are lots of
potentially useful assertions in that code.

Can you also post your /proc/cpuinfo? And can you re-confirm that a
problematic guest kernel is causing problems in the *host*?