Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

From: Borislav Petkov
Date: Sat Sep 09 2017 - 15:10:16 EST


On Sat, Sep 09, 2017 at 11:47:33AM -0700, Linus Torvalds wrote:
> The thing is, even with the delayed TLB flushing, I don't think it
> should be *so* delayed that we should be seeing a TLB fill from
> garbage page tables.

Yeah, but we can't know what kind of speculative accesses happen between
the removal from the mask and the actual flushing.

> But the part in Andy's patch that worries me the most is that
>
> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
>
> in enter_lazy_tlb(). It means that we won't be notified by people
> invalidating the page tables, and while we then do re-validate the TLB
> when we switch back from lazy mode, I still worry. I'm not entirely
> convinced by that tlb_gen logic.
>
> I can't actually see anything *wrong* in the tlb_gen logic, but it worries me.

Yeah, sounds like we're uncovering a situation of possibly stale
mappings which we haven't had before. Or at least widening that window.
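
To make that window concrete, here is a toy user-space model of the scheme
being discussed: a CPU that goes lazy drops out of the mm's cpumask, misses
invalidations, and only catches up by comparing generation counters when it
leaves lazy mode. This is only a sketch, not the kernel code. Every name in
it (toy_mm, toy_cpu, toy_enter_lazy() and so on) is made up for illustration,
and it models a single CPU with no concurrency or speculation at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; none of these types or functions exist in the kernel. */
struct toy_mm {
	uint64_t tlb_gen;	/* bumped on every page-table invalidation */
	bool cpu_in_mask;	/* is our single CPU still in the mm's cpumask? */
};

struct toy_cpu {
	uint64_t seen_gen;	/* generation the local TLB contents match */
	unsigned int flushes;	/* number of local flushes performed */
};

/* Going lazy: leave the mask, so later invalidations do not reach us. */
void toy_enter_lazy(struct toy_mm *mm)
{
	mm->cpu_in_mask = false;
}

/* Somebody changes the page tables of this mm. */
void toy_invalidate(struct toy_mm *mm, struct toy_cpu *cpu)
{
	mm->tlb_gen++;
	if (mm->cpu_in_mask) {
		/* Notified: flush right away and record the generation. */
		cpu->seen_gen = mm->tlb_gen;
		cpu->flushes++;
	}
	/* Otherwise the CPU keeps possibly stale entries for now. */
}

/* Leaving lazy mode: rejoin the mask and catch up if we fell behind.
 * Returns true if a flush was needed. */
bool toy_leave_lazy(struct toy_mm *mm, struct toy_cpu *cpu)
{
	mm->cpu_in_mask = true;
	if (cpu->seen_gen == mm->tlb_gen)
		return false;
	cpu->seen_gen = mm->tlb_gen;
	cpu->flushes++;
	return true;
}
```

In this model, between toy_invalidate() and toy_leave_lazy() the CPU's
seen_gen lags behind the mm's tlb_gen. That lag is the window in question;
the model says nothing about what the hardware may do with stale entries
during it.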

And I still need to analyze what that MCE on Markus' machine is saying
exactly. The TlbCacheDis thing is an optimization which does away with
memory type checks. But we probably will have to disable it on those
boxes as we can't guarantee pagetable elements are all in WB mem...

Or we can guarantee them in WB but the lazy flushing delays the actual
clearing of the TLB entries so much so that they end up pointing to
garbage, as you say, which is not in WB mem and thus causes the protocol
error.

Hmm. All still wet.

--
Regards/Gruss,
Boris.
