Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer

From: Andy Lutomirski
Date: Fri May 29 2015 - 14:30:13 EST

Next message: Mike Galbraith: "Re: sched_setscheduler() vs idle_balance() race"
Previous message: Bin Gao: "[PATCH v5 2/2] arch/x86: remove pci uart early console from early_prink.c"
In reply to: Ingo Molnar: "Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer"
Next in thread: Ingo Molnar: "Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, May 29, 2015 at 11:17 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> * Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
>> On Thu, May 28, 2015 at 9:24 AM, Dave Hansen <dave@xxxxxxxx> wrote:
>> > On 05/28/2015 08:01 AM, Ingo Molnar wrote:
>> >> But the real question is: can we support in-use MPX with asynchronous lazy
>> >> restore, while it's still semantically correct? I don't think so, unless you add
>> >> MPX specific synchronous restore to the context switch path, which isn't such a
>> >> good idea IMHO.
>> >
>> > Right now, we assume that the first use of the FPU gets an #NM exception to
>> > tell us that someone is using the FPU. MPX doesn't generate #NM, thus the
>> > need to do it eagerly.
>
> Basically MPX is not really a vector operation, it just uses the xstate (as in
> 'extended CPU state') context area to do easy saves/restores on context switches.
> MPX is an MMU-ish feature.
>
> That's an entirely sensible design approach, which reduces the support code needed
> for MPX, and it's not surprising that MPX accesses were not made conditional on
> CR0::TS.
>
>> > On CPUs that support it we could, instead, do an xgetbv during the context
>> > switch to ensure that all things having an xstate/xfeature but that do not
>> > generate #NM exceptions are in their init state. If they are not in their
>> > init state, we exit lazy mode.
>
> Yeah, no, we don't need to do anything complex here.
>
> This property is something we know when MPX gets enabled, so for MPX tasks we
> should either simply set _TIF_WORK_CTXSW and let __switch_to_xtra() handle it, or
> should slightly modify the eagerfpu choice code to always do eager restores when
> switching to an MPX task.
>

Do we actually know which tasks use MPX, or do we merely know which
tasks use kernel-assisted MPX?
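
For what it's worth, the XGETBV check Dave describes above would sidestep
that question, because it looks at hardware state rather than at kernel
bookkeeping. A rough sketch of the decision follows. The function and mask
names are made up for illustration and are not existing kernel symbols, and
it assumes the caller has already read the in-use bitmap with XGETBV (ECX=1)
on a CPU that supports that form of the instruction:

```c
#include <stdbool.h>
#include <stdint.h>

/* State-component bit positions as defined architecturally for XCR0. */
#define XSTATE_BIT_BNDREGS (1ULL << 3) /* MPX bounds registers */
#define XSTATE_BIT_BNDCSR  (1ULL << 4) /* MPX config and status */

/*
 * Hypothetical mask of features that do not trap on first use while
 * CR0.TS is set. Only the MPX components are listed here.
 */
#define XSTATE_NON_TRAPPING_MASK (XSTATE_BIT_BNDREGS | XSTATE_BIT_BNDCSR)

/*
 * xinuse: bitmap of state components that are NOT in their init state,
 * as the caller obtained it from XGETBV with ECX=1.
 * Returns true if lazy restore would be unsafe for this task.
 */
static bool must_leave_lazy_mode(uint64_t xinuse)
{
    return (xinuse & XSTATE_NON_TRAPPING_MASK) != 0;
}
```

A task with only x87/SSE/AVX state in use would stay eligible for lazy
restore under this check; any task with live MPX state would not.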

> Nothing complex is needed to support the mixed lazy/eager model, the current FPU
> code handles it just fine, because it's already a mixed lazy/eager model :-)
>
>> > We could theoretically use the same kind of thing with the compacted xsave
>> > format to ensure that we only allocate enough space for what we *need* in the
>> > xsave buffer and not allocate for the worst-case. AVX512 has 32x512-bit
>> > registers (2kbytes) and it would be a bit of a shame to need to allocate ~3k
>> > of space.
>>
>> I understand the point of this type of optimization (except that I really don't
>> like the idea of sending SIGBUS or whatever if we fail an allocation at context
>> switch time), but why are we even considering trying to support MPX and lazy fpu
>> at the same time? Judging from all the bug reports, it seems like it's a giant
>> mess, and the code to support lazy restore is not exactly pretty.
>>
>> I would propose that we take the opposite approach and just ban eagerfpu=off
>> when MPX is enabled. We could then take the next step and default eagerfpu=on
>> for everyone and, if nothing breaks, then just delete lazy mode entirely.
>>
>> I suspect we'd have to go back to Pentium 3 or earlier to find a CPU on which
>> lazy mode is actually a good idea. Fiddling with CR0 and handling exceptions is
>> really slow, and I think we should trust CPUs with XSAVEOPT support to do their
>> job and let the older CPUs take the small performance hit, if it even is a
>> performance hit.
>
> It's not that simple, because the decision is not 'lazy versus eager', but 'mixed
> lazy/eager versus eager-only':
>
> Even on modern machines, if a task is not using the FPU (it's doing integer only
> work, with short sleeps just shuffling around requests, etc.) then context
> switches get up to 5-10% faster with lazy FPU restores.

That's only sort of true. I'd believe that a context switch between
two lazy tasks is 5-10% faster than a context switch between two eager
tasks. I bet that a context switch between a lazy task and an eager
task is a whole lot slower than a context switch between two eager
tasks because manipulating CR0.TS is incredibly slow on all modern
CPUs AFAICT. It's even worse in a VM guest.

In other words, with lazy restore, we save the XRSTOR(S) and possibly
a subsequent XSAVEOPT/XSAVES, but the cost is a MOV to CR0 and
possibly a CLTS, and the MOV to CR0 is much, much slower than even a
worst-case XRSTOR(S). In the worst lazy-restore case, we also pay a
full exception roundtrip, and everything else pales in comparison. If
we're a guest, then there are probably a handful of exits thrown in for
good measure.
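
To make that accounting concrete, here is a minimal cost-model sketch. It
is not a measurement: the struct and function names are hypothetical, and
every per-operation cost is a parameter the caller has to supply from their
own benchmarks. It only encodes which operations each path pays for, as
described above, and leaves out any extra VM exits in a guest:

```c
#include <stdbool.h>
#include <stdint.h>

/* Caller-supplied per-operation costs (e.g. cycles); none are built in. */
struct fpu_switch_costs {
    uint64_t xrstor;    /* XRSTOR(S) of the incoming task's state */
    uint64_t xsave;     /* the subsequent XSAVEOPT/XSAVES */
    uint64_t mov_cr0;   /* MOV to CR0 to set TS */
    uint64_t clts;      /* CLTS once the task touches the FPU */
    uint64_t exception; /* full exception roundtrip */
};

/* Eager path: always restore now and save again later. */
static uint64_t eager_switch_cost(const struct fpu_switch_costs *c)
{
    return c->xrstor + c->xsave;
}

/*
 * Lazy path: always pay the MOV to CR0. If the task then touches the
 * FPU, also pay the exception, the CLTS, and the restore and save that
 * lazy mode was trying to avoid.
 */
static uint64_t lazy_switch_cost(const struct fpu_switch_costs *c,
                                 bool task_touches_fpu)
{
    uint64_t cost = c->mov_cr0;

    if (task_touches_fpu)
        cost += c->exception + c->clts + c->xrstor + c->xsave;
    return cost;
}
```

In this model lazy restore only wins when the task never touches the FPU
and mov_cr0 is cheaper than xrstor + xsave, which is exactly the comparison
that needs real numbers.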

For true integer-only tasks, I think we should instead convince glibc
to add things like vzeroall in convenient places to force as much
xstate as possible to the init state, thus speeding up the optimized
save/restore variants.
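
As a userspace illustration only, something along these lines is what I
have in mind. It is a sketch that assumes GCC or Clang on x86; the function
names are hypothetical and this is not an existing glibc interface. It
issues VZEROALL via the _mm256_zeroall() intrinsic when the CPU reports AVX
and does nothing otherwise:

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Compiled with AVX enabled for this one function only. */
static __attribute__((target("avx"))) void zero_vector_state(void)
{
    _mm256_zeroall(); /* emits VZEROALL: zeroes all YMM registers */
}

/*
 * Hypothetical helper: returns true if VZEROALL was executed, false if
 * the CPU (or OS) does not support AVX and nothing was done.
 */
static bool drop_vector_state_if_possible(void)
{
    if (!__builtin_cpu_supports("avx"))
        return false;
    zero_vector_state();
    return true;
}
#else
/* Non-x86 build: nothing to do. */
static bool drop_vector_state_if_possible(void)
{
    return false;
}
#endif
```

A library would only call this at points where no live vector values are
expected, such as just before a long blocking syscall.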

I think the fundamental issue here is that CPU designers care about
xstate save/restore/optimize performance, but they don't care at all
about TS performance, so TS manipulations are probably microcoded and
serializing.

>
> So we have this dynamic measurement code in place in the lazy case that
> opportunistically enables eagerfpu handling on a per task basis, and that method
> works pretty efficiently and has a good hit rate in isolating FPU-users from
> integer-users.
>
> So it's not 'lazy restores versus eager restores', but:
>
> - optimized, mixed lazy and eager use
> vs.
> - eager-only use
>
> Which is a lot less clear-cut a choice.
>
> It's true that right now we forcibly use eagerfpu on all modern CPUs (XSAVE
> supporting ones - in essence modern Intel CPUs) which hides all this - but if you
> re-enable it, it's measurable even on Intel systems. On AMD systems it's the
> current state of affairs.
>
> Also, I'd like to point out that the FPU code is a lot less of a mess in the
> latest x86/fpu tree! ;-)

That part's certainly true.

>
> I'd not give up on lazy restores just yet - or at least not without much better
> measurements backing it all up...

Fair enough. I suspect that the only workloads on which it will win
are old 32-bit distros, though -- even integer-only 64-bit workloads
are likely to use SSE2 for things like memcpy.

--Andy