Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs

From: Andy Lutomirski
Date: Mon Feb 23 2015 - 21:32:23 EST


On Mon, Feb 23, 2015 at 6:14 PM, Maciej W. Rozycki <macro@xxxxxxxxxxxxxx> wrote:
> On Mon, 23 Feb 2015, Andy Lutomirski wrote:
>
>> >> After a context switch, the instructions from the old task are no
>> >> longer in the pipeline.
>> >
>> > I'd say it's implementation-specific. As I mentioned the i486 aborted
>> > any transcendental x87 instruction in progress upon taking an exception or
>> > interrupt. That was a model like you refer to, but as I also mentioned it
>> > had its shortcomings.
>>
>> IRET is serializing, according to the docs (I think) and according
>> to the Intel engineers I asked (I'm absolutely certain about this
>> part). So FPU ops are entirely done at the end of a normal context
>> switch.
>
> No question about the serialising property of IRET, it has been like this
> since the original Pentium implementation. Do you have an architecture
> specification reference to back up your claim though as far as the FPU is
> concerned? I'm asking because I am genuinely curious.
>
> The x87 case is so special, there isn't anything there really that is
> externally observable or should be affected by IRET or any other
> synchronisation barriers apart from WAIT (or a waiting x87 instruction)
> that has been there for this purpose since forever. And it would defeat
> some documented benefits of running the FP pipeline in parallel.

It's plausible that this is special, but I doubt it. Especially since
this optimization would be nuts post-SSE2.

>
> And certainly such synchronisation didn't happen in the old days.
>
>> We also always save the FPU context on every context switch away from
>> a task that used the FPU, even in lazy mode. This is because we might
>> switch the task back in on a different CPU, and we don't want to use
>> an IPI to move the FPU context.
>
> That's an interesting case too, although not necessarily related. If you
> say that we always save the FP context eagerly for the purpose of process
> migration, then sure, that invalidates any benefit we'd have from letting
> the x87 proceed.
>
> However I can see different ways to address this case avoiding the need
> of eager FP context saving or an IPI:
>
> 1. We could bind any currently suspended process with an unsaved FP
> context to the CPU it last executed on.

This would be insane.

>
> 2. We could mark such a process for migration next time and let it execute
> on the CPU that holds its FP context once more, and then save the FP
> context eagerly on the way out.

This would be worse than insane. Now, in order to wake such a process
on a different CPU, we'd have to force a *context switch* on the
source CPU. Now we're replacing a few hundred cycles at worst for a
transcendental function with at least 10k cycles (at a guess) and
possibly many orders of magnitude more if locks are held, plus
priority issues, plus total craziness.
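
To put rough numbers on that comparison, here is a back-of-envelope
sketch in Python. The cycle counts are just the guesses from this mail
(300 stands in for "a few hundred"), not measurements, and the function
name is made up for illustration.

```python
# Back-of-envelope model. All cycle counts are guesses, not measurements.
STALL_CYCLES_WORST = 300       # assumed stand-in for "a few hundred cycles"
FORCED_SWITCH_CYCLES = 10_000  # "at least 10k cycles (at a guess)"


def slowdown_factor(stall_cycles: int = STALL_CYCLES_WORST,
                    forced_switch_cycles: int = FORCED_SWITCH_CYCLES) -> float:
    """Return how many times costlier the forced switch is than the stall."""
    if stall_cycles <= 0:
        raise ValueError("stall_cycles must be positive")
    return forced_switch_cycles / stall_cycles


if __name__ == "__main__":
    print(f"forced switch costs roughly {slowdown_factor():.0f}x the stall")
```

Even with a generous stall estimate, the forced context switch comes out
more than an order of magnitude worse, and that is before counting held
locks or priority effects.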

>
> In some cases a lazily retained FP context would be preempted by another
> process before the process in question would resume anyway. In this case
> any temporary binding to a CPU could be given up.
>
>> Given that we're only talking about old CPUs here, I sincerely doubt
>> that there's any relevant case in which an fxsave can usefully wait
>> for a long-running transcendental op to finish while we continue doing
>> useful work. *Especially* since there will almost certainly be
>> several more mfences or atomic ops before the end of the context
>> switch, even if we're lucky enough to complete the context switching
>> using sysret.
>
> I am not sure what you mean by FXSAVE usefully waiting for an op, please
> elaborate. At the point you've reached FXSAVE and an earlier x87
> instruction hasn't completed, you've already lost. The pipeline will be
> stalled until the x87 instruction has completed and it can be hundreds of
> cycles. My point therefore has been about avoiding executing FXSAVE for
> the old task until absolutely necessary, which with lazy FP context
> switching would be at the next x87 (or SSE) instruction reached by the new
> task.
>
> Likewise I don't see why MFENCE or an atomic operation should affect the
> execution of say FSINCOS. Whether the results of FSINCOS arrive before
> or after MFENCE, etc. are not externally observable.

FSINCOS; FXSAVE; MFENCE had better serialize all the way, no matter
what weird architectural crud is going on.

>
> And I'm not sure if this all affects old CPUs only -- I don't know how
> much x87 software is out there, but after all these years I'd expect quite
> some. Sure, lots of this can be recompiled to use SSE instead, but not
> all, and even where it is feasible, that's an extra burden for people,
> beyond say a routine hardware or Linux distribution or for that matter
> lone kernel upgrade. Therefore I think we need to be careful not to
> pessimise things for a subset of people too much and ideally at all.
>
> And to be clear, I am not against removing lazy FP context switching per
> se. I am just emphasizing the need to be careful with that and to be
> absolutely sure that it does not cause excessive harm.

We're talking about the unusual case in which we context switch within
~100 cycles of a legacy transcendental operation, and, even so,
there's *still* no regression, since we don't optimize this case
today.
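
A toy model of why there is no regression, as a Python sketch. It
assumes one CPU and a list of time slices, each marked with whether the
task touched the FPU. It also assumes that lazy mode saves on the way
out only if the FPU was used (as described earlier in this mail), and
that eager mode saves and restores on every switch. The names are
hypothetical and this is not how the kernel code is structured.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    saves_after_fpu_use: int = 0  # saves that can stall behind an x87 op
    other_saves: int = 0          # saves of state the slice never touched
    restores: int = 0             # FP state restores
    traps: int = 0                # first-use traps (lazy mode only)


def simulate(slices, eager):
    """Count FP save/restore events for a sequence of time slices.

    slices: iterable of booleans, True if the slice used the FPU.
    eager:  True for eager switching, False for lazy switching.
    """
    c = Counters()
    for used_fpu in slices:
        if eager:
            c.restores += 1       # state is restored at switch-in
        elif used_fpu:
            c.traps += 1          # first FPU instruction traps
            c.restores += 1       # trap handler restores the state
        if used_fpu:
            c.saves_after_fpu_use += 1  # saved on switch-out in both modes
        elif eager:
            c.other_saves += 1    # eager mode saves unconditionally
    return c


if __name__ == "__main__":
    pattern = [True, False, True, False, False]
    print("lazy: ", simulate(pattern, eager=False))
    print("eager:", simulate(pattern, eager=True))
```

In this model the saves that follow FPU use are identical in both modes,
so the save that might wait on a transcendental op is already paid
today. What changes with eager switching is the extra unconditional
save and restore traffic, traded against the traps.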

--Andy