Re: [patch 2.6.13-rc3a] i386: inline restore_fpu

From: Bill Davidsen
Date: Mon Jul 25 2005 - 14:29:34 EST

Next message: Lee Revell: "Re: xor as a lazy comparison"
Previous message: Steven Rostedt: "Re: xor as a lazy comparison"
In reply to: Kenneth Parrish: "Re: [patch 2.6.13-rc3a] i386: inline restore_fpu"
Next in thread: Chuck Ebbert: "Re: [patch 2.6.13-rc3a] i386: inline restore_fpu"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Linus Torvalds wrote:

On Fri, 22 Jul 2005, Adrian Bunk wrote:

If this patch makes a difference, could you do me a favour and check whether replacing the current cpu_has_fxsr #define in
include/asm-i386/cpufeature.h with

#define cpu_has_fxsr 1

on top of your patch brings an additional improvement?

It would be really sad if it made a difference. There might be a branch
mispredict, but the real expense of the fnsave/fxsave will be that
instruction itself, and any cache misses associated with it. The 9%
performace difference would almost have to be due to a memory bank
conflict or something (likely some unnecessary I$ prefetching that
interacts badly with the writeback needed for the _big_ memory write
forced by the fxsave).

I can't see any way that a single branch mispredict could make that big of a difference, but I _can_ see how bad memory access patterns could do it.

Btw, the switch from fnsave to fxsave (and thus the change from a 112-byte
save area to a 512-byte one, or whatever the exact details are) caused
_huge_ performance degradation for various context switching benchmarks. I
really hated that, but obviously the need to support SSE2 made it
non-optional. The point being that the real overhead is that big memory read/write in fxrestor/fxsave.

What _could_ make a bigger difference is not doing the lazy FPU at all. That lazy FPU is a huge optimization on 99.9% of all loads, but it sounds
like java/volanomark are broken and always use the FPU, and then we take a
big hit on doing the FP restore exception (an exception is a lot more
expensive than a mispredict).

It seems expensive to do the save/restore when it isn't needed, that's why the code got lazy. Would it be useful to have a small flag or count field and start by assuming that FPU is not used, and if the exception takes place set the count to unconditionally save the FP state for some number of context switches and then try reverting to lazy save?

That 99.9% may be a guess, but I suspect that there are a lot of applications which alternate between using FPU and not, even if they do use FPU for some parts of the application. That way the performance of lazy save would be realized for the common applications, and the overhead of exception would be greatly reduced for both the ill-behaved and legitimately FPU intensive application.

Something like the following (totally untested) should make it be
non-lazy. It's going to slow down normal task switches, but might speed up the "restoring FP context all the time" case.

Chuck? This should work fine with or without your inline thing. Does it make any difference?

--
-bill davidsen (davidsen@xxxxxxx)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Lee Revell: "Re: xor as a lazy comparison"
Previous message: Steven Rostedt: "Re: xor as a lazy comparison"
In reply to: Kenneth Parrish: "Re: [patch 2.6.13-rc3a] i386: inline restore_fpu"
Next in thread: Chuck Ebbert: "Re: [patch 2.6.13-rc3a] i386: inline restore_fpu"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]