Re: [PATCH v2 0/4] x86/fpu: Reduce unnecessary FNINIT and MXCSR usage

From: Andy Lutomirski
Date: Thu Jan 21 2021 - 00:08:27 EST


On Tue, Jan 19, 2021 at 11:51 PM Krzysztof Olędzki <ole@xxxxxx> wrote:
>
> On 2021-01-19 at 09:38, Andy Lutomirski wrote:
> > This series fixes two regressions: a boot failure on AMD K7 and a
> > performance regression on everything.
> >
> > I did a double-take here -- the regressions were reported by different
> > people, both named Krzysztof :)
> >
> > Changes from v1:
> > - Fix MMX better -- MMX really does need FNINIT.
> > - Improve the EFI code.
> > - Rename the KFPU constants.
> > - Changelog improvements.
> >
> > Andy Lutomirski (4):
> > x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
> > x86/mmx: Use KFPU_387 for MMX string operations
> > x86/fpu: Make the EFI FPU calling convention explicit
> > x86/fpu/64: Don't FNINIT in kernel_fpu_begin()
>
> Hi Andy.
>
> I have quickly tested the new version on E3-1280 V2.
>
> * 5.10.9 + 7ad816762f9bf89e940e618ea40c43138b479e10 reverted (aka unfixed)
> xor: measuring software checksum speed
> avx : 38616 MB/sec
> prefetch64-sse : 25797 MB/sec
> generic_sse : 23147 MB/sec
> xor: using function: avx (38616 MB/sec)
>
> * 5.10.9 (the original)
> xor: measuring software checksum speed
> avx : 23318 MB/sec
> prefetch64-sse : 22562 MB/sec
> generic_sse : 20431 MB/sec
> xor: using function: avx (23318 MB/sec)
>
> * 5.10.9 + "Reduce unnecessary FNINIT and MXCSR usage" v2
> xor: measuring software checksum speed
> avx : 26451 MB/sec
> prefetch64-sse : 25777 MB/sec
> generic_sse : 23178 MB/sec
> xor: using function: avx (26451 MB/sec)
>
> Overall, kernel xor benchmark reports better performance on 5.10.9 than
> on 5.4.90 (see my prev e-mail), but the general trend is the same.
>
> The "unfixed" kernel is much faster for all of avx, prefetch64-sse and
> generic_sse. With the fix, we see the expected perf regression.
>
> Now, with your patchset, both prefetch64-sse and generic_sse are able to
> recover the full performance, as seen on 5.4. However, this is not the
> case for avx. While there is still an improvement, it is nowhere close
> to where it used to be.
>
> I wonder why AVX still sees a regression and if anything more can be
> done about it?
>
> Will do more tests tomorrow.

I'm moderately confident that the problem is that LDMXCSR is
considered a "legacy SSE" instruction and it's triggering the
VEX-to-SSE and SSE-to-VEX penalties. perf could tell you for sure,
and testing with VLDMXCSR might be informative.

I'm not sure whether the best solution is to try to use VLDMXCSR, to
throw some VZEROUPPER instructions in, or to try to get rid of MXCSR
initialization entirely for integer code. VZEROUPPER might be good
regardless, since for all we know, we're coming from user mode and
user mode could have been using SSE. If we do add VZEROUPPER, we
should probably arrange to execute it just once per
user-FP-to-kernel-FP transition.
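
To make the VLDMXCSR / VZEROUPPER idea concrete, here is a rough
userspace sketch. It is not kernel code and not part of this series.
The helper names and the MXCSR_DEFAULT name are made up for
illustration; 0x1f80 is the architectural reset value of MXCSR. It
assumes GCC or clang on x86-64 and falls back to the legacy encoding
when AVX is not usable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; 0x1f80 is the MXCSR reset value (all exceptions masked). */
#define MXCSR_DEFAULT 0x1f80u

/* Legacy-SSE encoding of the MXCSR load. */
static inline void load_mxcsr_legacy(uint32_t v)
{
	__asm__ volatile("ldmxcsr %0" : : "m"(v));
}

/* VEX encoding of the same load; only valid when AVX is usable. */
static inline void load_mxcsr_vex(uint32_t v)
{
	__asm__ volatile("vldmxcsr %0" : : "m"(v));
}

static inline uint32_t read_mxcsr(void)
{
	uint32_t v;

	__asm__ volatile("stmxcsr %0" : "=m"(v));
	return v;
}

/*
 * Hypothetical helper: clear the upper YMM halves, then load MXCSR
 * with the VEX encoding so no legacy-SSE instruction is executed on
 * the AVX path.  VZEROUPPER leaves the low 128 bits of each register
 * untouched, so code built without AVX needs no extra clobbers here.
 */
static uint32_t init_mxcsr(void)
{
	if (__builtin_cpu_supports("avx")) {
		__asm__ volatile("vzeroupper");
		load_mxcsr_vex(MXCSR_DEFAULT);
	} else {
		load_mxcsr_legacy(MXCSR_DEFAULT);
	}
	return read_mxcsr();
}
```

Whether this actually avoids the transition penalty is exactly the
thing that would need measuring; the sketch only shows the
instruction sequence I have in mind.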

--Andy