RE: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]

From: David Laight
Date: Tue Apr 21 2020 - 04:05:31 EST


From: Ard Biesheuvel
> Sent: 21 April 2020 08:02
> On Tue, 21 Apr 2020 at 06:15, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> >
> > Hi David,
> >
> > On Mon, Apr 20, 2020 at 2:32 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
> > > Maybe kernel_fpu_begin() should be passed the address of a location
> > > where the address of an fpu save area buffer can be written.
> > > Then the pre-emption code can allocate the buffer and save the
> > > state into it.
> >
> > Interesting idea. It looks like `struct xregs_state` is only 576
> > bytes. That's not exactly small, but it's not insanely huge either,
> > and maybe we could justifiably stick that on the stack, or even
> > reserve part of the stack allocation for it, which the function would
> > know about, without needing to specify any address.
> >
> > > kernel_fpu_begin() ought also to be passed a parameter saying which
> > > fpu features are required, and return which are allocated.
> > > On x86 this could be used to check for AVX512 (etc) which may be
> > > available in an ISR unless it interrupted inside a kernel_fpu_begin()
> > > section (etc).
> > > It would also allow optimisations if only 1 or 2 fpu registers are
> > > needed (eg for some of the crypto functions) rather than the whole
> > > fpu register set.
> >
> > For AVX512 this probably makes sense, I suppose. But I'm not sure if
> > there are too many bits of crypto code that only use a few registers.
> > There are those accelerated memcpy routines in i915 though -- ever see
> > drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go
> > this way, I wonder if it'd make sense to totally overengineer it and
> > write a gcc/as plugin to create the register mask for us. Or, maybe
> > some checker inside of objtool could help here.
> >
> > Actually, though, the thing I've been wondering about is
> > moving in the complete opposite direction: is there some
> > efficient-enough way that we could allow FPU registers in all contexts
> > always, without the need for kernel_fpu_begin/end? I was reversing
> > ntoskrnl.exe and was kind of impressed (maybe not the right word?) by
> > their judicious use of vectorisation everywhere. I assume a lot of
> > that is being generated by their compiler, which of course gcc could
> > do for us if we let it. Is that an interesting avenue to consider? Or
> > are you pretty certain that it'd be a huge mistake, with an
> > irreversible speed hit?
> >
>
> When I added support for kernel mode SIMD to arm64 originally, there
> was a kernel_neon_begin_partial() that took an int to specify how many
> registers were being used, the reason being that NEON preserve/store
> was fully eager at this point, and arm64 has 32 SIMD registers, many
> of which weren't really used, e.g., in the basic implementation of AES
> based on special instructions.
>
> With the introduction of lazy restore, and SVE handling for userspace,
> we decided to remove this because it didn't really help anymore, and
> made the code more difficult to manage.
>
> However, I think it would make sense to have something like this in
> the general case. I.e., NEON registers 0-3 are always preserved when
> an exception or interrupt (or syscall) is taken, and so they can be
> used anywhere in the kernel. If you want the whole set, you will have
> to use begin/end as before. This would already unlock a few
> interesting cases, like memcpy, xor, and sequences that can easily be
> implemented with only a few registers such as instruction-based AES.
>
> Unfortunately, the compiler needs to be taught about this to be
> completely useful, which means lots of prototyping and benchmarking
> upfront, as the contract will be set in stone once the compilers get
> on board.

You can always just use asm with explicit registers.
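
As a rough illustration of what that could look like, here is a minimal
userspace sketch (not kernel code, and not taken from any existing tree).
The function name xor16() is made up for the example. It XORs one 16-byte
block using two vector registers named explicitly in the asm and listed as
clobbers, so the compiler never chooses vector registers itself. Inside the
kernel today this would still have to sit inside kernel_fpu_begin/end (or
kernel_neon_begin/end); it only becomes usable anywhere under the assumed
"registers 0-3 are always preserved" contract discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: dst[0..15] ^= src[0..15].
 * Only vector registers 0 and 1 are touched, and both are named
 * explicitly so the register usage is fixed by the source, not by
 * the compiler's register allocator.
 */
static void xor16(uint8_t *dst, const uint8_t *src)
{
	assert(dst != NULL && src != NULL);

#if defined(__x86_64__)
	/* SSE2 is baseline on x86-64; unaligned loads/stores are used. */
	__asm__ volatile(
		"movdqu (%0), %%xmm0\n\t"
		"movdqu (%1), %%xmm1\n\t"
		"pxor   %%xmm1, %%xmm0\n\t"
		"movdqu %%xmm0, (%0)"
		:
		: "r"(dst), "r"(src)
		: "xmm0", "xmm1", "memory");
#elif defined(__aarch64__)
	/* Same operation with NEON registers v0 and v1. */
	__asm__ volatile(
		"ld1 {v0.16b}, [%0]\n\t"
		"ld1 {v1.16b}, [%1]\n\t"
		"eor v0.16b, v0.16b, v1.16b\n\t"
		"st1 {v0.16b}, [%0]"
		:
		: "r"(dst), "r"(src)
		: "v0", "v1", "memory");
#else
	/* Portable fallback for other architectures: plain byte loop. */
	for (size_t i = 0; i < 16; i++)
		dst[i] ^= src[i];
#endif
}
```

The downside is the obvious one: every such routine has to be written by
hand per architecture, which is the cost of not teaching the compiler
about the contract.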

David
