Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE

From: Thiago Macieira
Date: Fri Aug 06 2021 - 12:46:38 EST


On Friday, 30 July 2021 07:59:45 PDT Chang S. Bae wrote:
> + for_each_thread(tsk, t) {
> + t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
> + nr_threads++;
> + }
> +
> + if (nr_threads != tsk->signal->nr_threads) {
> + for_each_thread(tsk, t)
> + t->thread.fpu.dynamic_state_perm =
> old_dynstate_perm;
> + pr_err("x86/fpu: ARCH_XSTATE_PERM failed
> as thread number mismatched.\n");
> + return -EBUSY;
> + }
> + return 0;
> +}

Hello all

As I was trying to write the matching userspace code, I think the solution
above had two problems.

First the simpler one: that EBUSY. It must go and you can do that with a lock.
Library code cannot ensure that it is running in single-threaded state and
that no other threads are started or exit while they make the system call.
There's nothing the library in question can do if it got an EBUSY. Do you want
me to try again? What if it fails again? What's the state of the dynamically
permitted states after an EBUSY? It's probably inconsistent. Moreover, there's
an ABA problem there: what happens if a thread starts and another exits while
this system call is running? And what happens if two threads are making this
system call?
(also, shouldn't tsk->signal->nr_threads be an atomic read?)

The second and bigger problem is the consequence of not issuing the
ARCH_SET_STATE_ENABLE call: a SIGILL. Up until now, this hasn't happened, so I
expect this to be a surprise to people, in the worst possible way. The Intel
Software Developer Manual and every single tutorial out there says that the
sequence of actions is:
1) check that OSXSAVE is enabled
2) check that the AVX, AVX512 or AMX instructions are supported with CPUID
3) execute XGETBV EAX=0
4) disable any instructions whose matching state is not enabled by the OS

This is what software developers will write for AMX and any new future state,
until they learn better. This is also all that other OSes will require to run.
Moreover, until developers can actually run their software on CPUs with AMX
support, they will not notice any missed system calls (the Software
Development Emulator tool will execute the instructions whether you've issued
the syscall or not).

As a consequence, there's a large chance that a test escape like that will
cause software to start crashing when run on AMX-capable CPUs when those start
showing up and get enabled in public clouds.

So I have to insist that the XGETBV instruction's result match exactly what is
permitted to run. That means we either enable AMX unconditionally with no need
for system calls (with or without XFD trapping to dynamically allocate more
state), or that the XCR0 register be set without the AMX bits by default,
until the system call is issued.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering