Re: [PATCH v5 25/28] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state

From: Andy Lutomirski
Date: Tue May 25 2021 - 00:47:15 EST




On Mon, May 24, 2021, at 11:15 AM, Len Brown wrote:
> On Sun, May 23, 2021 at 11:28 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> > But what happens if we don't have the XGETBV1 feature? Are we making
> > AMX support depend on XGETBV1?
>
> Yes, AMX systems always have XGETBV.
>
> > How does this patch interact with "[PATCH v5 24/28] x86/fpu/xstate: Use
> > per-task xstate mask for saving xstate in signal frame"? They seem to
> > be try to do something similar but not quite the same, and they seem to
> > be patching the same function. The result seems odd.
>
> The previous patch allowed non-AMX tasks to skip writing 8KB of zeros
> to their signal stack. This patch builds on that to allow an AMX-task
> in INIT to also skip writing 8KB of zeros to its signal stack.

If we ever have a task that is “non-AMX” but has non-INIT AMX, then either we or the hardware has seriously screwed up.

So the logic boils down to: if (a) optimize; if (b) optimize; where a implies b. I maintain my objection — I think this is redundant.

>
> > Finally, isn't part of the point that we need to avoid even *allocating*
> > space for non-AMX-using tasks? That would require writing out the
> > compacted format and/or fiddling with XCR0.
>
> The current signal stack ABI is that the user receives all
> architectural state in
> uncompressed XTATE format on the signal stack. They can XRESTOR it
> and the right thing happens.
>
> We can optimize the data transfer (and we have), but we can't optimize the space
> without changing that ABI.

We are between a rock and a hard place here. On the one hand, our ABI (supposedly — no one seems to have really confirmed this) has the signal frame in uncompacted form. On the other hand, our ABI most definitely has signal frames under 2kB by a decent amount.

I think we should apply a simple patch to Linux to add a boot parameter x86_extra_signal_space=8192 that will pad the signal frame by 8192 bytes. And then we ask people to seriously test their workloads with that parameter set. If things break, then AMX in its current proposed form may be off the table.

At least my dynamic XCR0 proposal has the property that legacy programs and AMX can reliably coexist on the same system.