Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

From: Linus Torvalds
Date: Sun Aug 30 2020 - 14:38:01 EST


On Sun, Aug 30, 2020 at 8:37 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> There's no such thing as "just" using an IST. Using IST opens a huge
> can of worms due to its recursion issues.

I absolutely despise all the x86 "indirect system structures". They
are horrible garbage. IST is only yet another example of that kind of
brokenness, and annoys me particularly because it (and swapgs) were
actually making x86 _worse_.

The old i386 exception model was actually better than what x86-64 did,
and IST is a big part of the problem. Just have a supervisor stack,
and push the state on it. Stop playing games with multiple stacks
depending on some magical indirect system state.

Other examples of stupid and bad indirection:

- the GDT and LDT.

The kernel should never have to use them. It would be much better
if the segment "shadow" state would stop being shadow state, and be
the REAL state that the kernel (and user space, for that matter)
accesses.

Yeah, we got halfway there with MSR_FS/GS_BASE, but what a complete
garbage crock that was. So now we're forced to use the selector *and*
the base register, and they may be out of sync with each other, so
you have the worst of both worlds.
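
A toy model of that selector/base mismatch, as a sketch only. The table
contents and every name here (toy_gdt_base, load_selector, write_base,
in_sync) are made up for illustration, and the selector is simplified to a
plain slot index rather than the real index/TI/RPL encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: one segment register as the CPU sees it. */
struct seg_reg {
    uint16_t selector;  /* visible part, names a descriptor table slot */
    uint64_t base;      /* hidden "shadow" part actually used for addressing */
};

#define TOY_GDT_ENTRIES 4

/* Hypothetical descriptor table: slot index -> base address. */
static const uint64_t toy_gdt_base[TOY_GDT_ENTRIES] = {
    0, 0x1000, 0x2000, 0x3000
};

/* Legacy path: loading a selector refills the shadow base from the table. */
static void load_selector(struct seg_reg *s, uint16_t slot)
{
    assert(slot < TOY_GDT_ENTRIES);
    s->selector = slot;
    s->base = toy_gdt_base[slot];
}

/* MSR-style path: rewrites the base and leaves the selector untouched. */
static void write_base(struct seg_reg *s, uint64_t base)
{
    s->base = base;
}

/* Nonzero when the shadow base still matches what the selector implies. */
static int in_sync(const struct seg_reg *s)
{
    return s->base == toy_gdt_base[s->selector];
}
```

After load_selector(&fs, 1) followed by write_base(&fs, some_other_base),
in_sync(&fs) is false while fs.selector still reads 1, so saving only the
selector or only the base loses state. That is why both have to be tracked.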

Keep the GDT and LDT around for compatibility reasons, so that old
broken programs that want to load the segment state the old-fashioned
way can do so. But make it clear that that is purely for legacy, and
make the modern code just save and restore the actual true
non-indirect segment state.

For new models, give us a way to load base/limit/permissions
directly, and reset them on kernel entry. No more descriptor table
indirection games.

- the IDT and the TSS segment.

Exact same arguments as above. Keep them around for legacy
programs, but let us just set "this is the entrypoint, this is the
kernel stack" as registers. Christ, we're probably better off with one
single entry-point for the whole kernel (ok, give us a separate one
for NMI/MCE/doublefault, since they are _so_ special, and maybe
separate "CPU exceptions" from "external interrupts"), together with
just a register that says what the exception was.
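
Roughly what the software side of that could look like, as a sketch only.
No such hardware interface exists; the cause codes, struct toy_entry_regs
and toy_kernel_entry are hypothetical names, and this shows the
single-entry variant without the separate NMI/MCE/doublefault entry:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cause codes; a real design would define its own encoding. */
enum toy_cause {
    CAUSE_PAGE_FAULT,
    CAUSE_GP_FAULT,
    CAUSE_EXT_IRQ,
    CAUSE_COUNT
};

/* Hypothetical state the CPU hands over at entry: no IDT, no TSS lookup. */
struct toy_entry_regs {
    enum toy_cause cause;   /* "what the exception was" */
    unsigned long kstack;   /* "this is the kernel stack" */
};

/* Counters stand in for real handlers so the dispatch can be observed. */
struct toy_stats {
    unsigned faults;
    unsigned irqs;
    unsigned unknown;
};

/* The one entry point: software dispatches on the cause register. */
static void toy_kernel_entry(const struct toy_entry_regs *r,
                             struct toy_stats *st)
{
    assert(r != NULL && st != NULL);
    switch (r->cause) {
    case CAUSE_PAGE_FAULT:
    case CAUSE_GP_FAULT:
        st->faults++;       /* CPU exceptions */
        break;
    case CAUSE_EXT_IRQ:
        st->irqs++;         /* external interrupts */
        break;
    default:
        st->unknown++;      /* anything the toy model does not know */
        break;
    }
}
```

The dispatch becomes an ordinary switch in kernel code instead of a
table of gates that the CPU walks on its own.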

- swapgs needs to die.

The kernel GS/FS segments should just be separate segment registers
from user space. No "swapping" needed. In CPL0, "gs" just means
something different from user space. No save/restore code for it, no
swapping, no nothing.
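
To make the difference concrete, here is a toy comparison, again only a
sketch. struct swap_model is a simplified picture of the current
exchange-based scheme, struct banked_model is the hypothetical
per-privilege-level version, and all names are invented for the example:

```c
#include <assert.h>
#include <stdint.h>

enum toy_cpl { TOY_CPL0 = 0, TOY_CPL3 = 1 };

/* Current scheme, simplified: one active base plus one stashed base. */
struct swap_model {
    uint64_t active;
    uint64_t stashed;
};

/* Exchanges the two values; correct only if entry/exit calls pair up. */
static void toy_swapgs(struct swap_model *m)
{
    uint64_t tmp = m->active;
    m->active = m->stashed;
    m->stashed = tmp;
}

/* Proposed scheme: two bases, picked by privilege level alone. */
struct banked_model {
    uint64_t gs_base[2];    /* [TOY_CPL0] kernel, [TOY_CPL3] user */
};

/* Reading "gs" needs no state beyond the current privilege level. */
static uint64_t banked_gs(const struct banked_model *m, enum toy_cpl cpl)
{
    assert(cpl == TOY_CPL0 || cpl == TOY_CPL3);
    return m->gs_base[cpl];
}
```

In the swap model, one toy_swapgs() too many on a nested entry leaves the
kernel running on the user base. In the banked model there is nothing to
get wrong: banked_gs(&b, TOY_CPL0) always returns the kernel base.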

Honestly, I think %rsp/%rip could work like that too. Just make "rsp"
and "rip" be a completely different register in kernel mode - rename
it in the front-end of the CPU or whatever.

Imagine not having to save/restore rsp/rip on kernel entry/exit at
all, because returning to user mode just implicitly starts using
ursp/urip. And a context switch uses (fast) MSRs to save/restore the
user state (or, since it's actually a real register in the register
file, just a new "mov" instruction to access the user registers).
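
A toy model of that idea, as a sketch only. Nothing like this exists in
hardware; struct toy_cpu, struct toy_task and the helper names are
hypothetical, and the plain array accesses stand in for whatever MSR or
"mov" access to the user copies a real design would provide:

```c
#include <assert.h>
#include <stdint.h>

enum { TOY_KERNEL = 0, TOY_USER = 1 };

/* Hypothetical CPU with separate kernel and user copies of rsp/rip. */
struct toy_cpu {
    int mode;           /* TOY_KERNEL or TOY_USER */
    uint64_t rsp[2];    /* rsp[TOY_USER] plays the role of "ursp" */
    uint64_t rip[2];    /* rip[TOY_USER] plays the role of "urip" */
};

/* Saved user state of a task that is not currently running. */
struct toy_task {
    uint64_t ursp;
    uint64_t urip;
};

/* Entry and exit only flip the mode; nothing is saved or restored. */
static void toy_enter_kernel(struct toy_cpu *c)   { c->mode = TOY_KERNEL; }
static void toy_return_to_user(struct toy_cpu *c) { c->mode = TOY_USER; }

/* "rsp" and "rip" name whichever copy belongs to the current mode. */
static uint64_t toy_rsp(const struct toy_cpu *c) { return c->rsp[c->mode]; }
static uint64_t toy_rip(const struct toy_cpu *c) { return c->rip[c->mode]; }

/* Only a context switch touches the user copies, and only from the kernel. */
static void toy_context_switch(struct toy_cpu *c, struct toy_task *prev,
                               const struct toy_task *next)
{
    assert(c->mode == TOY_KERNEL);
    prev->ursp = c->rsp[TOY_USER];
    prev->urip = c->rip[TOY_USER];
    c->rsp[TOY_USER] = next->ursp;
    c->rip[TOY_USER] = next->urip;
}
```

The entry/exit path shrinks to a mode flip, and the save/restore cost moves
to the context switch, which is much rarer than a syscall or interrupt.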

Linus