TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

From: Sean Christopherson
Date: Tue Aug 25 2020 - 00:40:04 EST


+Andy

On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
> And to help with coordination, here is something prepared (slightly)
> earlier.
>
> https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
>
> This identifies the problems from software's perspective, along with
> proposing behaviour which ought to resolve the issues.
>
> It is still a work-in-progress.  The #VE section still needs updating in
> light of the publication of the recent TDX spec.

For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
something we (Linux) as the guest kernel actually want to handle gracefully
(where gracefully means not panicking)? For TDX, a #VE in the SYSCALL gap
would require one of two things:

a) The guest kernel to not accept/validate the GPA->HPA mapping for the
relevant pages, e.g. code or scratch data.

b) The host VMM to remap the GPA (making the GPA->HPA pending again).

(a) is only possible if there's a fatally buggy guest kernel (or perhaps vBIOS).
(b) requires either a buggy or malicious host VMM.

I ask because, if the answer is "no, panic at will", then we shouldn't need
to burn an IST for TDX #VE. Exceptions won't morph to #VE and hitting an
instruction-based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.
Ditto for a #VE in NMI entry before it gets to a thread stack.
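
To make the "panic at will" policy concrete, here is a rough, purely
illustrative model of the decision. None of these names exist in the kernel;
the enums and ve_policy() are hypothetical, and real entry code would look
nothing like this. It only encodes the reasoning above: a #VE that arrives
before we're on a thread stack is treated as fatal, and the cause tells us
who to blame.

```c
#include <stdbool.h>

/* Hypothetical model, not kernel code: where the CPU was when #VE hit. */
enum ve_context {
	VE_CTX_USER,          /* taken from user mode */
	VE_CTX_KERNEL_THREAD, /* kernel mode, already on a thread stack */
	VE_CTX_SYSCALL_GAP,   /* after SYSCALL, before the stack switch */
	VE_CTX_NMI_ENTRY,     /* NMI entry, before reaching a thread stack */
};

/* Hypothetical: the two flavors of #VE discussed in this thread. */
enum ve_cause {
	VE_CAUSE_INSTRUCTION, /* instruction-based #VE */
	VE_CAUSE_MEMORY,      /* access to a pending GPA->HPA mapping */
};

enum ve_verdict {
	VE_HANDLE,                  /* safe to handle normally */
	VE_PANIC_GUEST_OR_VMM_BUG,  /* case (a) or (b) above */
	VE_PANIC_CPU_OR_KERNEL_BUG, /* instruction #VE where none can occur */
};

/* True if the #VE arrived before the kernel switched to a thread stack. */
static bool ve_before_thread_stack(enum ve_context ctx)
{
	return ctx == VE_CTX_SYSCALL_GAP || ctx == VE_CTX_NMI_ENTRY;
}

/* Without an IST for #VE, any #VE in the gap is fatal by policy. */
static enum ve_verdict ve_policy(enum ve_context ctx, enum ve_cause cause)
{
	if (!ve_before_thread_stack(ctx))
		return VE_HANDLE;

	if (cause == VE_CAUSE_MEMORY)
		return VE_PANIC_GUEST_OR_VMM_BUG;

	return VE_PANIC_CPU_OR_KERNEL_BUG;
}
```

E.g. ve_policy(VE_CTX_SYSCALL_GAP, VE_CAUSE_MEMORY) yields
VE_PANIC_GUEST_OR_VMM_BUG, while the same cause from VE_CTX_KERNEL_THREAD
yields VE_HANDLE.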

Am I missing anything?