Re: [PATCH 04/26] x86/traps: Add #VE support for TDX guest

From: Kirill A. Shutemov
Date: Thu Dec 30 2021 - 10:41:19 EST


On Thu, Dec 30, 2021 at 11:53:39AM +0100, Borislav Petkov wrote:
> On Thu, Dec 30, 2021 at 11:05:00AM +0300, Kirill A. Shutemov wrote:
> > Hm. The two sentences above the one you quoted describe (maybe badly? I dunno)
> > why #VE doesn't happen in entry paths. Maybe it's not clear that this covers
> > the NMI entry path too.
> >
> > What if I replace the paragraph with these two:
> >
> > The kernel avoids #VEs during the syscall gap and NMI entry code.
>
> because? Explain why here.

Okay.

>
> > Entry code
> > paths do not access TD-shared memory or MMIO regions, and do not use
> > #VE-triggering MSRs, instructions, or CPUID leaves that might generate
> > #VE. Similarly to page faults and breakpoints, #VEs are allowed in NMI
> > handlers once the kernel is ready to deal with nested NMIs.
> >
> > During #VE delivery, all interrupts, including NMIs, are blocked until
> > TDGETVEINFO is called. It prevents #VE nesting until kernel reads the VE
> > info.
>
> This alludes somewhat to the why above.

It addresses the apparent issue with nested #VEs. I consider it to be
separate from the issue of exceptions in the entry code.

> Now, I hear that TDX doesn't generate #VE anymore for the case where the
> HV might have unmapped/made non-private the page which contains the NMI
> entry code.
>
> Explain that here too pls.

Okay.

> And then stick that text over exc_virtualization_exception() so that it
> is clear what's going on and that it can be easily found.

Will do.

>
> And then you still need to deal with
>
> "(and should eventually be a panic, as it is expected panic_on_oops is
> set to 1 for TDX guests)."

I will drop this. Forcing panic_on_oops is out of scope for this patch.

The updated commit message is below. Let me know if something is unclear.

----------------------------8<-------------------------------------------

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:

* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to unmapped pages (EPT violation)

In the settings that Linux will run in, virtualization exceptions are
never generated on accesses to normal, TD-private memory that has been
accepted.

Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard-to-debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause problems: IRET from the exception handler
re-enables NMIs, and a nested NMI will corrupt the NMI stack.

For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory or
MMIO regions, and do not use #VE-triggering MSRs, instructions, or
CPUID leaves that might generate #VE. The VMM can remove memory from
the TD at any point, but access to unaccepted (or missing) private
memory leads to VM termination, not to a #VE.

Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.

During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. This prevents a #VE from nesting before the
kernel has read the VE info.

If a guest kernel action which would normally cause a #VE occurs in
the interrupt-disabled region before TDGETVEINFO, a #DF (double fault)
is delivered to the guest, which will result in an oops.

Add basic infrastructure to handle any #VE which occurs in the kernel
or userspace. Later patches will add handling for specific #VE
scenarios.

For now, convert unhandled #VEs (everything, until later in this
series) so that they appear just like a #GP by calling ve_raise_fault()
directly. The ve_raise_fault() function is similar to the #GP handler
and is responsible for sending SIGSEGV to userspace, making the CPU
die on kernel faults, and notifying debuggers and other die-chain
users.

--
Kirill A. Shutemov