Re: [PATCH v9 05/18] x86/virt/tdx: Add SEAMCALL infrastructure

From: Huang, Kai
Date: Tue Feb 14 2023 - 03:58:40 EST


On Mon, 2023-02-13 at 23:22 +0000, Huang, Kai wrote:
> On Mon, 2023-02-13 at 14:39 -0800, Dave Hansen wrote:
> > On 2/13/23 03:59, Kai Huang wrote:
> > > SEAMCALL instruction causes #GP when TDX isn't enabled by the BIOS, and #UD
> > > when the CPU is not in VMX operation. The current TDX_MODULE_CALL macro
> > > doesn't handle either of them. There's no way to check whether the CPU is
> > > in VMX operation or not.
> >
> > Really? ... *REALLY*?
> >
> > Like, there's no possible way for the kernel to record whether it has
> > executed VMXON or not?
> >
> > I think what you're saying here is that there's no architecturally
> > visible flag that tells you whether you're at spot #1 or #2 in the following code:
> >
> > static int kvm_cpu_vmxon(u64 vmxon_pointer)
> > {
> > 	u64 msr;
> >
> > 	cr4_set_bits(X86_CR4_VMXE);
> > 	// spot #1
> > 	asm_volatile_goto("1: vmxon %[vmxon_pointer]\n\t"
> > 			  _ASM_EXTABLE(1b, %l[fault])
> > 			  : : [vmxon_pointer] "m"(vmxon_pointer)
> > 			  : : fault);
> > 	// spot #2
> >
>
> Yes, I was talking about an architectural flag rather than a kernel-defined
> software tracking mechanism.
>
> > That's _maybe_ technically correct (I don't know enough about VMX
> > enabling to tell you). But, what I *DO* know is that it's nonsense to
> > say that it's impossible in the *kernel* to tell whether we're on a CPU
> > that's successfully executed VMXON or not.
> >
> > kvm_cpu_vmxon() has two paths through it:
> >
> > 1. Successfully executes VMXON and leaves with X86_CR4_VMXE=1
> > 2. Fails VMXON and leaves with X86_CR4_VMXE=0
> >
> > Guess what? CR4 is rather architecturally visible. From what I can
> > tell, it's *ENTIRELY* plausible to assume that X86_CR4_VMXE==1 means
> > that VMXON has been done.  
> >
>
> Yes, the CR4.VMXE bit can be used for the check. This is what KVM does.
>
> Architecturally, CR4.VMXE only indicates that VMX is enabled, not that VMXON
> has been done, but in the current kernel implementation the two are always
> done together.
>
> So checking CR4 is fine.
>
> > Even if that's wrong, it's only a cpumask and
> > a cpumask_set() away from becoming plausible. Like so:
> >
> > static int kvm_cpu_vmxon(u64 vmxon_pointer)
> > {
> > 	u64 msr;
> >
> > 	cr4_set_bits(X86_CR4_VMXE);
> >
> > 	asm_volatile_goto("1: vmxon %[vmxon_pointer]\n\t"
> > 			  _ASM_EXTABLE(1b, %l[fault])
> > 			  : : [vmxon_pointer] "m"(vmxon_pointer)
> > 			  : : fault);
> > 	// set cpumask bit here
> > 	return 0;
> >
> > fault:
> > 	// clear cpu bit here
> > 	cr4_clear_bits(X86_CR4_VMXE);
> >
> > 	return -EFAULT;
> > }
> >
> > How many design decisions down the line in this series were predicated
> > on the idea that:
> >
> > There's no way to check whether the CPU is
> > in VMX operation or not.
> >
> > ?
>
> Only the assembly code to handle TDX_SEAMCALL_UD in this patch is.  
>
> Whether we have a definitive way to _check_ whether VMXON has been done doesn't
> matter. What impacts the design decisions is that the (non-KVM) kernel doesn't
> support doing VMXON, and we depend on KVM to do that (which is also a design
> decision).
>
> We can remove the assembly code which returns TDX_SEAMCALL_{UD|GP} and replace
> it with a check like the below in seamcall():
>
> static int seamcall(...)
> {
> 	cpu = get_cpu();
>
> 	if (!(cr4_read_shadow() & X86_CR4_VMXE)) {
> 		WARN_ONCE(1, "VMXON isn't done for cpu ...\n");
> 		ret = -EINVAL;
> 		goto out;
> 	}
>
> 	...
>
> out:
> 	put_cpu();
> 	return ret;
> }
>
> But this was actually discussed in v5, where IIUC you preferred having the
> assembly code return an additional TDX_SEAMCALL_UD rather than having the
> above CR4 check:
>
> https://lore.kernel.org/all/77c90075-79d4-7cc7-f266-1b67e586513b@xxxxxxxxx/
>
>

Hmm, I replied too quickly. If we need to consider other non-KVM kernel
components mistakenly calling tdx_enable() without doing VMXON on all online
cpus first, there's one issue with using CR4.VMXE to check whether VMXON has
been done (or even whether VMX has been enabled) in my pseudo seamcall()
implementation above.

The problem is that the seamcall() code above isn't IRQ safe between the
CR4.VMXE check and the actual SEAMCALL.

KVM does VMXON/VMXOFF for all online cpus via IPI:

// when the first VM is created
on_each_cpu(hardware_enable, NULL, 1);

// when the last VM is destroyed
on_each_cpu(hardware_disable, NULL, 1);

Consider this case:

1) KVM does VMXON for all online cpus (a VM is created)
2) Another kernel component calls tdx_enable()
3) KVM does VMXOFF for all online cpus (the last VM is destroyed)

When 2) and 3) happen in parallel on different cpus, the race below can happen:

	CPU 0 (CR4.VMXE enabled)		CPU 1 (CR4.VMXE enabled)

	non-KVM thread calling seamcall()	KVM thread doing VMXOFF via IPI

	Check CR4.VMXE <- pass
						on_each_cpu(hardware_disable)
						send IPI to CPU 0 to do VMXOFF
	// Interrupted
	// IRQ handler does VMXOFF:
	VMXOFF
	clear CR4.VMXE
	// IRQ done, resume seamcall()

	SEAMCALL <-- #UD

So we do need to handle #UD in the assembly if we want tdx_enable() to be safe
in general (i.e. it doesn't cause an Oops even when mistakenly used outside of
KVM).
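To make the window concrete, the interleaving above can be forced
deterministically in a small userspace sketch. This is not kernel code and all
names are made up for illustration: "cr4_vmxe" stands in for the per-cpu
CR4.VMXE bit, a second thread plays the "IPI doing VMXOFF", and a condvar step
counter forces exactly the bad timing between the check and the "SEAMCALL":

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool cr4_vmxe = true;	/* "VMXON was already done by KVM" */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int step;		/* forces the bad interleaving */

static void wait_for(int s)
{
	pthread_mutex_lock(&lock);
	while (step < s)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
}

static void advance(void)
{
	pthread_mutex_lock(&lock);
	step++;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
}

/* "KVM" side: VMXOFF via IPI, landing between the check and the SEAMCALL */
static void *vmxoff_ipi(void *unused)
{
	wait_for(1);		/* seamcall() has passed its check */
	cr4_vmxe = false;	/* VMXOFF + clear CR4.VMXE */
	advance();		/* step = 2 */
	return NULL;
}

/* "TDX" side: the check-then-SEAMCALL pattern from the pseudocode */
static const char *seamcall_race(void)
{
	pthread_t t;
	const char *ret;

	pthread_create(&t, NULL, vmxoff_ipi, NULL);

	if (!cr4_vmxe)		/* the CR4.VMXE check: passes */
		return "-EINVAL";
	advance();		/* step = 1; the "IRQ" fires here ... */
	wait_for(2);		/* ... and does VMXOFF before we continue */

	ret = cr4_vmxe ? "ok" : "#UD";	/* "SEAMCALL" with VMX off */
	pthread_join(&t, NULL);
	return ret;
}
```

Calling seamcall_race() returns "#UD": the check passed, yet VMX was "turned
off" before the SEAMCALL ran, which is why the check alone can't make
seamcall() safe without also handling #UD (or disabling IRQs around the pair).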

However, in the TDX CPU online callback, checking CR4.VMXE to know whether
VMXON has been done is fine, since KVM will never send an IPI to those
"to-be-online" cpus.
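In other words the online callback can use the plain check-then-SEAMCALL
pattern. A rough pseudocode sketch (the callback name is hypothetical, not the
actual patch):

	static int tdx_online_cpu(unsigned int cpu)
	{
		/*
		 * No IPI can target a cpu that is still coming online, so
		 * CR4.VMXE cannot be cleared between this check and the
		 * SEAMCALL(s) that follow.
		 */
		if (!(cr4_read_shadow() & X86_CR4_VMXE))
			return -EINVAL;

		/* ... do the per-cpu SEAMCALL(s) here ... */
	}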