Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

From: Huang, Kai
Date: Wed Apr 10 2024 - 09:12:21 EST


On Fri, 2024-03-22 at 14:23 -0700, Isaku Yamahata wrote:
> > > +	r = atomic_read(&enable.err);
> > > +	if (!r)
> > > +		r = tdx_module_setup();
> > > +	else
> > > +		r = -EIO;
> > > +	on_each_cpu(vmx_off, &enable.enabled, true);
> > > +	cpus_read_unlock();
> > > +	free_cpumask_var(enable.enabled);
> > > +
> > > +out:
> > > +	return r;
> > > +}
> >
> > Lastly, I think there's one problem here:
> >
> > KVM actually only registers CPU hotplug callback in kvm_init(), which happens
> > way after tdx_hardware_setup().
> >
> > What happens if any CPU goes online *BETWEEN* tdx_hardware_setup() and
> > kvm_init()?
> >
> > Looks like we have two options:
> >
> > 1) move registering CPU hotplug callback before tdx_hardware_setup(), or
> > 2) we need to disable CPU hotplug until callbacks have been registered.
> >
> > Perhaps the second one is easier, because for the first one we need to make sure
> > kvm_cpu_online() is ready to be called right after tdx_hardware_setup().
> >
> > And no one cares if CPU hotplug is disabled during KVM module loading.
> >
> > That being said, we can even just disable CPU hotplug during the entire
> > vt_init(), if in this way the code change is simple?
> >
> > But anyway, to make this patch complete, I think you need to replace
> > vmx_hardware_enable() with vt_hardware_enable() and do tdx_cpu_enable() to handle
> > TDX vs CPU hotplug in _this_ patch.
>
> Option 2 sounds easier. But hardware_enable() doesn't help because it's
> called when the first guest is created. It's risky to change its semantics
> because it's an arch-independent callback.
>
> - Disable CPU hotplug during TDX module initialization.

As we discussed, it turns out to be problematic to do so, because cpus_read_lock()
is also taken by other internal functions like static_call_update(). If we take
cpus_read_lock() for the entire vt_init(), we end up with nested
cpus_read_lock() calls.

> - During hardware_setup(), enable VMX, do tdx_cpu_enable(), and disable VMX
>   on each online cpu. Don't rely on KVM hooks.
> - Add a new arch-independent hook, int kvm_arch_online_cpu(). It's always called
>   on cpu onlining. It eventually calls tdx_cpu_enable(). If that fails, refuse
>   onlining.

So the purpose of kvm_arch_online_cpu() is to always do "VMXON +
tdx_cpu_enable() + VMXOFF" _regardless_ of the kvm_usage_count, so that we can
make sure that:

When TDX is enabled by KVM, all online cpus are TDX-capable (have done
tdx_cpu_enable() successfully).

And the code will be like:

static int kvm_online_cpu(unsigned int cpu)
{
	int ret;

	mutex_lock(&kvm_lock);
	ret = kvm_arch_online_cpu(cpu);
	if (!ret && kvm_usage_count)
		ret = __hardware_enable_nolock();
	mutex_unlock(&kvm_lock);

	return ret;
}

This will need another kvm_x86_ops->online_cpu() where we can implement the TDX
specific "VMXON + tdx_cpu_enable() + VMXOFF":

int kvm_arch_online_cpu(unsigned int cpu)
{
	return static_call(kvm_x86_online_cpu)(cpu);
}
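
The TDX-specific callback behind that hook could then look roughly like the
sketch below (hypothetical; vt_online_cpu() and the exact enable/disable
helpers are my assumptions, not code from the patch):

static int vt_online_cpu(unsigned int cpu)
{
	int ret = 0;

	if (!enable_tdx)
		return 0;

	/* "VMXON + tdx_cpu_enable() + VMXOFF", regardless of kvm_usage_count. */
	ret = vmx_hardware_enable();
	if (!ret) {
		ret = tdx_cpu_enable();
		vmx_hardware_disable();
	}

	return ret;
}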

Somehow I don't quite like this because: 1) it introduces a new
kvm_x86_ops->online_cpu(); 2) it's a little bit silly to do "VMXON +
tdx_cpu_enable() + VMXOFF" just for TDX and then immediately do VMXON when
there's KVM usage.

And IIUC, it will NOT work if kvm_online_cpu() happens when kvm_usage_count > 0:
VMXON has actually already been done on this cpu, so the "VMXON" before
tdx_cpu_enable() will fail. Probably this can be addressed somehow, but it still
doesn't seem nice.

So the above "option 1" doesn't seem right to me.

After thinking again, I think we have been too nervous about missing
tdx_cpu_enable() on CPUs hotplugged between tdx_enable() and kvm_init(), or
while there's no KVM usage.

Instead, I think it's acceptable that we don't do tdx_cpu_enable() for a new
CPU when it is hotplugged in those two cases.

We just need to guarantee that all online cpus are TDX-capable when there's
real KVM usage, i.e., when there's a real VM running.

So "option 2":

I believe we just need to do tdx_cpu_enable() in vt_hardware_enable() after
vmx_hardware_enable() (a sketch follows below):

1) When the first VM is created, KVM will try to do tdx_cpu_enable() for those
CPUs that became online after tdx_enable(), and if any tdx_cpu_enable() fails,
the VM will not be created. Otherwise, all online cpus are TDX-capable.

2) When there's a real VM running and a new CPU successfully becomes online, it
must be TDX-capable.

Failure of tdx_cpu_enable() in 2) is obviously fine.

The consequence of a failure to do tdx_cpu_enable() in 1) is that, besides the
VM not being created, there might be some online CPUs that are not TDX-capable
even though TDX is marked as enabled.

It's fine from KVM's perspective, because literally no VM is running.

From the host kernel's perspective, the only tricky thing is the #MC handler.
It uses a SEAMCALL to read the faulty page's status to determine whether it is
a TDX private page. If the #MC happens on one of those non-TDX-capable cpus,
the SEAMCALL will fail. But that is also fine, as we don't need a precise
result anyway.
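
For illustration, option 2 could look roughly like the following sketch of
vt_hardware_enable() (a sketch under assumptions: the vmx_hardware_enable()/
vmx_hardware_disable() helpers and an enable_tdx knob; not the literal patch):

static int vt_hardware_enable(void)
{
	int ret;

	ret = vmx_hardware_enable();
	if (ret)
		return ret;

	if (enable_tdx) {
		/* This cpu is now in VMX operation, so the SEAMCALL can work. */
		ret = tdx_cpu_enable();
		if (ret)
			vmx_hardware_disable();
	}

	return ret;
}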

Option 3:

We still want to meet our goal:

When TDX is enabled by KVM, all online cpus are TDX-capable.

For that, we can register an *additional* TDX-specific CPU hotplug callback
right after tdx_enable() to handle any CPU hotplug "between tdx_enable() and
kvm_init(), and when there's no KVM usage".

Specifically, we can use a dynamically allocated CPU hotplug state to avoid
having to hard-code another KVM-TDX-specific CPU hotplug callback state:

	r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_DYN, "kvm/cpu/tdx:online",
				      tdx_online_cpu, NULL);

In tdx_online_cpu(), we do tdx_cpu_enable() if enable_tdx is true.

The problem is that if tdx_online_cpu() runs when there's already KVM usage,
kvm_online_cpu() has already done VMXON. So tdx_online_cpu() would need to grab
the @kvm_lock mutex and check @kvm_usage_count to determine whether to do VMXON
before tdx_cpu_enable().
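
Roughly, such a callback would look like the hypothetical sketch below (it
assumes tdx_online_cpu() can see kvm_lock, kvm_usage_count and the VMX
enable/disable helpers, which is exactly the export problem noted next):

static int tdx_online_cpu(unsigned int cpu)
{
	int ret = 0;

	if (!enable_tdx)
		return 0;

	mutex_lock(&kvm_lock);
	if (kvm_usage_count) {
		/* kvm_online_cpu() has already done VMXON on this cpu. */
		ret = tdx_cpu_enable();
	} else {
		/* No KVM usage: do VMXON temporarily just for the SEAMCALL. */
		ret = vmx_hardware_enable();
		if (!ret) {
			ret = tdx_cpu_enable();
			vmx_hardware_disable();
		}
	}
	mutex_unlock(&kvm_lock);

	return ret;
}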

However, both the @kvm_lock mutex and @kvm_usage_count are in kvm.ko, and it's
not nice to export such low-level internals to kvm-intel.ko just for TDX.

Option 4:

To avoid exporting @kvm_lock and @kvm_usage_count, we can still register the
TDX-specific CPU hotplug callback, but have tdx_online_cpu() do an unconditional
tdx_cpu_enable() without doing VMXON first. If that fails because VMXON hasn't
been done, just let it fail. This basically means:

When TDX is enabled, a CPU can only be brought online when there's a VM running.
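
In that case the callback reduces to something like this hypothetical sketch:

static int tdx_online_cpu(unsigned int cpu)
{
	if (!enable_tdx)
		return 0;

	/*
	 * No VMXON here.  If no VM is running, kvm_online_cpu() hasn't done
	 * VMXON on this cpu, so tdx_cpu_enable() fails and the CPU is
	 * refused to come online.
	 */
	return tdx_cpu_enable();
}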

To summarize:

I think "option 2" should be the best solution for now. It's easy to implement
and yet has no real issue.

Any comments?