Re: [PATCH v8 053/103] KVM: TDX: TDP MMU TDX support

From: Sean Christopherson
Date: Tue Aug 16 2022 - 11:37:29 EST


On Sun, Aug 07, 2022, isaku.yamahata@xxxxxxxxx wrote:
> +static void tdx_unpin_pfn(struct kvm *kvm, kvm_pfn_t pfn)

Why does this helper exist? KVM should not be pinning private pages, that should
be the purview of the private backing fd.

> +{
> + struct page *page = pfn_to_page(pfn);
> +
> + put_page(page);
> + WARN_ON(!page_count(page) && to_kvm_tdx(kvm)->hkid > 0);

For all patches, please use the ONCE variant or KVM_BUG_ON() where appropriate.
An unlimited WARN/WARN_ON() is all but guaranteed to spam the kernel log and bring
the host to its knees if it fires.

> +}
> +
> +static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, kvm_pfn_t pfn)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + hpa_t hpa = pfn_to_hpa(pfn);
> + gpa_t gpa = gfn_to_gpa(gfn);
> + struct tdx_module_output out;
> + u64 err;
> +
> + if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) ||
> + !kvm_pfn_to_refcounted_page(pfn)))

I'm terribly confused. The cover letter says:

- fd-based private page v7 is integrated. This is mostly same to Chao's patches.

which I interpreted as meaning this series relies on UPM. But if that's the case,
why is KVM manually requiring private memory to be backed by "struct page" and then
manually pinning pages?

> + return;
> +
> + /* TODO: handle large pages. */
> + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> + return;
> +
> + /* To prevent page migration, do nothing on mmu notifier. */
> + get_page(pfn_to_page(pfn));

Again, what's going on here?

> +
> + if (likely(is_td_finalized(kvm_tdx))) {
> + err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &out);
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
> + put_page(pfn_to_page(pfn));
> + }
> + return;

This return is pointless.

> + }
> +}
> +
> +static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, kvm_pfn_t pfn)
> +{
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> + spin_lock(&kvm_tdx->seamcall_lock);
> + __tdx_sept_set_private_spte(kvm, gfn, level, pfn);
> + spin_unlock(&kvm_tdx->seamcall_lock);
> +}
> +
> +static void tdx_sept_drop_private_spte(

Please fix this style for all patches. And not just in function prototypes, but
also in function calls and probably several other places as well. The _strongly_
preferred style is:

static void tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)

> + struct kvm *kvm, gfn_t gfn, enum pg_level level, kvm_pfn_t pfn)
> +{
> + int tdx_level = pg_level_to_tdx_sept_level(level);
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + gpa_t gpa = gfn_to_gpa(gfn);
> + hpa_t hpa = pfn_to_hpa(pfn);
> + hpa_t hpa_with_hkid;
> + struct tdx_module_output out;
> + u64 err = 0;
> +
> + /* TODO: handle large pages. */
> + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> + return;
> +
> + spin_lock(&kvm_tdx->seamcall_lock);

Taking a spinlock completely defeats the purpose of taking mmu_lock for read. This
is definitely not an acceptable approach long term, and even for initial upstreaming
I'm hesitant to punt on adding proper retry logic.

> + if (is_hkid_assigned(kvm_tdx)) {
> + err = tdh_mem_page_remove(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
> + goto unlock;
> + }
> +
> + hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
> + err = tdh_phymem_page_wbinvd(hpa_with_hkid);
> + if (WARN_ON_ONCE(err)) {

Why is the above a KVM_BUG_ON() but this one is not?

> + pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> + goto unlock;
> + }
> + } else
> + /*
> + * The HKID assigned to this TD was already freed and cache
> + * was already flushed. We don't have to flush again.
> + */
> + err = tdx_reclaim_page((unsigned long)__va(hpa), hpa, false, 0);
> +
> +unlock:
> + spin_unlock(&kvm_tdx->seamcall_lock);
> +
> + if (!err)
> + tdx_unpin_pfn(kvm, pfn);
> +}
> +

...

> +static void tdx_handle_changed_private_spte(
> + struct kvm *kvm, const struct kvm_spte_change *change)

I haven't look at this in detail, and probably won't for a few weeks, but my initial
reaction is that I really, really don't like funnelling both insertion and deletion
into a single helper. E.g. it should be trivial to pass the source_pa for PAGE.ADD
via kvm_page_fault instead of shoving it into the struct kvm_tdx (or struct vcpu_tdx).
But smushing the two things together makes that painfully difficult.