Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure

From: Zhi Wang
Date: Sat Jan 14 2023 - 04:17:32 EST


On Fri, 13 Jan 2023 15:16:08 +0000
Sean Christopherson <seanjc@xxxxxxxxxx> wrote:

> On Fri, Jan 13, 2023, Zhi Wang wrote:
> > On Thu, 12 Jan 2023 08:31:26 -0800 isaku.yamahata@xxxxxxxxx wrote:
> > > +static void tdx_reclaim_td_page(unsigned long td_page_pa)
> > > +{
> > > + if (!td_page_pa)
> > > + return;
> > > + /*
> > > + * TDCX are being reclaimed. TDX module maps TDCX with HKID
> > > + * assigned to the TD. Here the cache associated to the TD
> > > + * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
> > > + * cache doesn't need to be flushed again.
> > > + */
> > > + if (WARN_ON(tdx_reclaim_page(td_page_pa, false, 0)))
>
> The WARN_ON() can go, tdx_reclaim_page() has WARN_ON_ONCE() +
> pr_tdx_error() in all error paths.
>
> > > + /* If reclaim failed, leak the page. */
> >
> > Better add a FIXME: here as this has to be fixed later.
>
> No, leaking the page is all KVM can reasonably do here. An improved
> comment would be helpful, but no code change is required.
> tdx_reclaim_page() returns an error if and only if there's an
> unexpected, fatal error, e.g. a SEAMCALL with bad params, incorrect
> concurrency in KVM, a TDX Module bug, etc. Retrying at a later point is
> highly unlikely to be successful.

Hi:

The word "leaking" sounds like a situation left unhandled temporarily.

While reviewing this patch, I checked the source code of the TDX module[1]
for the possible reasons the reclaim can fail:

tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_reclaim.c
tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_wbinvd.c

a. Invalid parameters. For example, the page is not aligned, the PA HKID is
not zero...

For invalid parameters, a WARN_ON_ONCE() + return value is good enough, as
that is how the kernel handles similar situations. The caller takes the
responsibility.
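
To illustrate what I mean for case a, here is a rough userspace sketch of the
"warn once and return an error" pattern. All names (sketch_warn_on_once(),
sketch_reclaim_page(), SKETCH_PAGE_SIZE) are made up for illustration and are
not taken from the patch or the TDX module; only the alignment check is
modelled.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed 4K page size */

static int sketch_warn_count;	/* number of warnings actually printed */

/*
 * Rough analogue of WARN_ON_ONCE(): print on the first hit only and
 * return the condition. Unlike the kernel macro, the "once" state here
 * is shared by all callers of this helper, not kept per call site.
 */
static int sketch_warn_on_once(int cond, const char *what)
{
	static int warned;

	if (cond && !warned) {
		warned = 1;
		sketch_warn_count++;
		fprintf(stderr, "WARN: %s\n", what);
	}
	return cond;
}

/* Hypothetical reclaim wrapper: reject bad parameters, let the caller decide. */
static int sketch_reclaim_page(uint64_t pa)
{
	if (sketch_warn_on_once(pa % SKETCH_PAGE_SIZE != 0, "unaligned PA"))
		return -EINVAL;
	return 0;
}
```

A second bad call still returns -EINVAL but does not warn again, so the log
is not flooded while the caller still sees every failure.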

b. Locks are already held in the TDX module. For example, the TDR page is
locked by another SEAMCALL, or another SEAMCALL is doing a PAMT walk and
holding the PAMT lock...

This needs to be improved later, either by retrying or by taking tdx_lock,
so that the TDX module does not fail on this.
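
A minimal sketch of the retry option, written as plain userspace C so it can
be tried standalone. The status codes, reclaim_with_retry() and
fake_reclaim() are hypothetical stand-ins, not the real SEAMCALL interface,
and the retry bound is an arbitrary assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status codes standing in for TDX module return values. */
#define SKETCH_SUCCESS	0u
#define SKETCH_BUSY	1u	/* a lock is held by another SEAMCALL */
#define SKETCH_FATAL	2u	/* anything that retrying cannot fix */

/* Hypothetical stand-in for the reclaim SEAMCALL. */
typedef uint64_t (*reclaim_fn)(uint64_t pa, void *ctx);

/* Retry only while the callee reports "busy", at most max_tries times. */
static uint64_t reclaim_with_retry(reclaim_fn fn, uint64_t pa, void *ctx,
				   int max_tries)
{
	uint64_t err = SKETCH_BUSY;
	int i;

	assert(max_tries > 0);
	for (i = 0; i < max_tries; i++) {
		err = fn(pa, ctx);
		if (err != SKETCH_BUSY)
			break;
		/* Kernel code would cpu_relax() or cond_resched() here. */
	}
	return err;	/* still SKETCH_BUSY if every attempt was busy */
}

/* Fake module for the demo: reports busy for the first *ctx calls. */
static uint64_t fake_reclaim(uint64_t pa, void *ctx)
{
	int *busy_left = ctx;

	(void)pa;
	if (*busy_left > 0) {
		(*busy_left)--;
		return SKETCH_BUSY;
	}
	return SKETCH_SUCCESS;
}
```

Only the busy status is retried; any other error is returned immediately.
The alternative of taking tdx_lock around the call would instead serialize
the SEAMCALLs so that the busy case does not arise in the first place.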

[1] https://www.intel.com/content/www/us/en/download/738875/738876/intel-trust-domain-extension-intel-tdx-module.html