Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT

From: Edgecombe, Rick P
Date: Fri Aug 08 2025 - 19:19:10 EST


On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote:
> Dynamic PAMT enabling in kernel
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The kernel maintains a refcount for every 2M region, managed with two
> helpers: tdx_pamt_get() and tdx_pamt_put().
>
> The refcount represents the number of users of the PAMT memory in the
> region. The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
> TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.
>
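
For reference, my reading of the shape is roughly the below. The
pamt_refcount(), tdh_phymem_pamt_add() and tdh_phymem_pamt_remove() names
are stand-ins rather than the actual patch code, and locking against
concurrent callers is left out here (more on that below):

  /* Sketch of the refcounting idea, not the actual patch code. */
  static atomic_t *pamt_refcount(unsigned long hpa); /* one per 2M region */

  static int tdx_pamt_get(unsigned long hpa)
  {
          atomic_t *cnt = pamt_refcount(hpa);

          /* 0->1 transition: add the PAMT page pair for this 2M region */
          if (atomic_inc_return(cnt) == 1)
                  return tdh_phymem_pamt_add(hpa);

          return 0;
  }

  static void tdx_pamt_put(unsigned long hpa)
  {
          atomic_t *cnt = pamt_refcount(hpa);

          /* 1->0 transition: remove and free the PAMT page pair */
          if (atomic_dec_and_test(cnt))
                  tdh_phymem_pamt_remove(hpa);
  }
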
> The function tdx_alloc_page() allocates a new page and ensures that it is
> backed by PAMT memory. Pages allocated in this manner are ready to be used
> for a TD. The function tdx_free_page() frees the page and releases the
> PAMT memory for the 2M region if it is no longer needed.
>
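
Under the same stand-in names, the allocation helpers presumably pair a
normal page allocation with taking a PAMT reference on the 2M region the
page lands in, something like:

  /* Sketch only, same stand-in names as above. */
  struct page *tdx_alloc_page(void)
  {
          struct page *page = alloc_page(GFP_KERNEL);

          if (!page)
                  return NULL;

          /* Make sure the surrounding 2M region has PAMT backing. */
          if (tdx_pamt_get(page_to_phys(page))) {
                  __free_page(page);
                  return NULL;
          }

          return page;
  }

  void tdx_free_page(struct page *page)
  {
          /* Drop the PAMT reference; the last user frees the PAMT pair. */
          tdx_pamt_put(page_to_phys(page));
          __free_page(page);
  }
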
> PAMT memory gets allocated as part of TD init, VCPU init, populating the
> SEPT tree, and adding guest memory (both during TD build and via AUG on
> accept). Splitting a 2M page into 4K pages also requires PAMT memory.
>
> PAMT memory is removed on reclaim of control pages and guest memory.
>
> Populating PAMT memory on fault and on split is tricky, as the kernel
> cannot allocate memory from the context where it is needed. These code
> paths use pre-allocated PAMT memory pools.
>
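
The pool approach is the usual top-up-then-consume pattern, in the same
spirit as KVM's MMU memory caches: fill a small pool from a context that
can sleep, then take from it in the fault/split path where allocation is
not allowed. With hypothetical names, roughly:

  struct pamt_pool {
          struct list_head pages;
          int nr;
  };

  /* Called where sleeping/allocating is fine, before the fault path. */
  int pamt_pool_topup(struct pamt_pool *pool, int min)
  {
          while (pool->nr < min) {
                  struct page *page = alloc_page(GFP_KERNEL);

                  if (!page)
                          return -ENOMEM;
                  list_add(&page->lru, &pool->pages);
                  pool->nr++;
          }
          return 0;
  }

  /* Called from the fault/split path; cannot fail if topped up. */
  struct page *pamt_pool_take(struct pamt_pool *pool)
  {
          struct page *page = list_first_entry_or_null(&pool->pages,
                                                       struct page, lru);

          if (page) {
                  list_del(&page->lru);
                  pool->nr--;
          }
          return page;
  }
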
> Previous attempt at Dynamic PAMT enabling
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The initial attempt at kernel enabling was quite different. It was built
> around lazy PAMT allocation: only trying to add a PAMT page pair if a
> SEAMCALL failed due to a missing PAMT, and reclaiming it based on hints
> provided by the TDX module.
>
> The motivation was to avoid duplicating on the kernel side the PAMT
> memory refcounting that the TDX module already does.
>
> This approach is inherently racier, as there is no serialization of PAMT
> memory add/remove against SEAMCALLs that add/remove memory for a TD.
> Such serialization would require global locking, which is not feasible.
>
> This approach worked, but at some point it became clear that it could not
> be robust as long as the kernel avoids TDX_OPERAND_BUSY loops.
> TDX_OPERAND_BUSY will occur as a result of the races mentioned above.
>
> This approach was abandoned in favor of explicit refcounting.

On closer inspection, this new solution also has global locking. It
opportunistically checks whether the refcount is already elevated, but
otherwise, when a PAMT page actually has to be added, a global spinlock is
held while the PAMT add/remove SEAMCALL is made (pattern sketched below). I
guess this lock is going to get taken somewhere around once per 512 4k
private pages, but when it is taken it has some less than ideal properties:
- Cache line bouncing of the lock between all TDs on the host
- A global exclusive lock deep inside the TDP MMU shared-lock fault path
- Heavy contention when two TDs shut down at the same time?
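
My paraphrase of the slow path, under the same stand-in names as in the
sketches above (again, not the actual patch code):

  static DEFINE_SPINLOCK(pamt_lock);      /* the global lock in question */

  static int tdx_pamt_get(unsigned long hpa)
  {
          atomic_t *cnt = pamt_refcount(hpa);
          int ret = 0;

          /* Opportunistic: PAMT for this 2M region is already populated. */
          if (atomic_inc_not_zero(cnt))
                  return 0;

          /* Every 0->1 transition funnels through one host-wide lock. */
          spin_lock(&pamt_lock);
          if (!atomic_inc_not_zero(cnt)) {
                  ret = tdh_phymem_pamt_add(hpa);         /* SEAMCALL */
                  if (!ret)
                          atomic_set(cnt, 1);
          }
          spin_unlock(&pamt_lock);

          return ret;
  }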

As for why the lock can't be taken only as a backup option, like the kick+lock
solution in KVM: the problem would be losing the refcount race and ending up
with a PAMT page getting released early.

As far as TDX module locking is concerned (i.e. BUSY error codes from PAMT
add/remove), it seems this would only happen if PAMT add/remove operate
simultaneously on the same 2MB HPA region. That is completely prevented by the
refcount and global lock, but it's a bit heavyweight. It prevents
simultaneously adding totally separate 2MB regions, when we would only need
to prevent simultaneous operations on the same 2MB region.
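
For example, a hashed array of spinlocks keyed by the 2MB PFN would keep
the per-region serialization the TDX module wants without the cross-region
contention. A sketch, with arbitrary sizing (each lock would need
spin_lock_init() at init time):

  #include <linux/hash.h>

  #define PAMT_LOCK_BITS  10      /* arbitrary */
  static spinlock_t pamt_locks[1 << PAMT_LOCK_BITS];

  static spinlock_t *pamt_lock_for(unsigned long hpa)
  {
          return &pamt_locks[hash_long(hpa >> PMD_SHIFT, PAMT_LOCK_BITS)];
  }

tdx_pamt_get()/tdx_pamt_put() would then take pamt_lock_for(hpa) instead of
the single global lock, so unrelated 2MB regions would only contend on hash
collisions.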

I don't see any other reason for the global spinlock. Kirill, was that it?
Did you consider also adding a lock per 2MB region, like the refcount? Or any
other granularity of lock besides global? Not saying global is definitely the
wrong choice, but it seems arbitrary if I got the above right.