Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT

From: kas@xxxxxxxxxx
Date: Mon Aug 11 2025 - 03:10:10 EST


On Fri, Aug 08, 2025 at 11:18:40PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote:
> > Dynamic PAMT enabling in the kernel
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > The kernel maintains a refcount for every 2M region, managed with two
> > helpers: tdx_pamt_get() and tdx_pamt_put().
> >
> > The refcount represents the number of users of the PAMT memory in the
> > region. The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
> > TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.
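
For illustration, the get/put pair could look roughly like the sketch
below. The helper names and the refcount layout are assumptions for the
sketch, not the exact code from the series; the real SEAMCALL wrappers
also take the PAMT page pair to add, which is omitted here for brevity.

#include <linux/atomic.h>
#include <linux/mm.h>
#include <linux/pgtable.h>
#include <linux/spinlock.h>

/* Hypothetical wrappers for TDH.PHYMEM.PAMT.{ADD,REMOVE}. */
extern int tdh_phymem_pamt_add(unsigned long hpa_2m);
extern void tdh_phymem_pamt_remove(unsigned long hpa_2m);

/* Sketch only: one refcount per 2M region, one lock for ADD/REMOVE. */
static atomic_t *pamt_refcounts;
static DEFINE_SPINLOCK(pamt_lock);

static atomic_t *pamt_refcount(unsigned long hpa)
{
	return &pamt_refcounts[hpa >> PMD_SHIFT];
}

static int tdx_pamt_get(unsigned long hpa)
{
	atomic_t *cnt = pamt_refcount(hpa);
	int ret = 0;

	/* Fast path: PAMT for this 2M region is already populated. */
	if (atomic_inc_not_zero(cnt))
		return 0;

	spin_lock(&pamt_lock);
	if (atomic_read(cnt) == 0) {
		/* 0->1 transition: back the region with PAMT memory. */
		ret = tdh_phymem_pamt_add(hpa & PMD_MASK);
		if (!ret)
			atomic_set(cnt, 1);
	} else {
		atomic_inc(cnt);
	}
	spin_unlock(&pamt_lock);

	return ret;
}

static void tdx_pamt_put(unsigned long hpa)
{
	atomic_t *cnt = pamt_refcount(hpa);

	/* Fast path: not the last user of the region. */
	if (atomic_add_unless(cnt, -1, 1))
		return;

	spin_lock(&pamt_lock);
	/* 1->0 transition: release the PAMT memory for the region. */
	if (atomic_dec_and_test(cnt))
		tdh_phymem_pamt_remove(hpa & PMD_MASK);
	spin_unlock(&pamt_lock);
}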
> >
> > The function tdx_alloc_page() allocates a new page and ensures that it is
> > backed by PAMT memory. Pages allocated in this manner are ready to be used
> > for a TD. The function tdx_free_page() frees the page and releases the
> > PAMT memory for the 2M region if it is no longer needed.
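
On top of the helpers above, the alloc/free pair could be structured
like this (a sketch under the same assumed names, not the exact code
from the series):

#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *tdx_alloc_page(void)
{
	struct page *page;

	page = alloc_page(GFP_KERNEL);
	if (!page)
		return NULL;

	/* Make sure the 2M region around the page has PAMT memory. */
	if (tdx_pamt_get(page_to_phys(page))) {
		__free_page(page);
		return NULL;
	}

	return page;
}

static void tdx_free_page(struct page *page)
{
	if (!page)
		return;

	/* Drops the region's PAMT memory on the last put. */
	tdx_pamt_put(page_to_phys(page));
	__free_page(page);
}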
> >
> > PAMT memory gets allocated as part of TD init, VCPU init, on populating
> > the SEPT tree, and on adding guest memory (both during TD build and via
> > AUG on accept). Splitting a 2M page into 4K pages also requires PAMT
> > memory.
> >
> > PAMT memory is removed on reclaim of control pages and guest memory.
> >
> > Populating PAMT memory on fault and on split is tricky, as the kernel
> > cannot allocate memory from the context where it is needed. These code
> > paths use pre-allocated PAMT memory pools.
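
Such a pool would follow the usual topup-then-consume pattern (compare
KVM's kvm_mmu_memory_cache): fill it from a sleepable context, then draw
from it in the fault/split path where allocation is not allowed. A
minimal sketch with assumed names:

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>

struct pamt_pool {
	struct list_head pages;
	int nr;
};

/* Called from a sleepable context, before taking MMU locks. */
static int pamt_pool_topup(struct pamt_pool *pool, int min)
{
	while (pool->nr < min) {
		struct page *page = alloc_page(GFP_KERNEL);

		if (!page)
			return -ENOMEM;
		list_add(&page->lru, &pool->pages);
		pool->nr++;
	}
	return 0;
}

/* Called from the fault/split path; must not allocate or sleep. */
static struct page *pamt_pool_take(struct pamt_pool *pool)
{
	struct page *page;

	if (WARN_ON_ONCE(list_empty(&pool->pages)))
		return NULL;

	page = list_first_entry(&pool->pages, struct page, lru);
	list_del(&page->lru);
	pool->nr--;
	return page;
}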
> >
> > Previous attempt at Dynamic PAMT enabling
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > The initial attempt at kernel enabling was quite different. It was built
> > around lazy PAMT allocation: only trying to add a PAMT page pair if a
> > SEAMCALL failed due to a missing PAMT, and reclaiming it based on hints
> > provided by the TDX module.
> >
> > The motivation was to avoid duplicating the PAMT memory refcounting that
> > the TDX module does on the kernel side.
> >
> > That approach was inherently more racy, as there was no serialization of
> > PAMT memory add/remove against SEAMCALLs that add/remove memory for a TD.
> > Such serialization would require global locking, which is not feasible.
> >
> > The approach worked, but at some point it became clear that it could not
> > be robust as long as the kernel avoids TDX_OPERAND_BUSY loops:
> > TDX_OPERAND_BUSY would occur as a result of the races mentioned above.
> >
> > This approach was abandoned in favor of explicit refcounting.
>
> On closer inspection, this new solution also has global locking. It
> opportunistically checks whether there is already a refcount, but otherwise,
> when a PAMT page actually has to be added, a global spinlock is held while
> the PAMT add/remove SEAMCALL is made. I guess this is going to get taken
> somewhere around once per 512 4K private pages, but when it does it has some
> less than ideal properties:
> - Cache line bouncing of the lock between all TDs on the host
> - A global exclusive lock deep inside the TDP MMU shared-lock fault path
> - Heavy contention when two TDs are shutting down at the same time?
>
> As for why not take the lock only as a backup option, like the kick+lock
> solution in KVM: the problem would be losing the refcount race and ending up
> with a PAMT page getting released early.
>
> As far as TDX module locking is concerned (i.e. BUSY error codes from PAMT
> add/remove), it seems this would only happen if PAMT add/remove operate
> simultaneously on the same 2MB HPA region. That is completely prevented by
> the refcount and global lock, but it's a bit heavyweight. It prevents
> simultaneously adding totally separate 2MB regions, when we would only need
> to prevent simultaneous operations on the same 2MB region.
>
> I don't see any other reason for the global spinlock; Kirill, was that it?
> Did you consider also adding a lock per 2MB region, like the refcount? Or
> any other lock granularity besides global? I'm not saying global is
> definitely the wrong choice, but it seems arbitrary if I got the above
> right.

We have discussed this before[1]. Global locking is only problematic when
you actually hit contention. Let's not complicate things until we actually
see it. I failed to demonstrate contention even without huge pages; with
huge pages it is even more dubious that we would ever see it.

[1] https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@xxxxxxxxx/
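
For illustration only, the finer-grained locking discussed above could be
something like a hashed array of spinlocks keyed by the 2M region, so that
unrelated regions never contend. This is a hypothetical sketch, not part
of the series:

#include <linux/hash.h>
#include <linux/kernel.h>
#include <linux/pgtable.h>
#include <linux/spinlock.h>

#define PAMT_LOCK_BITS	10	/* 1024 locks; tunable */

static spinlock_t pamt_locks[1 << PAMT_LOCK_BITS];

static void pamt_locks_init(void)
{
	for (int i = 0; i < ARRAY_SIZE(pamt_locks); i++)
		spin_lock_init(&pamt_locks[i]);
}

/* The same 2M region always maps to the same lock. */
static spinlock_t *pamt_lock_for(unsigned long hpa)
{
	return &pamt_locks[hash_long(hpa >> PMD_SHIFT, PAMT_LOCK_BITS)];
}

The trade-off is more complexity for less contention, which is exactly the
complication the reply above argues should wait until contention is
actually demonstrated.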

--
Kiryl Shutsemau / Kirill A. Shutemov