Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present

From: Adrian Hunter
Date: Fri Jul 04 2025 - 01:38:39 EST


On 03/07/2025 20:06, Vishal Annapurve wrote:
> On Thu, Jul 3, 2025 at 8:37 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>
>> Avoid clearing reclaimed TDX private pages unless the platform is affected
>> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
>> time on unaffected systems.
>>
>> Background
>>
>> KVM currently clears reclaimed TDX private pages using MOVDIR64B, which:
>>
>> - Clears the TD Owner bit (which identifies TDX private memory) and
>> integrity metadata without triggering integrity violations.
>> - Clears poison from cache lines without consuming it, avoiding MCEs on
>> access (refer TDX Module Base spec. 16.5. Handling Machine Check
>> Events during Guest TD Operation).
>>
>> The TDX module also uses MOVDIR64B to initialize private pages before use.
>> If cache flushing is needed, it sets TDX_FEATURES.CLFLUSH_BEFORE_ALLOC.
>> However, KVM currently flushes unconditionally, refer commit 94c477a751c7b
>> ("x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages")
>>
>> In contrast, when private pages are reclaimed, the TDX Module handles
>> flushing via the TDH.PHYMEM.CACHE.WB SEAMCALL.
>>
>> Problem
>>
>> Clearing all private pages during VM shutdown is costly. For guests
>> with a large amount of memory it can take minutes.
>>
>> Solution
>>
>> TDX Module Base Architecture spec. documents that private pages reclaimed
>> from a TD should be initialized using MOVDIR64B, in order to avoid
>> integrity violation or TD bit mismatch detection when later being read
>> using a shared HKID, refer April 2025 spec. "Page Initialization" in
>> section "8.6.2. Platforms not Using ACT: Required Cache Flush and
>> Initialization by the Host VMM"
>>
>> That is an overstatement and will be clarified in coming versions of the
>> spec. In fact, as outlined in "Table 16.2: Non-ACT Platforms Checks on
>> Memory" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
>> Mode" in the same spec, there is no issue accessing such reclaimed pages
>> using a shared key that does not have integrity enabled. Linux always uses
>> KeyID 0 which never has integrity enabled. KeyID 0 is also the TME KeyID
>> which disallows integrity, refer "TME Policy/Encryption Algorithm" bit
>> description in "Intel Architecture Memory Encryption Technologies" spec
>> version 1.6 April 2025. So there is no need to clear pages to avoid
>> integrity violations.
>>
>> There remains a risk of poison consumption. However, in the context of
>> TDX, it is expected that there would be a machine check associated with the
>> original poisoning. On some platforms that results in a panic. However,
>> platforms may support the "SEAM_NR" Machine Check capability, in which case
>> the Linux machine check handler marks the page as poisoned, which prevents it
>> from being allocated anymore, refer commit 7911f145de5fe ("x86/mce:
>> Implement recovery for errors in TDX/SEAM non-root mode")
>>
>> Improvement
>>
>> By skipping the clearing step on unaffected platforms, shutdown time
>> can improve by up to 40%.
>
> This patch looks good to me.
>
> I would like to raise a related topic, is there any requirement for
> zeroing pages on conversion from private to shared before
> userspace/guest faults in the gpa ranges as shared?

For TDX, clearing must still be done on platforms with the
partial-write erratum, X86_BUG_TDX_PW_MCE (SPR and EMR).
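
To illustrate the gating only, here is a rough userspace sketch, not the
actual patch. The names struct platform, has_tdx_pw_mce_bug,
reclaim_private_page() and PAGE_SIZE_BYTES are made up for the example. The
flag stands in for a boot_cpu_has_bug(X86_BUG_TDX_PW_MCE) check, and memset()
stands in for the MOVDIR64B-based clearing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constant; stands in for the kernel's PAGE_SIZE. */
#define PAGE_SIZE_BYTES 4096u

/* Hypothetical stand-in for boot_cpu_has_bug(X86_BUG_TDX_PW_MCE). */
struct platform {
	bool has_tdx_pw_mce_bug;
};

/*
 * Model of the reclaim path. Returns true if the page was cleared.
 * memset() is only a placeholder for the MOVDIR64B-based clear.
 */
static bool reclaim_private_page(const struct platform *p, unsigned char *page)
{
	assert(p != NULL && page != NULL);

	/* Unaffected platform: skip the costly clearing step. */
	if (!p->has_tdx_pw_mce_bug)
		return false;

	/* Affected platform (e.g. SPR, EMR): the page must still be cleared. */
	memset(page, 0, PAGE_SIZE_BYTES);
	return true;
}
```

On an unaffected platform the page contents are left as they are, and on an
affected one the whole page is overwritten before it is handed back.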

>
> If the answer is no for all CoCo architectures then guest_memfd can
> simply just zero pages on allocation for all its users and not worry
> about zeroing later.

In fact TDX does not need private pages to be zeroed on allocation
because the TDX Module always does that.