Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

From: Ackerley Tng
Date: Tue Jul 01 2025 - 17:49:12 EST


"Edgecombe, Rick P" <rick.p.edgecombe@xxxxxxxxx> writes:

> On Tue, 2025-07-01 at 13:01 +0800, Yan Zhao wrote:
>> > Maybe Yan can clarify here. I thought the HWpoison scenario was about
>> > TDX module
>> My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was
>> hit in TDX. i.e., when the page is still mapped in S-EPT but the TD is
>> bugged on and about to tear down.
>>
>> So, it could be due to KVM or TDX module bugs, which retries can't help.
>
> We were going to call back into guestmemfd for this, right? Not set it inside
> KVM code.
>

Perhaps we had different understandings of f/g :P

I meant that TDX module should directly set the HWpoison flag on the
folio (HugeTLB or 4K, guest_memfd or not), not call into guest_memfd.

guest_memfd will then check this flag when necessary, specifically:

* On faults, either into guest or host page tables
* When freeing the page
  * guest_memfd will not return poisoned HugeTLB pages to HugeTLB and
    will just leak them
  * 4K pages will be freed normally, because free_pages_prepare() will
    check for HWpoison and skip freeing, from __folio_put() ->
    free_frozen_pages() -> __free_frozen_pages() ->
    free_pages_prepare()
* I believe guest_memfd doesn't need to check HWpoison on conversions [1]

[1] https://lore.kernel.org/all/diqz5xghjca4.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
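
To make the check points above concrete, here is a rough userspace model of
the flow. It is not kernel code: every type, constant and function name in it
(model_folio, model_tdx_mark_poisoned(), model_gmem_fault(),
model_gmem_free(), MODEL_ERR_POISONED) is made up for illustration, and the
real paths go through the folio flags and the page allocator as described in
the list.

```c
/* Userspace model only: none of these names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ERR_POISONED (-1)	/* placeholder for a real errno value */

struct model_folio {
	bool hwpoison;	/* stands in for the HWpoison flag on the folio */
	bool hugetlb;	/* true: HugeTLB-backed, false: plain 4K page */
};

enum model_free_result {
	MODEL_FREED_TO_ALLOCATOR,	/* healthy 4K page, freed normally */
	MODEL_RETURNED_TO_HUGETLB,	/* healthy HugeTLB page, handed back */
	MODEL_LEAKED_BY_GMEM,		/* poisoned HugeTLB page, never returned */
	MODEL_SKIPPED_BY_ALLOCATOR,	/* poisoned 4K page, free is skipped */
};

/* TDX side: set the flag directly on the folio, no call into guest_memfd. */
static void model_tdx_mark_poisoned(struct model_folio *f)
{
	assert(f != NULL);
	f->hwpoison = true;
}

/* Fault into guest or host page tables: refuse a poisoned folio. */
static int model_gmem_fault(const struct model_folio *f)
{
	assert(f != NULL);
	return f->hwpoison ? MODEL_ERR_POISONED : 0;
}

/* Free path: HugeTLB is checked by guest_memfd, 4K by the allocator. */
static enum model_free_result model_gmem_free(const struct model_folio *f)
{
	assert(f != NULL);
	if (f->hugetlb)
		return f->hwpoison ? MODEL_LEAKED_BY_GMEM
				   : MODEL_RETURNED_TO_HUGETLB;
	/* 4K: guest_memfd frees as usual; the allocator model checks the flag. */
	return f->hwpoison ? MODEL_SKIPPED_BY_ALLOCATOR
			   : MODEL_FREED_TO_ALLOCATOR;
}
```

The point of the model is that only the flag is shared between the two sides:
the TDX side never calls into guest_memfd, and guest_memfd only reads the flag
at fault and free time.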

> What about a kvm_gmem_buggy_cleanup() instead of the system wide one. KVM calls
> it and then proceeds to bug the TD only from the KVM side. It's not as safe for
> the system, because who knows what a buggy TDX module could do. But TDX module
> could also be buggy without the kernel catching wind of it.
>
> Having a single callback to basically bug the fd would solve the atomic context
> issue. Then guestmemfd could dump the entire fd into memory_failure() instead of
> returning the pages. And developers could respond by fixing the bug.
>

This could work too.

I'm in favor of buying into the HWpoison system though, since we're
quite sure this is fair use of HWpoison.

Are you saying kvm_gmem_buggy_cleanup() will just set the HWpoison flag
on the parts of the folios in trouble?
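
To spell out what I mean by "parts of the folios", here is a tiny userspace
model of one possible reading. It is only a sketch: model_huge_folio,
model_buggy_cleanup() and MODEL_PAGES_PER_FOLIO are hypothetical names, and
the 512-page size just assumes a 2M folio made of 4K pages.

```c
/* Userspace model only; not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGES_PER_FOLIO 512	/* assumption: 2M folio of 4K pages */

struct model_huge_folio {
	/* stands in for a per-page HWpoison flag inside one huge folio */
	bool page_poisoned[MODEL_PAGES_PER_FOLIO];
};

/*
 * Poison only pages [start, start + nr) of the folio, clamped to the
 * folio size. Returns how many pages were newly marked.
 */
static size_t model_buggy_cleanup(struct model_huge_folio *f,
				  size_t start, size_t nr)
{
	size_t i, marked = 0;

	assert(f != NULL);
	for (i = start; i < MODEL_PAGES_PER_FOLIO && nr > 0; i++, nr--) {
		if (!f->page_poisoned[i]) {
			f->page_poisoned[i] = true;
			marked++;
		}
	}
	return marked;
}
```

In this reading the rest of the folio stays unpoisoned, as opposed to dumping
the entire fd into memory_failure().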

> IMO maintainability needs to be balanced with efforts to minimize the fallout
> from bugs. In the end a system that is too complex is going to have more bugs
> anyway.
>
>>
>> > bugs. Not TDX busy errors, demote failures, etc. If there are "normal"
>> > failures, like the ones that can be fixed with retries, then I think
>> > HWPoison is not a good option though.
>> >
>> > >   there is a way to make 100%
>> > > sure all memory becomes re-usable by the rest of the host, using
>> > > tdx_buggy_shutdown(), wbinvd, etc?
>>
>> Not sure about this approach. When TDX module is buggy and the page is
>> still accessible to guest as private pages, even with no-more SEAMCALLs
>> flag, is it safe enough for guest_memfd/hugetlb to re-assign the page to
>> allow simultaneous access in shared memory with potential private access
>> from TD or TDX module?
>
> With the no more seamcall's approach it should be safe (for the system). This is
> essentially what we are doing for kexec.