On Thu, Apr 24, 2025 at 04:33:13PM +0800, Yan Zhao wrote:
On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
Basic huge page mapping/unmapping
---------------------------------
- TD build time
This series enforces that all private mappings be 4KB during the TD build
phase, due to the TDX module's restriction that tdh_mem_page_add(), the
SEAMCALL for adding private pages during TD build time, only supports 4KB
mappings. Enforcing 4KB mappings also simplifies the TD build-time code by
eliminating the need to consider merging or splitting in the mirror page
table during TD build time.
The underlying pages allocated from guest_memfd during the TD build phase
can still be large, allowing for potential merging into 2MB mappings once
the TD is running.
It can be done before the TD is running. Merging is allowed during the TD
build stage.
But, yes, for simplicity we can skip it for initial enabling.
Yes, to avoid complicating the kvm_tdx->nr_premapped calculation.
I also don't see any benefit in allowing merging during the TD build stage.
Right. In selftest only.
Page splitting (page demotion)
------------------------------
Page splitting occurs in two paths:
(a) with exclusive kvm->mmu_lock, triggered by zapping operations.
For normal VMs, if zapping a narrow region that would need to split a
huge page, KVM can simply zap the surrounding GFNs rather than
splitting a huge page. The pages can then be faulted back in, where KVM
can handle mapping them at a 4KB level.
TDX can't use the normal VM solution because accepted private memory
cannot simply be zapped and re-faulted: it can only be re-faulted as
unaccepted. So KVM will sometimes have to split pages as part of the
zapping operations.
These zapping operations can occur for a few reasons:
1. VM teardown.
2. Memslot removal.
3. Conversion of private pages to shared.
4. Userspace does a hole punch to guest_memfd for some reason.
For cases 1 and 2, splitting before zapping is unnecessary because
either the entire range will be zapped or huge pages do not span
memslots.
Cases 3 and 4 require splitting, which is also followed by backend
page splitting in guest_memfd.
(b) with shared kvm->mmu_lock, triggered by faults.
Splitting in this path is not accompanied by backend page splitting
(since backend page splitting necessitates a splitting and zapping
operation in the former path). It is triggered when KVM finds that a
non-leaf entry is replacing a huge entry in the fault path, which is
usually caused by vCPUs' concurrent ACCEPT operations at different
levels.
Hm. This sounds like funky behaviour on the guest side.
You only saw it in a synthetic test, right? No real guest OS should do
this.
Also in case of any guest bugs.
It can only be possible if the guest is reckless enough to be exposed to
double accept attacks.
We should consider putting a warning if we detect such a case on the KVM side.
Is it acceptable to put warnings in the host kernel in case of guest bugs or
attacks?
pr_warn_once() shouldn't be a big deal.