Re: [PATCH v7 036/102] KVM: x86/mmu: Allow non-zero value for non-present SPTE

From: Yuan Yao
Date: Mon Jul 11 2022 - 03:05:18 EST


On Fri, Jul 08, 2022 at 03:30:23PM +0000, Sean Christopherson wrote:
> Please trim replies.
>
> On Fri, Jul 08, 2022, Yuan Yao wrote:
> > On Mon, Jun 27, 2022 at 02:53:28PM -0700, isaku.yamahata@xxxxxxxxx wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 51306b80f47c..f239b6cb5d53 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -668,6 +668,44 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > > }
> > > }
> > >
> > > +static inline void kvm_init_shadow_page(void *page)
> > > +{
> > > +#ifdef CONFIG_X86_64
> > > +	int ign;
> > > +
> > > +	WARN_ON_ONCE(shadow_nonpresent_value != SHADOW_NONPRESENT_VALUE);
> > > +	asm volatile (
> > > +		"rep stosq\n\t"
>
> I have a slight preference for:
>
> 	asm volatile ("rep stosq\n\t"
> 		      <align here>
> 	);
>
> so that searching for "asm" or "asm volatile" shows the "rep stosq" in the
> result without needing to capture the next line.
>
> > > +		: "=c"(ign), "=D"(page)
> > > +		: "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
> > > +		: "memory"
> > > +	);
> > > +#else
> > > +	BUG();
> > > +#endif
>
> Rather than put the #ifdef here, split mmu_topup_shadow_page_cache() on 64-bit
> versus 32-bit. Then this BUG() goes away and we don't get slapped on the wrist
> by Linus :-)
>
> > > +}
> > > +
> > > +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> > > +{
> > > +	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> > > +	int start, end, i, r;
> > > +	bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
> > > +
> > > +	if (is_tdp_mmu && shadow_nonpresent_value)
> > > +		start = kvm_mmu_memory_cache_nr_free_objects(mc);
> > > +
> > > +	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> > > +	if (r)
> > > +		return r;
>
> Bailing immediately is wrong. If kvm_mmu_topup_memory_cache() fails after allocating
> at least one page, then KVM needs to initialize those pages, otherwise it will leave
> uninitialized pages in the cache. If userspace frees up memory in response to the
> -ENOMEM and resumes the vCPU, KVM will consume uninitialized data.
>
> > > +
> > > +	if (is_tdp_mmu && shadow_nonpresent_value) {
>
> So I'm pretty sure I effectively suggested keeping shadow_nonpresent_value, but
> seeing it in code, I really don't like it. It's an unnecessary check on every
> SPT allocation, and it's misleading because it suggests shadow_nonpresent_value
> might be zero when the TDP MMU is enabled.
>
> My vote is to drop shadow_nonpresent_value and then rename kvm_init_shadow_page()
> to make it clear that it's specific to the TDP MMU.
>
> So this? Completely untested.
>
> #ifdef CONFIG_X86_64
> static void kvm_init_tdp_mmu_shadow_page(void *page)
> {
> 	int ign;
>
> 	asm volatile ("rep stosq\n\t"
> 		      : "=c"(ign), "=D"(page)
> 		      : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
> 		      : "memory"
> 	);
> }
>
> static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> {
> 	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> 	bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
> 	int start, end, i, r;
>
> 	if (is_tdp_mmu)
> 		start = kvm_mmu_memory_cache_nr_free_objects(mc);
>
> 	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
>
> 	/*
> 	 * Note, topup may have allocated objects even if it failed to allocate
> 	 * the minimum number of objects required to make forward progress _at
> 	 * this time_. Initialize newly allocated objects even on failure, as
> 	 * userspace can free memory and rerun the vCPU in response to -ENOMEM.
> 	 */
> 	if (is_tdp_mmu) {
> 		end = kvm_mmu_memory_cache_nr_free_objects(mc);
> 		for (i = start; i < end; i++)
> 			kvm_init_tdp_mmu_shadow_page(mc->objects[i]);
> 	}
> 	return r;
> }
> #else
> static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> {
> 	return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> 					  PT64_ROOT_MAX_LEVEL);
> }
> #endif /* CONFIG_X86_64 */
>
> > > +		end = kvm_mmu_memory_cache_nr_free_objects(mc);
> > > +		for (i = start; i < end; i++)
> > > +			kvm_init_shadow_page(mc->objects[i]);
> > > +	}
> > > +	return 0;
> > > +}
> > > +
>
> ...
>
> > > @@ -5654,7 +5698,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > > vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> > >
> > > -	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > > +	if (!(is_tdp_mmu_enabled(vcpu->kvm) && shadow_nonpresent_value))
> > > +		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> >
> > I'm not sure why this is skipped for TDX; arch.mmu_shadow_page_cache is
> > still used to allocate sp->spt, which is used to track the S-EPT in KVM
> > for a TDX guest. Is there anything I missed here?
>
> Shared EPTEs need to be initialized with SUPPRESS_VE=1, otherwise not-present
> EPT violations would be reflected into the guest by hardware as #VE exceptions.
> This is handled by initializing page allocations via kvm_init_shadow_page() during
> cache topup if shadow_nonpresent_value is non-zero. In that case, telling the
> page allocation to zero-initialize the page would be wasted effort.
>
> The initialization is harmless for S-EPT entries because KVM's copy of the S-EPT
> isn't consumed by hardware, and because under the hood S-EPT entries should never
> #VE (I forget if this is enforced by hardware or if the TDX module sets SUPPRESS_VE).

Ah I see, you're right, thanks for the explanation! I think that with the
changes you suggested above, __GFP_ZERO can be removed from
mmu_shadow_page_cache for VMs where is_tdp_mmu_enabled() is true:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8de26cbde295..0b412f3eb0c5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6483,8 +6483,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
 
-	if (!(tdp_enabled && shadow_nonpresent_value))
-		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+	if (!is_tdp_mmu_enabled(vcpu->kvm))
+		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
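
For my own understanding, here is a rough sketch (illustrative names only,
not the actual definitions from this series) of why the non-zero init
matters for the TDP MMU: with the EPT-violation #VE VM-execution control
enabled, hardware injects a #VE for a not-present EPT entry unless bit 63
("suppress #VE") is set, so SHADOW_NONPRESENT_VALUE presumably carries that
bit and the topup path has to fill fresh pages with it instead of zeroes:

	/* Sketch only: assumes SHADOW_NONPRESENT_VALUE is the suppress-#VE bit. */
	#define EPT_SUPPRESS_VE_BIT	BIT_ULL(63)	/* EPT entry bit 63 */
	#define SHADOW_NONPRESENT_VALUE	EPT_SUPPRESS_VE_BIT

	/* C equivalent of the "rep stosq" in kvm_init_tdp_mmu_shadow_page(). */
	static void init_tdp_mmu_spt(u64 *spt)
	{
		int i;

		for (i = 0; i < PAGE_SIZE / sizeof(u64); i++)
			spt[i] = SHADOW_NONPRESENT_VALUE;
	}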