Re: [PATCH v3 15/32] KVM: arm64: Prepare the creation of s1 mappings at EL2

From: Will Deacon
Date: Thu Mar 04 2021 - 13:49:07 EST


Hi Quentin,

On Tue, Mar 02, 2021 at 02:59:45PM +0000, Quentin Perret wrote:
> When memory protection is enabled, the EL2 code needs the ability to
> create and manage its own page-table. To do so, introduce a new set of
> hypercalls to bootstrap a memory management system at EL2.
>
> This leads to the following boot flow in nVHE Protected mode:
>
> 1. the host allocates memory for the hypervisor very early on, using
> the memblock API;
>
> 2. the host creates a set of stage 1 page-table for EL2, installs the
> EL2 vectors, and issues the __pkvm_init hypercall;
>
> 3. during __pkvm_init, the hypervisor re-creates its stage 1 page-table
> and stores it in the memory pool provided by the host;
>
> 4. the hypervisor then extends its stage 1 mappings to include a
> vmemmap in the EL2 VA space, hence allowing to use the buddy
> allocator introduced in a previous patch;
>
> 5. the hypervisor jumps back in the idmap page, switches from the
> host-provided page-table to the new one, and wraps up its
> initialization by enabling the new allocator, before returning to
> the host.
>
> 6. the host can free the now unused page-table created for EL2, and
> will now need to issue hypercalls to make changes to the EL2 stage 1
> mappings instead of modifying them directly.
>
> Note that for the sake of simplifying the review, this patch focuses on
> the hypervisor side of things. In other words, this only implements the
> new hypercalls, but does not make use of them from the host yet. The
> host-side changes will follow in a subsequent patch.
>
> Credits to Will for __pkvm_init_switch_pgd.
>
> Co-authored-by: Will Deacon <will@xxxxxxxxxx>
> Signed-off-by: Will Deacon <will@xxxxxxxxxx>
> Signed-off-by: Quentin Perret <qperret@xxxxxxxxxx>
> ---
> arch/arm64/include/asm/kvm_asm.h | 4 +
> arch/arm64/include/asm/kvm_host.h | 7 +
> arch/arm64/include/asm/kvm_hyp.h | 8 ++
> arch/arm64/include/asm/kvm_pgtable.h | 2 +
> arch/arm64/kernel/image-vars.h | 16 +++
> arch/arm64/kvm/hyp/Makefile | 2 +-
> arch/arm64/kvm/hyp/include/nvhe/mm.h | 71 ++++++++++
> arch/arm64/kvm/hyp/nvhe/Makefile | 4 +-
> arch/arm64/kvm/hyp/nvhe/hyp-init.S | 31 +++++
> arch/arm64/kvm/hyp/nvhe/hyp-main.c | 49 +++++++
> arch/arm64/kvm/hyp/nvhe/mm.c | 173 ++++++++++++++++++++++++
> arch/arm64/kvm/hyp/nvhe/setup.c | 195 +++++++++++++++++++++++++++
> arch/arm64/kvm/hyp/pgtable.c | 2 -
> arch/arm64/kvm/hyp/reserved_mem.c | 92 +++++++++++++
> arch/arm64/mm/init.c | 3 +
> 15 files changed, 654 insertions(+), 5 deletions(-)

This mostly looks good to me, but in a patch this size I was bound to spot
a few niggles. It is _huge_!

> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> index c631e29fb001..bc56ea92b812 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> @@ -244,4 +244,35 @@ alternative_else_nop_endif
>
> SYM_CODE_END(__kvm_handle_stub_hvc)
>
> +SYM_FUNC_START(__pkvm_init_switch_pgd)
> + /* Turn the MMU off */
> + pre_disable_mmu_workaround
> + mrs x2, sctlr_el2
> + bic x3, x2, #SCTLR_ELx_M
> + msr sctlr_el2, x3
> + isb
> +
> + tlbi alle2
> +
> + /* Install the new pgtables */
> + ldr x3, [x0, #NVHE_INIT_PGD_PA]
> + phys_to_ttbr x4, x3
> +alternative_if ARM64_HAS_CNP
> + orr x4, x4, #TTBR_CNP_BIT
> +alternative_else_nop_endif
> + msr ttbr0_el2, x4
> +
> + /* Set the new stack pointer */
> + ldr x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> + mov sp, x0
> +
> + /* And turn the MMU back on! */
> + dsb nsh
> + isb
> + msr sctlr_el2, x2
> + ic iallu
> + isb

Comparing with the new-fangled set_sctlr_el1 macro we have, this sequence
isn't quite right. Probably best to introduce set_sctlr_el2, and implement
that and the existing macro in terms of set_sctlr_elX or something like
that.

> +void __noreturn __pkvm_init_finalise(void)
> +{
> + struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
> + struct kvm_cpu_context *host_ctxt = &host_data->host_ctxt;
> + unsigned long nr_pages, reserved_pages, pfn;
> + int ret;
> +
> + /* Now that the vmemmap is backed, install the full-fledged allocator */
> + pfn = hyp_virt_to_pfn(hyp_pgt_base);
> + nr_pages = hyp_s1_pgtable_pages();
> + reserved_pages = hyp_early_alloc_nr_used_pages();
> + ret = hyp_pool_init(&hpool, pfn, nr_pages, reserved_pages);
> + if (ret)
> + goto out;
> +
> + pkvm_pgtable_mm_ops.zalloc_page = hyp_zalloc_hyp_page;
> + pkvm_pgtable_mm_ops.phys_to_virt = hyp_phys_to_virt;
> + pkvm_pgtable_mm_ops.virt_to_phys = hyp_virt_to_phys;
> + pkvm_pgtable_mm_ops.get_page = hyp_get_page;
> + pkvm_pgtable_mm_ops.put_page = hyp_put_page;
> + pkvm_pgtable.mm_ops = &pkvm_pgtable_mm_ops;

Can you do:

pkvm_pgtable_mm_ops = (struct kvm_pgtable_mm_ops) {
.zalloc_page = hyp_zalloc_hyp_page,
.phys_to_virt = ...,
...
};

here?

> +
> +out:
> + /*
> + * We tail-called to here from handle___pkvm_init() and will not return,
> + * so make sure to propagate the return value to the host.
> + */
> + cpu_reg(host_ctxt, 1) = ret;
> +
> + __host_enter(host_ctxt);
> +}
> +
> +int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
> + unsigned long *per_cpu_base, u32 hyp_va_bits)
> +{
> + struct kvm_nvhe_init_params *params;
> + void *virt = hyp_phys_to_virt(phys);
> + void (*fn)(phys_addr_t params_pa, void *finalize_fn_va);
> + int ret;
> +
> + if (phys % PAGE_SIZE || size % PAGE_SIZE)
> + return -EINVAL;

Either PAGE_ALIGNED or '& ~PAGE_MASK' would be better than spelling this
with '%', I reckon.

Anyway, other than these nits:

Acked-by: Will Deacon <will@xxxxxxxxxx>

Will