Re: [Patch v4 11/18] KVM: x86/mmu: Add documentation of NUMA aware page table capability

From: David Matlack
Date: Thu Mar 23 2023 - 17:59:59 EST


On Mon, Mar 06, 2023 at 02:41:20PM -0800, Vipin Sharma wrote:
> Add documentation for KVM_CAP_NUMA_AWARE_PAGE_TABLE capability and
> explain why it is needed.
>
> Signed-off-by: Vipin Sharma <vipinsh@xxxxxxxxxx>
> ---
> Documentation/virt/kvm/api.rst | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 62de0768d6aa..7e3a1299ca8e 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -7669,6 +7669,35 @@ This capability is aimed to mitigate the threat that malicious VMs can
> cause CPU stuck (due to event windows don't open up) and make the CPU
> unavailable to host or other VMs.
>
> +7.34 KVM_CAP_NUMA_AWARE_PAGE_TABLE
> +------------------------------
> +
> +:Architectures: x86
> +:Target: VM
> +:Returns: 0 on success, -EINVAL if vCPUs are already created.
> +
> +This capability allows userspace to enable NUMA aware page tables allocations.

Call out that this capability overrides task mempolicies. e.g.

This capability causes KVM to use a custom NUMA memory policy when
allocating page tables. Specifically, KVM will attempt to co-locate
page table pages with the memory that they map, rather than following
the mempolicy of the current task.

> +NUMA aware page tables are disabled by default. Once enabled, prior to vCPU
> +creation, any page table allocated during the life of a VM will be allocated

The "prior to vCPU creation" part here is confusing because it sounds
like you're talking about any page tables allocated before vCPU
creation. Just delete that part and put it in a separate paragraph.

KVM_CAP_NUMA_AWARE_PAGE_TABLE must be enabled before any vCPU is
created, otherwise KVM will return -EINVAL.
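
It's probably also worth showing how userspace enables this. A minimal
sketch, assuming the capability number from this series is visible in
the uapi headers:

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	/*
	 * Hypothetical example: enable NUMA aware page tables on a
	 * freshly created VM, before the first KVM_CREATE_VCPU call.
	 */
	static int enable_numa_aware_page_tables(int vm_fd)
	{
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_NUMA_AWARE_PAGE_TABLE,
		};

		/* KVM returns -EINVAL if a vCPU already exists. */
		return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}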

> +preferably from the NUMA node of the leaf page.
> +
> +Without this capability, default feature is to use current thread mempolicy and

s/default feature is to/KVM will/

> +allocate page table based on that.

s/and allocate page table based on that./to allocate page tables./

> +
> +This capability is useful to improve page accesses by a guest. For example, an

nit: Be more specific about how.

This capability aims to minimize the cost of TLB misses when a vCPU is
accessing NUMA-local memory, by reducing the number of remote memory
accesses needed to walk KVM's page tables.
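
For context: with four-level paging in both the guest and the TDP MMU,
a single guest TLB miss can take up to 24 page-table memory accesses
(4 + 4 + 4*4), 20 of which are to KVM's page tables, so the locality of
those pages has an outsized effect.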

> +initialization thread which access lots of remote memory and ends up creating
> +page tables on local NUMA node, or some service thread allocates memory on
> +remote NUMA nodes and later worker/background threads accessing that memory
> +will end up accessing remote NUMA node page tables.

It's not clear if these examples are talking about what happens when
KVM_CAP_NUMA_AWARE_PAGE_TABLE is enabled or disabled.

Also it's important to distinguish virtual NUMA nodes from physical NUMA
nodes and where these "threads" are running. How about this:

For example, when KVM_CAP_NUMA_AWARE_PAGE_TABLE is disabled and a vCPU
accesses memory on a remote NUMA node and triggers a KVM page fault,
KVM will allocate page tables to handle that fault on the node where
the vCPU is running rather than the node where the memory is allocated.
When KVM_CAP_NUMA_AWARE_PAGE_TABLE is enabled, KVM will allocate the
page tables on the node where the memory is located.

This is intended to be used in VM configurations that properly
virtualize NUMA. i.e. VMs with one or more virtual NUMA nodes, each of
which is mapped to a physical NUMA node. With this capability enabled
on such VMs, any guest memory access to virtually-local memory will be
translated through mostly[*] physically-local page tables, regardless
of how the memory was faulted in.

[*] KVM will fall back to allocating from remote NUMA nodes if the
preferred node is out of memory. Also, in VMs with 2 or more NUMA
nodes, higher-level page tables will necessarily map memory across
multiple physical nodes.
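
To make the allocation policy concrete, something along these lines is
what I have in mind (the helper names are made up, this is not the
patch's actual code):

	/*
	 * Hypothetical sketch: allocate a page table page preferably on
	 * the NUMA node of the pfn it maps.
	 */
	static int page_table_nid(kvm_pfn_t pfn)
	{
		/*
		 * Fall back to the current node when the pfn has no
		 * struct page to inspect (e.g. device memory).
		 */
		if (!pfn_valid(pfn))
			return numa_mem_id();

		return page_to_nid(pfn_to_page(pfn));
	}

	static void *alloc_page_table(kvm_pfn_t pfn)
	{
		struct page *pt;

		/*
		 * No __GFP_THISNODE, so the page allocator can fall back
		 * to remote nodes if the preferred node is exhausted.
		 */
		pt = alloc_pages_node(page_table_nid(pfn),
				      GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
		return pt ? page_address(pt) : NULL;
	}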

> So, a multi NUMA node
> +guest, can with high confidence access local memory faster instead of going
> +through remote page tables first.
> +
> +This capability is also helpful for host to reduce live migration impact when
> +splitting huge pages during dirty log operations. If the thread splitting huge
> +page is on remote NUMA node it will create page tables on remote node. Even if
> +guest is careful in making sure that it only access local memory they will end
> +up accessing remote page tables.

Please also cover the limitations of this feature:

- Impact on remote memory accesses (more expensive).
- How KVM handles NUMA node exhaustion.
- How high-level page tables can span multiple nodes.
- What KVM does if it can't determine the NUMA node of the pfn.
- What KVM does for faults on GPAs that aren't backed by a pfn.

> +
> 8. Other capabilities.
> ======================
>
> --
> 2.40.0.rc0.216.gc4246ad0f0-goog
>