[PATCH mm-unstable v1 0/5] mm/kvm: lockless accessed bit harvest

From: Yu Zhao
Date: Thu Feb 16 2023 - 23:12:41 EST


TLDR
====
This patchset RCU-protects KVM page tables and compare-and-exchanges
KVM PTEs with the accessed bit set by hardware. It significantly
improves the performance of guests when the host is under heavy
memory pressure.

ChromeOS has been using a similar approach [1] since mid 2021 and it
has proven successful on tens of millions of devices.

[1] https://crrev.com/c/2987928

Overview
========
The goal of this patchset is to optimize the performance of guests
when the host memory is overcommitted. It focuses on the vast
majority of VMs that are not nested and run on hardware that sets the
accessed bit in KVM page tables.

Note that nested VMs and hardware that does not support the accessed
bit are both out of scope.

This patchset relies on two techniques, RCU and cmpxchg, to safely
test and clear the accessed bit without taking kvm->mmu_lock. The
former protects KVM page tables from being freed while the latter
clears the accessed bit atomically against both hardware and other
software page table walkers.
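
The cmpxchg half of this can be sketched in userspace C11. This is a
minimal model, not code from the patchset: DEMO_PTE_ACCESSED and
demo_test_clear_young() are hypothetical names, the bit position is
arbitrary, and the RCU protection that keeps the page table alive during
the walk is assumed rather than modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit position; real PTE layouts are architecture-specific. */
#define DEMO_PTE_ACCESSED (UINT64_C(1) << 5)

/*
 * Test, and optionally clear, the accessed bit in *ptep without a lock.
 * Returns true if the bit was observed set. The caller is assumed to
 * keep the table alive (in the kernel, by holding an RCU read lock).
 */
static bool demo_test_clear_young(_Atomic uint64_t *ptep, bool clear)
{
	uint64_t old;

	assert(ptep != NULL);
	old = atomic_load_explicit(ptep, memory_order_relaxed);

	while (old & DEMO_PTE_ACCESSED) {
		if (!clear)
			return true;
		/*
		 * Only the accessed bit changes. If hardware or another
		 * walker modified the entry in the meantime, the exchange
		 * fails, 'old' is refreshed, and the loop re-checks the bit.
		 */
		if (atomic_compare_exchange_weak_explicit(ptep, &old,
				old & ~DEMO_PTE_ACCESSED,
				memory_order_relaxed, memory_order_relaxed))
			return true;
	}

	return false;
}
```

Because the exchange only succeeds against the value that was sampled, a
concurrent update to any other bit of the entry is never overwritten.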

A new MMU notifier API, mmu_notifier_test_clear_young(), is
introduced. It follows two design patterns: fallback and batching.
For any unsupported cases, it can optionally fall back to
mmu_notifier_ops->clear_young(). For a range of KVM PTEs, it can test
or test and clear their accessed bits according to a bitmap provided
by the caller.
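
One possible reading of the fallback and batching patterns, as a
userspace C11 sketch. Everything here is an assumption made for
illustration: the demo_* names, the bit position, and in particular the
bitmap convention (a set bit on input selects an entry, and it stays set
on output only if that entry was young). The real API in patch 1 may
differ.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PTE_ACCESSED  (UINT64_C(1) << 5)	/* hypothetical layout */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Stand-in for the slower clear_young() style path; may be NULL. */
typedef bool (*demo_fallback_fn)(_Atomic uint64_t *ptes, size_t n);

static bool demo_test_bit(const unsigned long *map, size_t i)
{
	return (map[i / DEMO_BITS_PER_LONG] >> (i % DEMO_BITS_PER_LONG)) & 1UL;
}

static void demo_clear_bit(unsigned long *map, size_t i)
{
	map[i / DEMO_BITS_PER_LONG] &= ~(1UL << (i % DEMO_BITS_PER_LONG));
}

/* Batching: one call visits every selected entry in the range. */
static size_t demo_test_clear_young_range(_Atomic uint64_t *ptes, size_t n,
					  unsigned long *bitmap, bool clear)
{
	size_t young = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t old;

		if (!demo_test_bit(bitmap, i))
			continue;	/* caller did not ask about this entry */
		if (clear)
			old = atomic_fetch_and_explicit(&ptes[i],
					~DEMO_PTE_ACCESSED, memory_order_relaxed);
		else
			old = atomic_load_explicit(&ptes[i], memory_order_relaxed);
		if (old & DEMO_PTE_ACCESSED)
			young++;
		else
			demo_clear_bit(bitmap, i);	/* report "not young" */
	}
	return young;
}

/* Models the slow path: clears every entry in the range, ignores the bitmap. */
static bool demo_clear_young_slow(_Atomic uint64_t *ptes, size_t n)
{
	bool young = false;

	for (size_t i = 0; i < n; i++)
		if (atomic_fetch_and(&ptes[i], ~DEMO_PTE_ACCESSED) & DEMO_PTE_ACCESSED)
			young = true;
	return young;
}

/* Fallback: use the fast path when supported, else the optional slow path. */
static bool demo_notifier_test_clear_young(bool fast_supported,
					   _Atomic uint64_t *ptes, size_t n,
					   unsigned long *bitmap, bool clear,
					   demo_fallback_fn fallback)
{
	if (fast_supported)
		return demo_test_clear_young_range(ptes, n, bitmap, clear) > 0;
	return fallback ? fallback(ptes, n) : false;
}
```

Passing NULL as the fallback models a caller that prefers to skip
unsupported cases rather than pay for the slow path.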

This patchset only applies mmu_notifier_test_clear_young() to MGLRU.
A follow-up patchset will apply it to /proc/PID/pagemap and
/proc/PID/clear_refs.

Evaluation
==========
An existing selftest can quickly demonstrate the effectiveness of
this patchset. On a generic workstation equipped with 64 CPUs and
256GB DRAM:

$ sudo max_guest_memory_test -c 64 -m 256 -s 256

MGLRU      run2
---------------
Before    ~600s
After      ~50s
Off       ~250s

kswapd (MGLRU before)
100.00% balance_pgdat
  100.00% shrink_node
    100.00% shrink_one
      99.97% try_to_shrink_lruvec
        99.06% evict_folios
          97.41% shrink_folio_list
            31.33% folio_referenced
              31.06% rmap_walk_file
                30.89% folio_referenced_one
                  20.83% __mmu_notifier_clear_flush_young
                    20.54% kvm_mmu_notifier_clear_flush_young
                      => 19.34% _raw_write_lock

kswapd (MGLRU after)
100.00% balance_pgdat
  100.00% shrink_node
    100.00% shrink_one
      99.97% try_to_shrink_lruvec
        99.51% evict_folios
          71.70% shrink_folio_list
            7.08% folio_referenced
              6.78% rmap_walk_file
                6.72% folio_referenced_one
                  5.60% lru_gen_look_around
                    => 1.53% __mmu_notifier_test_clear_young

kswapd (MGLRU off)
100.00% balance_pgdat
  100.00% shrink_node
    99.92% shrink_lruvec
      69.95% shrink_folio_list
        19.35% folio_referenced
          18.37% rmap_walk_file
            17.88% folio_referenced_one
              13.20% __mmu_notifier_clear_flush_young
                11.64% kvm_mmu_notifier_clear_flush_young
                  => 9.93% _raw_write_lock
      26.23% shrink_active_list
        25.50% folio_referenced
          25.35% rmap_walk_file
            25.28% folio_referenced_one
              23.87% __mmu_notifier_clear_flush_young
                23.69% kvm_mmu_notifier_clear_flush_young
                  => 18.98% _raw_write_lock

Comprehensive benchmarks are coming soon.

Yu Zhao (5):
mm/kvm: add mmu_notifier_test_clear_young()
kvm/x86: add kvm_arch_test_clear_young()
kvm/arm64: add kvm_arch_test_clear_young()
kvm/powerpc: add kvm_arch_test_clear_young()
mm: multi-gen LRU: use mmu_notifier_test_clear_young()

arch/arm64/include/asm/kvm_host.h | 7 ++
arch/arm64/include/asm/kvm_pgtable.h | 8 ++
arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hyp/pgtable.c | 51 ++--------
arch/arm64/kvm/mmu.c | 77 +++++++++++++-
arch/powerpc/include/asm/kvm_host.h | 18 ++++
arch/powerpc/include/asm/kvm_ppc.h | 14 +--
arch/powerpc/kvm/book3s.c | 7 ++
arch/powerpc/kvm/book3s.h | 2 +
arch/powerpc/kvm/book3s_64_mmu_radix.c | 78 ++++++++++++++-
arch/powerpc/kvm/book3s_hv.c | 10 +-
arch/x86/include/asm/kvm_host.h | 27 +++++
arch/x86/kvm/mmu/spte.h | 12 ---
arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++
include/linux/kvm_host.h | 29 ++++++
include/linux/mmu_notifier.h | 40 ++++++++
include/linux/mmzone.h | 6 +-
mm/mmu_notifier.c | 26 +++++
mm/rmap.c | 8 +-
mm/vmscan.c | 127 +++++++++++++++++++++---
virt/kvm/kvm_main.c | 58 +++++++++++
22 files changed, 593 insertions(+), 97 deletions(-)

--
2.39.2.637.g21b0678d19-goog