Re: [RFC PATCH 00/47] Address Space Isolation for KVM

From: Hyeonggon Yoo
Date: Fri Mar 04 2022 - 22:39:22 EST


On Tue, Feb 22, 2022 at 09:21:36PM -0800, Junaid Shahid wrote:
> This patch series is a proof-of-concept RFC for an end-to-end implementation of
> Address Space Isolation for KVM. It has similar goals and a somewhat similar
> high-level design as the original ASI patches from Alexandre Chartre
> ([1],[2],[3],[4]), but with a different underlying implementation. This also
> includes several memory management changes to help with differentiating between
> sensitive and non-sensitive memory and mapping of non-sensitive memory into the
> ASI restricted address spaces.
>
> This RFC is intended as a demonstration of what a full ASI implementation for
> KVM could look like, not necessarily as a direct proposal for what might
> eventually be merged. In particular, these patches do not yet implement KPTI on
> top of ASI, although the framework is generic enough to be able to support it.
> Similarly, these patches do not include non-sensitive annotations for data
> structures that did not get frequently accessed during execution of our test
> workloads, but the framework is designed such that new non-sensitive memory
> annotations can be added trivially.
>
> The patches apply on top of Linux v5.16. These patches are also available via
> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.

[+Cc slab maintainers/reviewers]

Please Cc the relevant people.
Patches 14, 24 and 31 need to be reviewed by the slab people :)

> Background
> ==========
> Address Space Isolation is a comprehensive security mitigation for several types
> of speculative execution attacks. Even though the kernel already has several
> speculative execution vulnerability mitigations, some of them can be quite
> expensive if enabled fully e.g. to fully mitigate L1TF using the existing
> mechanisms requires doing an L1 cache flush on every single VM entry as well as
> disabling hyperthreading altogether. (Although core scheduling can provide some
> protection when hyperthreading is enabled, it is not sufficient by itself to
> protect against all leaks unless sibling hyperthread stunning is also performed
> on every VM exit.) ASI provides a much less expensive mitigation for such
> vulnerabilities while still providing a nearly equivalent level of protection.
>
> There are a couple of basic insights/assumptions behind ASI:
>
> 1. Most execution paths in the kernel (especially during virtual machine
> execution) access only memory that is not particularly sensitive even if it were
> to get leaked to the executing process/VM (setting aside for a moment what
> exactly should be considered sensitive or non-sensitive).
> 2. Even when executing speculatively, the CPU can generally only bring memory
> that is mapped in the current page tables into its various caches and internal
> buffers.
>
> Given these, the idea of using ASI to thwart speculative attacks is that we can
> execute the kernel using a restricted set of page tables most of the time and
> switch to the full unrestricted kernel address space only when the kernel needs
> to access something that is not mapped in the restricted address space. And we
> keep track of when a switch to the full kernel address space is done, so that
> before returning back to the process/VM, we can switch back to the restricted
> address space. In the paths where the kernel is able to execute entirely while
> remaining in the restricted address space, we can skip other mitigations for
> speculative execution attacks (such as L1 cache / micro-arch buffer flushes,
> sibling hyperthread stunning etc.). Only in the cases where we do end up
> switching the page tables, we perform these more expensive mitigations. Assuming
> that happens relatively infrequently, the performance can be significantly
> better compared to performing these mitigations all the time.
>
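
A minimal user-space sketch of the bookkeeping described above, to make the
cost model concrete. It is not taken from the patches: struct sketch_cpu, its
fields and both functions are hypothetical names, and the "flush" is only a
counter standing in for the expensive mitigations.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU; none of these names come from the series. */
struct sketch_cpu {
	bool restricted;       /* running on the restricted page tables? */
	bool left_restricted;  /* set once the full address space was needed */
	unsigned long flushes; /* counts simulated expensive mitigations */
};

/* The kernel touched something that is not mapped in the restricted space. */
static void sketch_switch_to_full(struct sketch_cpu *cpu)
{
	if (cpu->restricted) {
		cpu->restricted = false;
		cpu->left_restricted = true; /* remember that a switch happened */
	}
}

/* Called right before returning to the untrusted process/VM. */
static void sketch_return_to_guest(struct sketch_cpu *cpu)
{
	if (cpu->left_restricted) {
		cpu->flushes++; /* stand-in for L1 flush, buffer clear, etc. */
		cpu->left_restricted = false;
	}
	cpu->restricted = true;
}
```

In this model a return that never left the restricted address space costs
nothing extra, and the counter only grows on the slow path.
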
> Please note that although we do have a sibling hyperthread stunning
> implementation internally, which is fully integrated with KVM-ASI, it is not
> included in this RFC for the time being. The earlier upstream proposal for
> sibling stunning [6] could potentially be integrated into an upstream ASI
> implementation.
>
> Basic concepts
> ==============
> Different types of restricted address spaces are represented by different ASI
> classes. For instance, KVM-ASI is an ASI class used during VM execution. KPTI
> would be another ASI class. An ASI instance (struct asi) represents a single
> restricted address space. There is a separate ASI instance for each untrusted
> context (e.g. a userspace process, a VM, or even a single VCPU etc.) Note that
> there can be multiple untrusted security contexts (and thus multiple restricted
> address spaces) within a single process e.g. in the case of VMs, the userspace
> process is a different security context than the guest VM, and in principle,
> even each VCPU could be considered a separate security context (That would be
> primarily useful for securing nested virtualization).
>
> In this RFC, a process can have at most one ASI instance of each class, though
> this is not an inherent limitation and multiple instances of the same class
> should eventually be supported. (A process can still have ASI instances of
> different classes e.g. KVM-ASI and KPTI.) In fact, in principle, it is not even
> entirely necessary to tie an ASI instance to a process. That is just a
> simplification for the initial implementation.
>
> An asi_enter operation switches into the restricted address space represented by
> the given ASI instance. An asi_exit operation switches to the full unrestricted
> kernel address space. Each ASI class can provide hooks to be executed during
> these operations, which can be used to perform speculative attack mitigations
> relevant to that class. For instance, the KVM-ASI hooks would perform a
> sibling-hyperthread-stun operation in the asi_exit hook, and L1-flush/MDS-clear
> and sibling-hyperthread-unstun operations in the asi_enter hook. On the other
> hand, the hooks for the KPTI class would be NO-OP, since the switching of the
> page tables is enough mitigation in that case.
>
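
The per-class hooks described above could be modelled roughly as below. This
is only a sketch under stated assumptions: the struct layout, the hook names
and the ordering of the hook relative to the page-table switch are all
hypothetical and not copied from the series. The counters stand in for the
real stun/unstun/flush operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counters standing in for the real mitigation operations. */
struct sketch_events { int stun, unstun, flush; };
static struct sketch_events sketch_log;

/* Hypothetical ASI class: a name plus optional enter/exit hooks. */
struct sketch_asi_class {
	const char *name;
	void (*on_enter)(void);
	void (*on_exit)(void);
};

/* Hypothetical ASI instance: one restricted address space of a class. */
struct sketch_asi {
	const struct sketch_asi_class *cls;
	bool active;
};

static void kvm_like_on_exit(void)  { sketch_log.stun++; }
static void kvm_like_on_enter(void) { sketch_log.flush++; sketch_log.unstun++; }

static const struct sketch_asi_class sketch_kvm_class = {
	"kvm-like", kvm_like_on_enter, kvm_like_on_exit
};
/* A KPTI-like class needs no hooks: the page-table switch is the mitigation. */
static const struct sketch_asi_class sketch_kpti_class = {
	"kpti-like", NULL, NULL
};

static void sketch_asi_enter(struct sketch_asi *asi)
{
	if (asi->active)
		return;
	asi->active = true; /* the page-table switch would happen here */
	if (asi->cls->on_enter)
		asi->cls->on_enter();
}

static void sketch_asi_exit(struct sketch_asi *asi)
{
	if (!asi->active)
		return;
	if (asi->cls->on_exit)
		asi->cls->on_exit();
	asi->active = false; /* back on the full kernel page tables */
}
```
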
> If the kernel attempts to access memory that is not mapped in the currently
> active ASI instance, the page fault handler automatically performs an asi_exit
> operation. This means that except for a few critical pieces of memory, leaving
> something out of a restricted address space will result in only a performance
> hit, rather than a catastrophic failure. The kernel can also perform explicit
> asi_exit operations in some paths as needed.
>
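
A toy model of the fault-driven exit described above, assuming a 64-page
address space tracked as a bitmap. Everything here (struct sketch_space,
sketch_access() and so on) is hypothetical and only illustrates why an
unmapped access costs performance rather than correctness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NPAGES 64u

/* Hypothetical model: bit n set means page n is mapped when restricted. */
struct sketch_space {
	uint64_t restricted_map;
	bool restricted; /* currently in the restricted address space? */
	unsigned exits;  /* number of implicit exits taken so far */
};

static bool sketch_mapped(const struct sketch_space *s, unsigned page)
{
	return page < SKETCH_NPAGES && ((s->restricted_map >> page) & 1u);
}

/*
 * Model of a kernel access. Returns true if the access "faulted" and
 * forced an implicit exit to the full address space. The access itself
 * always succeeds afterwards, so the only cost is the slow path.
 */
static bool sketch_access(struct sketch_space *s, unsigned page)
{
	if (s->restricted && !sketch_mapped(s, page)) {
		s->restricted = false; /* what the fault handler would do */
		s->exits++;
		return true;
	}
	return false;
}
```
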
> Apart from the page fault handler, other exceptions and interrupts (even NMIs)
> do not automatically cause an asi_exit and could potentially be executed
> completely within a restricted address space if they don't end up accessing any
> sensitive piece of memory.
>
> The mappings within a restricted address space are always a subset of the full
> kernel address space and each mapping is always the same as the corresponding
> mapping in the full kernel address space. This is necessary because we could
> potentially end up performing an asi_exit at any point.
>
> Although this RFC only includes an implementation of the KVM-ASI class, a KPTI
> class could also be implemented on top of the same infrastructure. Furthermore,
> in the future we could also implement a KPTI-Next class that actually uses the
> ASI model for userspace processes i.e. mapping non-sensitive kernel memory in
> the restricted address space and trying to execute most syscalls/interrupts
> without switching to the full kernel address space, as opposed to the current
> KPTI which requires an address space switch on every kernel/user mode
> transition.
>
> Memory classification
> =====================
> We divide memory into three categories.
>
> 1. Sensitive memory
> This is memory that should never get leaked to any process or VM. Sensitive
> memory is only mapped in the unrestricted kernel page tables. By default, all
> memory is considered sensitive unless specifically categorized otherwise.
>
> 2. Globally non-sensitive memory
> This is memory that does not present a substantial security threat even if it
> were to get leaked to any process or VM in the system. Globally non-sensitive
> memory is mapped in the restricted address spaces for all processes.
>
> 3. Locally non-sensitive memory
> This is memory that does not present a substantial security threat if it were to
> get leaked to the currently running process or VM, but would present a security
> issue if it were to get leaked to any other process or VM in the system.
> Examples include userspace memory (or guest memory in the case of VMs) or kernel
> structures containing userspace/guest register context etc. Locally
> non-sensitive memory is mapped only in the restricted address space of a single
> process.
>
> Various mechanisms are provided to annotate different types of memory (static,
> buddy allocator, slab, vmalloc etc.) as globally or locally non-sensitive. In
> addition, the ASI infrastructure takes care to ensure that different classes of
> memory do not share the same physical page. This includes separation of
> sensitive, globally non-sensitive and locally non-sensitive memory into
> different pages and also separation of locally non-sensitive memory for
> different processes into different pages as well.
>
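
The three categories above boil down to a simple visibility rule, sketched
here in user-space C. The enum, the struct and the owner/pid fields are
hypothetical names chosen for illustration; the series tracks this through
page tables and allocator annotations, not through a lookup like this.

```c
#include <assert.h>
#include <stdbool.h>

/* Zero value is "sensitive", matching the default classification. */
enum sketch_sensitivity {
	SKETCH_SENSITIVE = 0,
	SKETCH_GLOBAL_NONSENSITIVE,
	SKETCH_LOCAL_NONSENSITIVE,
};

/* Hypothetical per-page record; each page holds exactly one class. */
struct sketch_page {
	enum sketch_sensitivity cls;
	int owner; /* owning process, only meaningful for local pages */
};

/* Is this page mapped in the restricted address space of process pid? */
static bool sketch_visible_in_restricted(const struct sketch_page *pg, int pid)
{
	switch (pg->cls) {
	case SKETCH_GLOBAL_NONSENSITIVE:
		return true;             /* mapped for every process */
	case SKETCH_LOCAL_NONSENSITIVE:
		return pg->owner == pid; /* mapped for the owner only */
	case SKETCH_SENSITIVE:
	default:
		return false;            /* only in the unrestricted tables */
	}
}
```
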
> What exactly should be considered non-sensitive (either globally or locally) is
> somewhat open-ended. Some things are clearly sensitive or non-sensitive, but
> many things also fall into a gray area, depending on how paranoid one wants to
> be. For this proof of concept, we have generally treated such things as
> non-sensitive, though that may not necessarily be the ideal classification in
> each case. Similarly, there is also a gray area between globally and locally
> non-sensitive classifications in some cases, and in those cases this RFC has
> mostly erred on the side of marking them as locally non-sensitive, even though
> many of those cases could likely be safely classified as globally non-sensitive.
>
> Although this implementation includes fairly extensive support for marking most
> types of dynamically allocated memory as locally non-sensitive, it is possibly
> feasible, at least for KVM-ASI, to get away with a simpler implementation (such
> as [5]), if we are very selective about what memory we treat as locally
> non-sensitive (as opposed to globally non-sensitive). Nevertheless, the more
> general mechanism is included in this proof of concept as an illustration for
> what could be done if we really needed to treat any arbitrary kernel memory as
> locally non-sensitive.
>
> It is also possible to have ASI classes that do not utilize the above described
> infrastructure and instead manage all the memory mappings inside the restricted
> address space on their own.
>
>
> References
> ==========
> [1] https://lore.kernel.org/lkml/1557758315-12667-1-git-send-email-alexandre.chartre@xxxxxxxxxx
> [2] https://lore.kernel.org/lkml/1562855138-19507-1-git-send-email-alexandre.chartre@xxxxxxxxxx
> [3] https://lore.kernel.org/lkml/1582734120-26757-1-git-send-email-alexandre.chartre@xxxxxxxxxx
> [4] https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.chartre@xxxxxxxxxx
> [5] https://lore.kernel.org/lkml/20190612170834.14855-1-mhillenb@xxxxxxxxx
> [6] https://lore.kernel.org/lkml/20200815031908.1015049-1-joel@xxxxxxxxxxxxxxxxx
>
> Cc: Paul Turner <pjt@xxxxxxxxxx>
> Cc: Jim Mattson <jmattson@xxxxxxxxxx>
> Cc: Alexandre Chartre <alexandre.chartre@xxxxxxxxxx>
> Cc: Mike Rapoport <rppt@xxxxxxxxxxxxx>
> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
>
>
> Junaid Shahid (32):
> mm: asi: Introduce ASI core API
> mm: asi: Add command-line parameter to enable/disable ASI
> mm: asi: Switch to unrestricted address space when entering scheduler
> mm: asi: ASI support in interrupts/exceptions
> mm: asi: Make __get_current_cr3_fast() ASI-aware
> mm: asi: ASI page table allocation and free functions
> mm: asi: Functions to map/unmap a memory range into ASI page tables
> mm: asi: Add basic infrastructure for global non-sensitive mappings
> mm: Add __PAGEFLAG_FALSE
> mm: asi: Support for global non-sensitive direct map allocations
> mm: asi: Global non-sensitive vmalloc/vmap support
> mm: asi: Support for global non-sensitive slab caches
> mm: asi: Disable ASI API when ASI is not enabled for a process
> kvm: asi: Restricted address space for VM execution
> mm: asi: Support for mapping non-sensitive pcpu chunks
> mm: asi: Aliased direct map for local non-sensitive allocations
> mm: asi: Support for pre-ASI-init local non-sensitive allocations
> mm: asi: Support for locally nonsensitive page allocations
> mm: asi: Support for locally non-sensitive vmalloc allocations
> mm: asi: Add support for locally non-sensitive VM_USERMAP pages
> mm: asi: Add support for mapping all userspace memory into ASI
> mm: asi: Support for local non-sensitive slab caches
> mm: asi: Avoid warning from NMI userspace accesses in ASI context
> mm: asi: Use separate PCIDs for restricted address spaces
> mm: asi: Avoid TLB flushes during ASI CR3 switches when possible
> mm: asi: Avoid TLB flush IPIs to CPUs not in ASI context
> mm: asi: Reduce TLB flushes when freeing pages asynchronously
> mm: asi: Add API for mapping userspace address ranges
> mm: asi: Support for non-sensitive SLUB caches
> x86: asi: Allocate FPU state separately when ASI is enabled.
> kvm: asi: Map guest memory into restricted ASI address space
> kvm: asi: Unmap guest memory from ASI address space when using nested
> virt
>
> Ofir Weisse (15):
> asi: Added ASI memory cgroup flag
> mm: asi: Added refcounting when initilizing an asi
> mm: asi: asi_exit() on PF, skip handling if address is accessible
> mm: asi: Adding support for dynamic percpu ASI allocations
> mm: asi: ASI annotation support for static variables.
> mm: asi: ASI annotation support for dynamic modules.
> mm: asi: Skip conventional L1TF/MDS mitigations
> mm: asi: support for static percpu DEFINE_PER_CPU*_ASI
> mm: asi: Annotation of static variables to be nonsensitive
> mm: asi: Annotation of PERCPU variables to be nonsensitive
> mm: asi: Annotation of dynamic variables to be nonsensitive
> kvm: asi: Splitting kvm_vcpu_arch into non/sensitive parts
> mm: asi: Mapping global nonsensitive areas in asi_global_init
> kvm: asi: Do asi_exit() in vcpu_run loop before returning to userspace
> mm: asi: Properly un/mapping task stack from ASI + tlb flush
>
> arch/alpha/include/asm/Kbuild | 1 +
> arch/arc/include/asm/Kbuild | 1 +
> arch/arm/include/asm/Kbuild | 1 +
> arch/arm64/include/asm/Kbuild | 1 +
> arch/csky/include/asm/Kbuild | 1 +
> arch/h8300/include/asm/Kbuild | 1 +
> arch/hexagon/include/asm/Kbuild | 1 +
> arch/ia64/include/asm/Kbuild | 1 +
> arch/m68k/include/asm/Kbuild | 1 +
> arch/microblaze/include/asm/Kbuild | 1 +
> arch/mips/include/asm/Kbuild | 1 +
> arch/nds32/include/asm/Kbuild | 1 +
> arch/nios2/include/asm/Kbuild | 1 +
> arch/openrisc/include/asm/Kbuild | 1 +
> arch/parisc/include/asm/Kbuild | 1 +
> arch/powerpc/include/asm/Kbuild | 1 +
> arch/riscv/include/asm/Kbuild | 1 +
> arch/s390/include/asm/Kbuild | 1 +
> arch/sh/include/asm/Kbuild | 1 +
> arch/sparc/include/asm/Kbuild | 1 +
> arch/um/include/asm/Kbuild | 1 +
> arch/x86/events/core.c | 6 +-
> arch/x86/events/intel/bts.c | 2 +-
> arch/x86/events/intel/core.c | 2 +-
> arch/x86/events/msr.c | 2 +-
> arch/x86/events/perf_event.h | 4 +-
> arch/x86/include/asm/asi.h | 215 ++++
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/current.h | 2 +-
> arch/x86/include/asm/debugreg.h | 2 +-
> arch/x86/include/asm/desc.h | 2 +-
> arch/x86/include/asm/disabled-features.h | 8 +-
> arch/x86/include/asm/fpu/api.h | 3 +-
> arch/x86/include/asm/hardirq.h | 2 +-
> arch/x86/include/asm/hw_irq.h | 2 +-
> arch/x86/include/asm/idtentry.h | 25 +-
> arch/x86/include/asm/kvm_host.h | 124 +-
> arch/x86/include/asm/page.h | 19 +-
> arch/x86/include/asm/page_64.h | 27 +-
> arch/x86/include/asm/page_64_types.h | 20 +
> arch/x86/include/asm/percpu.h | 2 +-
> arch/x86/include/asm/pgtable_64_types.h | 10 +
> arch/x86/include/asm/preempt.h | 2 +-
> arch/x86/include/asm/processor.h | 17 +-
> arch/x86/include/asm/smp.h | 2 +-
> arch/x86/include/asm/tlbflush.h | 49 +-
> arch/x86/include/asm/topology.h | 2 +-
> arch/x86/kernel/alternative.c | 2 +-
> arch/x86/kernel/apic/apic.c | 2 +-
> arch/x86/kernel/apic/x2apic_cluster.c | 8 +-
> arch/x86/kernel/cpu/bugs.c | 2 +-
> arch/x86/kernel/cpu/common.c | 12 +-
> arch/x86/kernel/e820.c | 7 +-
> arch/x86/kernel/fpu/core.c | 47 +-
> arch/x86/kernel/fpu/init.c | 7 +-
> arch/x86/kernel/fpu/internal.h | 1 +
> arch/x86/kernel/fpu/xstate.c | 21 +-
> arch/x86/kernel/head_64.S | 12 +
> arch/x86/kernel/hw_breakpoint.c | 2 +-
> arch/x86/kernel/irq.c | 2 +-
> arch/x86/kernel/irqinit.c | 2 +-
> arch/x86/kernel/nmi.c | 6 +-
> arch/x86/kernel/process.c | 13 +-
> arch/x86/kernel/setup.c | 4 +-
> arch/x86/kernel/setup_percpu.c | 4 +-
> arch/x86/kernel/smp.c | 2 +-
> arch/x86/kernel/smpboot.c | 3 +-
> arch/x86/kernel/traps.c | 2 +
> arch/x86/kernel/tsc.c | 10 +-
> arch/x86/kernel/vmlinux.lds.S | 2 +-
> arch/x86/kvm/cpuid.c | 18 +-
> arch/x86/kvm/kvm_cache_regs.h | 22 +-
> arch/x86/kvm/lapic.c | 11 +-
> arch/x86/kvm/mmu.h | 16 +-
> arch/x86/kvm/mmu/mmu.c | 209 ++--
> arch/x86/kvm/mmu/mmu_internal.h | 2 +-
> arch/x86/kvm/mmu/paging_tmpl.h | 40 +-
> arch/x86/kvm/mmu/spte.c | 6 +-
> arch/x86/kvm/mmu/spte.h | 2 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 14 +-
> arch/x86/kvm/mtrr.c | 2 +-
> arch/x86/kvm/svm/nested.c | 34 +-
> arch/x86/kvm/svm/sev.c | 70 +-
> arch/x86/kvm/svm/svm.c | 52 +-
> arch/x86/kvm/trace.h | 10 +-
> arch/x86/kvm/vmx/capabilities.h | 14 +-
> arch/x86/kvm/vmx/nested.c | 90 +-
> arch/x86/kvm/vmx/vmx.c | 152 ++-
> arch/x86/kvm/x86.c | 315 +++--
> arch/x86/kvm/x86.h | 4 +-
> arch/x86/mm/Makefile | 1 +
> arch/x86/mm/asi.c | 1397 ++++++++++++++++++++++
> arch/x86/mm/fault.c | 67 +-
> arch/x86/mm/init.c | 7 +-
> arch/x86/mm/init_64.c | 26 +-
> arch/x86/mm/kaslr.c | 34 +-
> arch/x86/mm/mm_internal.h | 5 +
> arch/x86/mm/physaddr.c | 8 +
> arch/x86/mm/tlb.c | 419 ++++++-
> arch/xtensa/include/asm/Kbuild | 1 +
> fs/binfmt_elf.c | 2 +-
> fs/eventfd.c | 2 +-
> fs/eventpoll.c | 10 +-
> fs/exec.c | 7 +
> fs/file.c | 3 +-
> fs/timerfd.c | 2 +-
> include/asm-generic/asi.h | 149 +++
> include/asm-generic/irq_regs.h | 2 +-
> include/asm-generic/percpu.h | 6 +
> include/asm-generic/vmlinux.lds.h | 36 +-
> include/linux/arch_topology.h | 2 +-
> include/linux/debug_locks.h | 4 +-
> include/linux/gfp.h | 13 +-
> include/linux/hrtimer.h | 2 +-
> include/linux/interrupt.h | 2 +-
> include/linux/jiffies.h | 4 +-
> include/linux/kernel_stat.h | 4 +-
> include/linux/kvm_host.h | 7 +-
> include/linux/kvm_types.h | 3 +
> include/linux/memcontrol.h | 3 +
> include/linux/mm_types.h | 59 +
> include/linux/module.h | 15 +
> include/linux/notifier.h | 2 +-
> include/linux/page-flags.h | 19 +
> include/linux/percpu-defs.h | 39 +
> include/linux/percpu.h | 8 +-
> include/linux/pgtable.h | 3 +
> include/linux/prandom.h | 2 +-
> include/linux/profile.h | 2 +-
> include/linux/rcupdate.h | 4 +-
> include/linux/rcutree.h | 2 +-
> include/linux/sched.h | 5 +
> include/linux/sched/mm.h | 12 +
> include/linux/sched/sysctl.h | 1 +
> include/linux/slab.h | 68 +-
> include/linux/slab_def.h | 4 +
> include/linux/slub_def.h | 6 +
> include/linux/vmalloc.h | 16 +-
> include/trace/events/mmflags.h | 14 +-
> init/main.c | 2 +-
> kernel/cgroup/cgroup.c | 9 +-
> kernel/cpu.c | 14 +-
> kernel/entry/common.c | 6 +
> kernel/events/core.c | 25 +-
> kernel/exit.c | 2 +
> kernel/fork.c | 69 +-
> kernel/freezer.c | 2 +-
> kernel/irq_work.c | 6 +-
> kernel/locking/lockdep.c | 14 +-
> kernel/module-internal.h | 1 +
> kernel/module.c | 210 +++-
> kernel/panic.c | 2 +-
> kernel/printk/printk.c | 4 +-
> kernel/profile.c | 4 +-
> kernel/rcu/srcutree.c | 3 +-
> kernel/rcu/tree.c | 12 +-
> kernel/rcu/update.c | 4 +-
> kernel/sched/clock.c | 2 +-
> kernel/sched/core.c | 23 +-
> kernel/sched/cpuacct.c | 10 +-
> kernel/sched/cpufreq.c | 3 +-
> kernel/sched/cputime.c | 4 +-
> kernel/sched/fair.c | 7 +-
> kernel/sched/loadavg.c | 2 +-
> kernel/sched/rt.c | 2 +-
> kernel/sched/sched.h | 25 +-
> kernel/sched/topology.c | 28 +-
> kernel/smp.c | 26 +-
> kernel/softirq.c | 5 +-
> kernel/time/hrtimer.c | 4 +-
> kernel/time/jiffies.c | 8 +-
> kernel/time/ntp.c | 30 +-
> kernel/time/tick-common.c | 6 +-
> kernel/time/tick-internal.h | 6 +-
> kernel/time/tick-sched.c | 4 +-
> kernel/time/timekeeping.c | 10 +-
> kernel/time/timekeeping.h | 2 +-
> kernel/time/timer.c | 4 +-
> kernel/trace/ring_buffer.c | 5 +-
> kernel/trace/trace.c | 4 +-
> kernel/trace/trace_preemptirq.c | 2 +-
> kernel/trace/trace_sched_switch.c | 4 +-
> kernel/tracepoint.c | 2 +-
> kernel/watchdog.c | 12 +-
> lib/debug_locks.c | 5 +-
> lib/irq_regs.c | 2 +-
> lib/radix-tree.c | 6 +-
> lib/random32.c | 3 +-
> mm/init-mm.c | 2 +
> mm/internal.h | 3 +
> mm/memcontrol.c | 37 +-
> mm/memory.c | 4 +-
> mm/page_alloc.c | 204 +++-
> mm/percpu-internal.h | 23 +-
> mm/percpu-km.c | 5 +-
> mm/percpu-vm.c | 57 +-
> mm/percpu.c | 273 ++++-
> mm/slab.c | 42 +-
> mm/slab.h | 166 ++-
> mm/slab_common.c | 461 ++++++-
> mm/slub.c | 140 ++-
> mm/sparse.c | 4 +-
> mm/util.c | 3 +-
> mm/vmalloc.c | 193 ++-
> net/core/skbuff.c | 2 +-
> net/core/sock.c | 2 +-
> security/Kconfig | 12 +
> tools/perf/builtin-kmem.c | 2 +
> virt/kvm/coalesced_mmio.c | 2 +-
> virt/kvm/eventfd.c | 5 +-
> virt/kvm/kvm_main.c | 61 +-
> 211 files changed, 5727 insertions(+), 959 deletions(-)
> create mode 100644 arch/x86/include/asm/asi.h
> create mode 100644 arch/x86/mm/asi.c
> create mode 100644 include/asm-generic/asi.h
>
> --
> 2.35.1.473.g83b2b277ed-goog
>
>

--
Thank you, You are awesome!
Hyeonggon :-)