Re: [PATCH v3 3/4] x86, boot: Implement ASLR for kernel memory sections (x86_64)

From: Kees Cook
Date: Tue May 10 2016 - 14:53:50 EST


On Tue, May 3, 2016 at 12:31 PM, Thomas Garnier <thgarnie@xxxxxxxxxx> wrote:
> Randomizes the virtual address space of kernel memory sections (physical
> memory mapping, vmalloc & vmemmap) for x86_64. This security feature
> mitigates exploits relying on predictable kernel addresses. These
> addresses can be used to disclose the kernel modules base addresses or
> corrupt specific structures to elevate privileges bypassing the current
> implementation of KASLR. This feature can be enabled with the
> CONFIG_RANDOMIZE_MEMORY option.

I'm struggling to come up with a more accurate name for this, since
it's a base randomization of the kernel memory sections. Everything
else seems needlessly long (CONFIG_RANDOMIZE_BASE_MEMORY). I wonder if
things should be renamed generally to CONFIG_KASLR_BASE,
CONFIG_KASLR_MEMORY, etc, but that doesn't need to be part of this
series. Let's leave this as-is, and just make sure it's clear in the
Kconfig.

> The physical memory mapping holds most allocations from boot and heap
> allocators. Knowning the base address and physical memory size, an
> attacker can deduce the PDE virtual address for the vDSO memory page.
> This attack was demonstrated at CanSecWest 2016, in the "Getting
Physical Extreme Abuse of Intel Based Paged Systems"
> https://goo.gl/ANpWdV (see second part of the presentation). The
> exploits used against Linux worked successfuly against 4.6+ but fail
> with KASLR memory enabled (https://goo.gl/iTtXMJ). Similar research
> was done at Google leading to this patch proposal. Variants exists to
> overwrite /proc or /sys objects ACLs leading to elevation of privileges.
> These variants were testeda against 4.6+.

Typo "tested".

>
> The vmalloc memory section contains the allocation made through the
> vmalloc api. The allocations are done sequentially to prevent
> fragmentation and each allocation address can easily be deduced
> especially from boot.
>
> The vmemmap section holds a representation of the physical
> memory (through a struct page array). An attacker could use this section
> to disclose the kernel memory layout (walking the page linked list).
>
> The order of each memory section is not changed. The feature looks at
> the available space for the sections based on different configuration
> options and randomizes the base and space between each. The size of the
> physical memory mapping is the available physical memory. No performance
> impact was detected while testing the feature.
>
> Entropy is generated using the KASLR early boot functions now shared in
> the lib directory (originally written by Kees Cook). Randomization is
> done on PGD & PUD page table levels to increase possible addresses. The
> physical memory mapping code was adapted to support PUD level virtual
> addresses. An additional low memory page is used to ensure each CPU can
> start with a PGD aligned virtual address (for realmode).
>
> x86/dump_pagetable was updated to correctly display each section.
>
> Updated documentation on x86_64 memory layout accordingly.
>
> Performance data:
>
> Kernbench shows almost no difference (-+ less than 1%):
>
> Before:
>
> Average Optimal load -j 12 Run (std deviation):
> Elapsed Time 102.63 (1.2695)
> User Time 1034.89 (1.18115)
> System Time 87.056 (0.456416)
> Percent CPU 1092.9 (13.892)
> Context Switches 199805 (3455.33)
> Sleeps 97907.8 (900.636)
>
> After:
>
> Average Optimal load -j 12 Run (std deviation):
> Elapsed Time 102.489 (1.10636)
> User Time 1034.86 (1.36053)
> System Time 87.764 (0.49345)
> Percent CPU 1095 (12.7715)
> Context Switches 199036 (4298.1)
> Sleeps 97681.6 (1031.11)
>
> Hackbench shows 0% difference on average (hackbench 90
> repeated 10 times):
>
> attemp,before,after
> 1,0.076,0.069
> 2,0.072,0.069
> 3,0.066,0.066
> 4,0.066,0.068
> 5,0.066,0.067
> 6,0.066,0.069
> 7,0.067,0.066
> 8,0.063,0.067
> 9,0.067,0.065
> 10,0.068,0.071
> average,0.0677,0.0677
>
> Signed-off-by: Thomas Garnier <thgarnie@xxxxxxxxxx>
> ---
> Based on next-20160502
> ---
> Documentation/x86/x86_64/mm.txt | 4 +
> arch/x86/Kconfig | 15 ++++
> arch/x86/include/asm/kaslr.h | 12 +++
> arch/x86/include/asm/page_64_types.h | 11 ++-
> arch/x86/include/asm/pgtable_64.h | 1 +
> arch/x86/include/asm/pgtable_64_types.h | 15 +++-
> arch/x86/kernel/head_64.S | 2 +-
> arch/x86/kernel/setup.c | 3 +
> arch/x86/mm/Makefile | 1 +
> arch/x86/mm/dump_pagetables.c | 11 ++-
> arch/x86/mm/init.c | 4 +
> arch/x86/mm/kaslr.c | 136 ++++++++++++++++++++++++++++++++
> arch/x86/realmode/init.c | 4 +
> 13 files changed, 211 insertions(+), 8 deletions(-)
> create mode 100644 arch/x86/mm/kaslr.c
>
> diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
> index 5aa7383..602a52d 100644
> --- a/Documentation/x86/x86_64/mm.txt
> +++ b/Documentation/x86/x86_64/mm.txt
> @@ -39,4 +39,8 @@ memory window (this size is arbitrary, it can be raised later if needed).
> The mappings are not part of any other kernel PGD and are only available
> during EFI runtime calls.
>
> +Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
> +physical memory, vmalloc/ioremap space and virtual memory map are randomized.
> +Their order is preserved but their base will be changed early at boot time.

Maybe instead of "changed", say "offset"?

> +
> -Andi Kleen, Jul 2004
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0b128b4..60f33c7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1988,6 +1988,21 @@ config PHYSICAL_ALIGN
>
> Don't change this unless you know what you are doing.
>
> +config RANDOMIZE_MEMORY
> + bool "Randomize the kernel memory sections"
> + depends on X86_64
> + depends on RANDOMIZE_BASE

Does this actually _depend_ on RANDOMIZE_BASE? It needs the
lib/kaslr.c code, but this could operate without the kernel base
address having been randomized, correct?

> + default n

As such, maybe the default should be:

default RANDOMIZE_BASE

> + ---help---
> + Randomizes the virtual address of memory sections (physical memory

How about: Randomizes the base virtual address of kernel memory sections ...

> + mapping, vmalloc & vmemmap). This security feature mitigates exploits
> + relying on predictable memory locations.

And "This security feature makes exploits relying on predictable
memory locations less reliable." ?

> +
> + Base and padding between memory section is randomized. Their order is
> + not. Entropy is generated in the same way as RANDOMIZE_BASE.

Since base would be mentioned above and padding is separate, I'd change this to:

The order of allocations remains unchanged. Entropy is generated ...

> +
> + If unsure, say N.
> +
> config HOTPLUG_CPU
> bool "Support for hot-pluggable CPUs"
> depends on SMP
> diff --git a/arch/x86/include/asm/kaslr.h b/arch/x86/include/asm/kaslr.h
> index 2ae1429..12c7742 100644
> --- a/arch/x86/include/asm/kaslr.h
> +++ b/arch/x86/include/asm/kaslr.h
> @@ -3,4 +3,16 @@
>
> unsigned long kaslr_get_random_boot_long(void);
>
> +#ifdef CONFIG_RANDOMIZE_MEMORY
> +extern unsigned long page_offset_base;
> +extern unsigned long vmalloc_base;
> +extern unsigned long vmemmap_base;
> +
> +void kernel_randomize_memory(void);
> +void kaslr_trampoline_init(void);
> +#else
> +static inline void kernel_randomize_memory(void) { }
> +static inline void kaslr_trampoline_init(void) { }
> +#endif /* CONFIG_RANDOMIZE_MEMORY */
> +
> #endif
> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> index d5c2f8b..9215e05 100644
> --- a/arch/x86/include/asm/page_64_types.h
> +++ b/arch/x86/include/asm/page_64_types.h
> @@ -1,6 +1,10 @@
> #ifndef _ASM_X86_PAGE_64_DEFS_H
> #define _ASM_X86_PAGE_64_DEFS_H
>
> +#ifndef __ASSEMBLY__
> +#include <asm/kaslr.h>
> +#endif
> +
> #ifdef CONFIG_KASAN
> #define KASAN_STACK_ORDER 1
> #else
> @@ -32,7 +36,12 @@
> * hypervisor to fit. Choosing 16 slots here is arbitrary, but it's
> * what Xen requires.
> */
> -#define __PAGE_OFFSET _AC(0xffff880000000000, UL)
> +#define __PAGE_OFFSET_BASE _AC(0xffff880000000000, UL)
> +#ifdef CONFIG_RANDOMIZE_MEMORY
> +#define __PAGE_OFFSET page_offset_base
> +#else
> +#define __PAGE_OFFSET __PAGE_OFFSET_BASE
> +#endif /* CONFIG_RANDOMIZE_MEMORY */
>
> #define __START_KERNEL_map _AC(0xffffffff80000000, UL)
>
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 2ee7811..0dfec89 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -21,6 +21,7 @@ extern pmd_t level2_fixmap_pgt[512];
> extern pmd_t level2_ident_pgt[512];
> extern pte_t level1_fixmap_pgt[512];
> extern pgd_t init_level4_pgt[];
> +extern pgd_t trampoline_pgd_entry;
>
> #define swapper_pg_dir init_level4_pgt
>
> diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
> index e6844df..d388739 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -5,6 +5,7 @@
>
> #ifndef __ASSEMBLY__
> #include <linux/types.h>
> +#include <asm/kaslr.h>
>
> /*
> * These are used to make use of C type-checking..
> @@ -54,9 +55,17 @@ typedef struct { pteval_t pte; } pte_t;
>
> /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
> #define MAXMEM _AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
> -#define VMALLOC_START _AC(0xffffc90000000000, UL)
> -#define VMALLOC_END _AC(0xffffe8ffffffffff, UL)
> -#define VMEMMAP_START _AC(0xffffea0000000000, UL)
> +#define VMALLOC_SIZE_TB _AC(32, UL)
> +#define __VMALLOC_BASE _AC(0xffffc90000000000, UL)
> +#define __VMEMMAP_BASE _AC(0xffffea0000000000, UL)
> +#ifdef CONFIG_RANDOMIZE_MEMORY
> +#define VMALLOC_START vmalloc_base
> +#define VMEMMAP_START vmemmap_base
> +#else
> +#define VMALLOC_START __VMALLOC_BASE
> +#define VMEMMAP_START __VMEMMAP_BASE
> +#endif /* CONFIG_RANDOMIZE_MEMORY */
> +#define VMALLOC_END (VMALLOC_START + _AC((VMALLOC_SIZE_TB << 40) - 1, UL))
> #define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
> #define MODULES_END _AC(0xffffffffff000000, UL)
> #define MODULES_LEN (MODULES_END - MODULES_VADDR)
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 5df831e..03a2aa0 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -38,7 +38,7 @@
>
> #define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
>
> -L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET)
> +L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
> L4_START_KERNEL = pgd_index(__START_KERNEL_map)
> L3_START_KERNEL = pud_index(__START_KERNEL_map)
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index c4e7b39..a261658 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -113,6 +113,7 @@
> #include <asm/prom.h>
> #include <asm/microcode.h>
> #include <asm/mmu_context.h>
> +#include <asm/kaslr.h>
>
> /*
> * max_low_pfn_mapped: highest direct mapped pfn under 4GB
> @@ -942,6 +943,8 @@ void __init setup_arch(char **cmdline_p)
>
> x86_init.oem.arch_setup();
>
> + kernel_randomize_memory();
> +
> iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
> setup_memory_map();
> parse_setup_data();
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 62c0043..96d2b84 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -37,4 +37,5 @@ obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
>
> obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
> obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
> +obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
>
> diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
> index 99bfb19..4a03f60 100644
> --- a/arch/x86/mm/dump_pagetables.c
> +++ b/arch/x86/mm/dump_pagetables.c
> @@ -72,9 +72,9 @@ static struct addr_marker address_markers[] = {
> { 0, "User Space" },
> #ifdef CONFIG_X86_64
> { 0x8000000000000000UL, "Kernel Space" },
> - { PAGE_OFFSET, "Low Kernel Mapping" },
> - { VMALLOC_START, "vmalloc() Area" },
> - { VMEMMAP_START, "Vmemmap" },
> + { 0/* PAGE_OFFSET */, "Low Kernel Mapping" },
> + { 0/* VMALLOC_START */, "vmalloc() Area" },
> + { 0/* VMEMMAP_START */, "Vmemmap" },
> # ifdef CONFIG_X86_ESPFIX64
> { ESPFIX_BASE_ADDR, "ESPfix Area", 16 },
> # endif
> @@ -434,6 +434,11 @@ void ptdump_walk_pgd_level_checkwx(void)
>
> static int __init pt_dump_init(void)
> {
> +#ifdef CONFIG_X86_64
> + address_markers[LOW_KERNEL_NR].start_address = PAGE_OFFSET;
> + address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
> + address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
> +#endif
> #ifdef CONFIG_X86_32
> /* Not a compile-time constant on x86-32 */

I'd move this comment above you new ifdef and generalize it to something like:

/* Various markers are not compile-time constants, so assign them here. */

> address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 372aad2..e490624 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -17,6 +17,7 @@
> #include <asm/proto.h>
> #include <asm/dma.h> /* for MAX_DMA_PFN */
> #include <asm/microcode.h>
> +#include <asm/kaslr.h>
>
> /*
> * We need to define the tracepoints somewhere, and tlb.c
> @@ -590,6 +591,9 @@ void __init init_mem_mapping(void)
> /* the ISA range is always mapped regardless of memory holes */
> init_memory_mapping(0, ISA_END_ADDRESS);
>
> + /* Init the trampoline page table if needed for KASLR memory */
> + kaslr_trampoline_init();
> +
> /*
> * If the allocation is in bottom-up direction, we setup direct mapping
> * in bottom-up, otherwise we setup direct mapping in top-down.
> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> new file mode 100644
> index 0000000..3b330a9
> --- /dev/null
> +++ b/arch/x86/mm/kaslr.c
> @@ -0,0 +1,136 @@
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/types.h>
> +#include <linux/mm.h>
> +#include <linux/smp.h>
> +#include <linux/init.h>
> +#include <linux/memory.h>
> +#include <linux/random.h>
> +
> +#include <asm/processor.h>
> +#include <asm/pgtable.h>
> +#include <asm/pgalloc.h>
> +#include <asm/e820.h>
> +#include <asm/init.h>
> +#include <asm/setup.h>
> +#include <asm/kaslr.h>
> +#include <asm/kasan.h>
> +
> +#include "mm_internal.h"
> +
> +/* Hold the pgd entry used on booting additional CPUs */
> +pgd_t trampoline_pgd_entry;
> +
> +#define TB_SHIFT 40
> +
> +/*
> + * Memory base and end randomization is based on different configurations.
> + * We want as much space as possible to increase entropy available.
> + */
> +static const unsigned long memory_rand_start = __PAGE_OFFSET_BASE;
> +
> +#if defined(CONFIG_KASAN)
> +static const unsigned long memory_rand_end = KASAN_SHADOW_START;
> +#elif defined(CONFIG_X86_ESPFIX64)
> +static const unsigned long memory_rand_end = ESPFIX_BASE_ADDR;
> +#elif defined(CONFIG_EFI)
> +static const unsigned long memory_rand_end = EFI_VA_START;
> +#else
> +static const unsigned long memory_rand_end = __START_KERNEL_map;
> +#endif

Is it worth adding BUILD_BUG_ON()s to verify these values stay in
decreasing size?

> +
> +/* Default values */
> +unsigned long page_offset_base = __PAGE_OFFSET_BASE;
> +EXPORT_SYMBOL(page_offset_base);
> +unsigned long vmalloc_base = __VMALLOC_BASE;
> +EXPORT_SYMBOL(vmalloc_base);
> +unsigned long vmemmap_base = __VMEMMAP_BASE;
> +EXPORT_SYMBOL(vmemmap_base);
> +
> +/* Describe each randomized memory sections in sequential order */
> +static struct kaslr_memory_region {
> + unsigned long *base;
> + unsigned short size_tb;
> +} kaslr_regions[] = {
> + { &page_offset_base, 64/* Maximum */ },
> + { &vmalloc_base, VMALLOC_SIZE_TB },
> + { &vmemmap_base, 1 },
> +};

This seems to be __init_data, since it's only used in kernel_randomize_memory()?

> +
> +/* Size in Terabytes + 1 hole */
> +static inline unsigned long get_padding(struct kaslr_memory_region *region)

I think this can be marked __init also?

> +{
> + return ((unsigned long)region->size_tb + 1) << TB_SHIFT;
> +}
> +
> +/* Initialize base and padding for each memory section randomized with KASLR */
> +void __init kernel_randomize_memory(void)
> +{
> + size_t i;
> + unsigned long addr = memory_rand_start;
> + unsigned long padding, rand, mem_tb;
> + struct rnd_state rnd_st;
> + unsigned long remain_padding = memory_rand_end - memory_rand_start;
> +
> + if (!kaslr_enabled())
> + return;
> +
> + BUG_ON(kaslr_regions[0].base != &page_offset_base);

This is statically assigned above, is this BUG_ON useful?

> + mem_tb = ((max_pfn << PAGE_SHIFT) >> TB_SHIFT);
> +
> + if (mem_tb < kaslr_regions[0].size_tb)
> + kaslr_regions[0].size_tb = mem_tb;

Can you add a comment for this? IIUC, this is just recalculating the
max memory size available for padding based on the page shift? Under
what situations would this be changing?

> +
> + for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++)
> + remain_padding -= get_padding(&kaslr_regions[i]);
> +
> + prandom_seed_state(&rnd_st, kaslr_get_random_boot_long());
> +
> + /* Position each section randomly with minimum 1 terabyte between */
> + for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) {
> + padding = remain_padding / (ARRAY_SIZE(kaslr_regions) - i);
> + prandom_bytes_state(&rnd_st, &rand, sizeof(rand));
> + padding = (rand % (padding + 1)) & PUD_MASK;
> + addr += padding;
> + *kaslr_regions[i].base = addr;
> + addr += get_padding(&kaslr_regions[i]);
> + remain_padding -= padding;
> + }

What happens if we run out of padding here, and doesn't this loop mean
earlier regions will have, on average, more padding? Should each
instead randomize within a one-time calculation of remaining_padding /
ARRAY_SIZE(kaslr_regions) ?

Also, to get added to the Kconfig, what is the available entropy here?
How far can each of the base addresses get offset?

> +}
> +
> +/*
> + * Create PGD aligned trampoline table to allow real mode initialization
> + * of additional CPUs. Consume only 1 additonal low memory page.

Typo "additional".

> + */
> +void __meminit kaslr_trampoline_init(void)
> +{
> + unsigned long addr, next;
> + pgd_t *pgd;
> + pud_t *pud_page, *tr_pud_page;
> + int i;
> +
> + /* If KASLR is disabled, default to the existing page table entry */
> + if (!kaslr_enabled()) {
> + trampoline_pgd_entry = init_level4_pgt[pgd_index(PAGE_OFFSET)];
> + return;
> + }
> +
> + tr_pud_page = alloc_low_page();
> + set_pgd(&trampoline_pgd_entry, __pgd(_PAGE_TABLE | __pa(tr_pud_page)));
> +
> + addr = 0;
> + pgd = pgd_offset_k((unsigned long)__va(addr));
> + pud_page = (pud_t *) pgd_page_vaddr(*pgd);
> +
> + for (i = pud_index(addr); i < PTRS_PER_PUD; i++, addr = next) {
> + pud_t *pud, *tr_pud;
> +
> + tr_pud = tr_pud_page + pud_index(addr);
> + pud = pud_page + pud_index((unsigned long)__va(addr));
> + next = (addr & PUD_MASK) + PUD_SIZE;
> +
> + /* Needed to copy pte or pud alike */
> + BUILD_BUG_ON(sizeof(pud_t) != sizeof(pte_t));
> + *tr_pud = *pud;
> + }
> +}
> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
> index 0b7a63d..6518314 100644
> --- a/arch/x86/realmode/init.c
> +++ b/arch/x86/realmode/init.c
> @@ -84,7 +84,11 @@ void __init setup_real_mode(void)
> *trampoline_cr4_features = __read_cr4();
>
> trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
> +#ifdef CONFIG_RANDOMIZE_MEMORY
> + trampoline_pgd[0] = trampoline_pgd_entry.pgd;
> +#else
> trampoline_pgd[0] = init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd;
> +#endif

To avoid this ifdefs, could trampoline_pgd_entry instead be defined
outside of mm/kaslr.c and have .pgd assigned as
init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd via a static inline of
kaslr_trampoline_init() instead?

> trampoline_pgd[511] = init_level4_pgt[511].pgd;
> #endif
> }
> --
> 2.8.0.rc3.226.g39d4020
>

-Kees

--
Kees Cook
Chrome OS & Brillo Security