Re: [PATCH v4] mm/vmalloc: randomize vmalloc() allocations

From: Uladzislau Rezki
Date: Sun Mar 14 2021 - 13:30:13 EST


> Memory mappings inside kernel allocated with vmalloc() are in
> predictable order and packed tightly toward the low addresses, except
> for per-cpu areas which start from top of the vmalloc area. With
> new kernel boot parameter 'randomize_vmalloc=1', the entire area is
> used randomly to make the allocations less predictable and harder to
> guess for attackers. Also module and BPF code locations get randomized
> (within their dedicated and rather small area though) and if
> CONFIG_VMAP_STACK is enabled, also kernel thread stack locations.
>
> On 32 bit systems this may cause problems due to increased VM
> fragmentation if the address space gets crowded.
>
> On all systems, it will reduce performance and increase memory and
> cache usage due to less efficient use of page tables and inability to
> merge adjacent VMAs with compatible attributes. On x86_64 with 5 level
> page tables, in the worst case, additional page table entries of up to
> 4 pages are created for each mapping, so with small mappings there's
> considerable penalty.
>
> Without randomize_vmalloc=1:
> $ grep -v kernel_clone /proc/vmallocinfo
> 0xffffc90000000000-0xffffc90000009000 36864 irq_init_percpu_irqstack+0x176/0x1c0 vmap
> 0xffffc90000009000-0xffffc9000000b000 8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe1000 ioremap
> 0xffffc9000000c000-0xffffc9000000f000 12288 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe0000 ioremap
> 0xffffc9000000f000-0xffffc90000011000 8192 hpet_enable+0x31/0x4a4 phys=0x00000000fed00000 ioremap
> 0xffffc90000011000-0xffffc90000013000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc90000013000-0xffffc90000015000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc90000015000-0xffffc90000017000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc90000021000-0xffffc90000023000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc90000023000-0xffffc90000025000 8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x00000000fed00000 ioremap
> 0xffffc90000025000-0xffffc90000027000 8192 memremap+0x19c/0x280 phys=0x00000000000f5000 ioremap
> 0xffffc90000031000-0xffffc90000036000 20480 pcpu_create_chunk+0xe8/0x260 pages=4 vmalloc
> 0xffffc90000043000-0xffffc90000047000 16384 n_tty_open+0x11/0xe0 pages=3 vmalloc
> 0xffffc90000211000-0xffffc90000232000 135168 crypto_scomp_init_tfm+0xc6/0xf0 pages=32 vmalloc
> 0xffffc90000232000-0xffffc90000253000 135168 crypto_scomp_init_tfm+0x67/0xf0 pages=32 vmalloc
> 0xffffc900005a9000-0xffffc900005ba000 69632 pcpu_create_chunk+0x7b/0x260 pages=16 vmalloc
> 0xffffc900005ba000-0xffffc900005cc000 73728 pcpu_create_chunk+0xb2/0x260 pages=17 vmalloc
> 0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x2290 vmalloc
>
> With randomize_vmalloc=1, the allocations are randomized:
> $ grep -v kernel_clone /proc/vmallocinfo
> 0xffffc9759d443000-0xffffc9759d445000 8192 hpet_enable+0x31/0x4a4 phys=0x00000000fed00000 ioremap
> 0xffffccf1e9f66000-0xffffccf1e9f68000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffcd2fc02a4000-0xffffcd2fc02a6000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffcdaefb898000-0xffffcdaefb89b000 12288 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe0000 ioremap
> 0xffffcef8074c3000-0xffffcef8074cc000 36864 irq_init_percpu_irqstack+0x176/0x1c0 vmap
> 0xffffcf725ca2e000-0xffffcf725ca4f000 135168 crypto_scomp_init_tfm+0xc6/0xf0 pages=32 vmalloc
> 0xffffd0efb25e1000-0xffffd0efb25f2000 69632 pcpu_create_chunk+0x7b/0x260 pages=16 vmalloc
> 0xffffd27054678000-0xffffd2705467c000 16384 n_tty_open+0x11/0xe0 pages=3 vmalloc
> 0xffffd2adf716e000-0xffffd2adf7180000 73728 pcpu_create_chunk+0xb2/0x260 pages=17 vmalloc
> 0xffffd4ba5fb6b000-0xffffd4ba5fb6d000 8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe1000 ioremap
> 0xffffded126192000-0xffffded126194000 8192 memremap+0x19c/0x280 phys=0x00000000000f5000 ioremap
> 0xffffe01a4dbcd000-0xffffe01a4dbcf000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffe4b649952000-0xffffe4b649954000 8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x00000000fed00000 ioremap
> 0xffffe71ed592a000-0xffffe71ed592c000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffe7dc5824f000-0xffffe7dc58270000 135168 crypto_scomp_init_tfm+0x67/0xf0 pages=32 vmalloc
> 0xffffe8f4f9800000-0xffffe8f4f9a00000 2097152 pcpu_get_vm_areas+0x0/0x2290 vmalloc
> 0xffffe8f4f9a19000-0xffffe8f4f9a1e000 20480 pcpu_create_chunk+0xe8/0x260 pages=4 vmalloc
>
> With CONFIG_VMAP_STACK, also kernel thread stacks are placed in
> vmalloc area and therefore they also get randomized (only one example
> line from /proc/vmallocinfo shown for brevity):
>
> unrandomized:
> 0xffffc90000018000-0xffffc90000021000 36864 kernel_clone+0xf9/0x560 pages=8 vmalloc
>
> randomized:
> 0xffffcb57611a8000-0xffffcb57611b1000 36864 kernel_clone+0xf9/0x560 pages=8 vmalloc
>
> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> CC: Andy Lutomirski <luto@xxxxxxxxxx>
> CC: Jann Horn <jannh@xxxxxxxxxx>
> CC: Kees Cook <keescook@xxxxxxxxxxxx>
> CC: Linux API <linux-api@xxxxxxxxxxxxxxx>
> CC: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> CC: Mike Rapoport <rppt@xxxxxxxxxx>
> CC: Vlad Rezki <urezki@xxxxxxxxx>
> Signed-off-by: Topi Miettinen <toiwoton@xxxxxxxxx>
> ---
> v2: retry allocation from other end of vmalloc space in case of
> failure (Matthew Wilcox), improve commit message and documentation
> v3: randomize also percpu allocations (pcpu_get_vm_areas())
> v4: use static branches (Kees Cook) and make the parameter boolean.
> ---
> .../admin-guide/kernel-parameters.txt | 24 ++++++++++
> mm/vmalloc.c | 44 +++++++++++++++++--
> 2 files changed, 65 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a10b545c2070..726aec542079 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4024,6 +4024,30 @@
>
> ramdisk_start= [RAM] RAM disk image start address
>
> + randomize_vmalloc= [KNL] Boolean option to randomize vmalloc()
> + allocations. When enabled, the entire
> + vmalloc() area is used randomly to make the
> + allocations less predictable and harder to
> + guess for attackers. Also module and BPF code
> + locations get randomized (within their
> + dedicated and rather small area though) and if
> + CONFIG_VMAP_STACK is enabled, also kernel
> + thread stack locations.
> +
> + On 32 bit systems this may cause problems due
> + to increased VM fragmentation if the address
> + space gets crowded.
>
What kind of problems? Could you please be more specific? I guess the
failure rate of allocations will increase.

> +
> + On all systems, it will reduce performance and
> + increase memory and cache usage due to less
> + efficient use of page tables and inability to
> + merge adjacent VMAs with compatible
> + attributes. On x86_64 with 5 level page
> + tables, in the worst case, additional page
> + table entries of up to 4 pages are created for
> + each mapping, so with small mappings there's
> + considerable penalty.
Could you please put test results into the commit message? You can run
"test_vmalloc.sh performance" on your system. It will give us some
figures to understand the performance difference.

> +
> random.trust_cpu={on,off}
> [KNL] Enable or disable trusting the use of the
> CPU's random number generator (if available) to
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e6f352bf0498..b5ecf27dc98e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -34,6 +34,7 @@
> #include <linux/bitops.h>
> #include <linux/rbtree_augmented.h>
> #include <linux/overflow.h>
> +#include <linux/random.h>
>
> #include <linux/uaccess.h>
> #include <asm/tlbflush.h>
> @@ -1089,6 +1090,25 @@ adjust_va_to_fit_type(struct vmap_area *va,
> return 0;
> }
>
> +static DEFINE_STATIC_KEY_FALSE_RO(randomize_vmalloc);
> +
> +static int __init set_randomize_vmalloc(char *str)
> +{
> + int ret;
> + bool bool_result;
> +
> + ret = kstrtobool(str, &bool_result);
> + if (ret)
> + return ret;
> +
> + if (bool_result)
> + static_branch_enable(&randomize_vmalloc);
> + else
> + static_branch_disable(&randomize_vmalloc);
> + return 1;
> +}
> +__setup("randomize_vmalloc=", set_randomize_vmalloc);
> +
> /*
> * Returns a start address of the newly allocated area, if success.
> * Otherwise a vend is returned that indicates failure.
> @@ -1162,7 +1182,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> int node, gfp_t gfp_mask)
> {
> struct vmap_area *va, *pva;
> - unsigned long addr;
> + unsigned long addr, voffset;
> int purged = 0;
> int ret;
>
> @@ -1217,11 +1237,24 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
> kmem_cache_free(vmap_area_cachep, pva);
>
> + /* Randomize allocation */
> + if (static_branch_unlikely(&randomize_vmalloc)) {
> + voffset = get_random_long() & (roundup_pow_of_two(vend - vstart) - 1);
> + voffset = PAGE_ALIGN(voffset);
> + if (voffset + size > vend - vstart)
> + voffset = vend - vstart - size;
> + } else
> + voffset = 0;
> +
Could you please wrap that change into a separate function? For example
randomize_voffset_with_range(start, end).
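
Something along the lines of the below. This is only a rough user-space
model of the offset computation from the hunk above, not tested against
the kernel tree. The extra "size" and "rnd" arguments, the uint64_t types,
the MODEL_* / model_* names and the early "size >= range" check are
assumptions made for illustration; in the kernel the random value would
come from get_random_long() and the helpers from the usual headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KiB pages; stand-ins for the kernel's PAGE_SIZE/PAGE_ALIGN(). */
#define MODEL_PAGE_SIZE		UINT64_C(4096)
#define MODEL_PAGE_ALIGN(x)	(((x) + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1))

/*
 * User-space stand-in for roundup_pow_of_two().
 * Assumes n <= 2^63 so that the shift cannot wrap to zero.
 */
static uint64_t model_roundup_pow_of_two(uint64_t n)
{
	uint64_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Return a page-aligned offset into [start, end) such that an area of
 * "size" bytes placed at start + offset still fits below "end".
 * "rnd" is the raw random value supplied by the caller.
 */
static uint64_t randomize_voffset_with_range(uint64_t start, uint64_t end,
					     uint64_t size, uint64_t rnd)
{
	uint64_t range, voffset;

	assert(end > start);
	range = end - start;

	/* Assumption of this sketch: no room to move, so do not randomize. */
	if (size >= range)
		return 0;

	/* Same steps as in the patch: mask, page-align, clamp. */
	voffset = rnd & (model_roundup_pow_of_two(range) - 1);
	voffset = MODEL_PAGE_ALIGN(voffset);
	if (voffset + size > range)
		voffset = range - size;

	return voffset;
}
```

With that, the randomization in alloc_vmap_area() reduces to a single
assignment under static_branch_unlikely(&randomize_vmalloc), and the
range arithmetic can be reviewed in one place.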

> /*
> * If an allocation fails, the "vend" address is
> * returned. Therefore trigger the overflow path.
> */
> - addr = __alloc_vmap_area(size, align, vstart, vend);
> + addr = __alloc_vmap_area(size, align, vstart + voffset, vend);
> +
> + if (unlikely(addr == vend) && voffset)
> + /* Retry randomization from other end */
> + addr = __alloc_vmap_area(size, align, vstart, vstart + voffset + size);
> spin_unlock(&free_vmap_area_lock);
>
> if (unlikely(addr == vend))
> @@ -3258,7 +3291,12 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
> start = offsets[area];
> end = start + sizes[area];
>
> - va = pvm_find_va_enclose_addr(vmalloc_end);
> + if (static_branch_unlikely(&randomize_vmalloc))
> + va = pvm_find_va_enclose_addr(vmalloc_start +
> + (get_random_long() &
> + (roundup_pow_of_two(vmalloc_end - vmalloc_start) - 1)));
> + else
> + va = pvm_find_va_enclose_addr(vmalloc_end);
> base = pvm_determine_end_from_reverse(&va, align) - end;
As for the per-cpu embedded allocator: even though it is currently part
of the vmalloc space, it is not a vmalloc() allocation. Please do not change
its code. It does allocations by "chunks", where an internal structure
represents a special memory layout that is used for the actual allocations.


Also, using the vmalloc test driver I can trigger a kernel BUG:

<snip>
[ 24.627577] kernel BUG at mm/vmalloc.c:1272!
[ 24.628645] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 24.628684] CPU: 30 PID: 929 Comm: vmalloc_test/0 Tainted: G E 5.11.0-next-20210225-next #484
[ 24.628724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 24.628757] RIP: 0010:alloc_vmap_area.isra.53+0x593/0xf10
[ 24.628784] Code: 41 5d 41 5e 41 5f c3 0f 0b 0f 0b 48 c7 44 24 10 f0 ff ff ff eb c9 48 8d 5a f0 e9 9c fc ff ff 48 c7 44 24 10 f4 ff ff ff eb b5 <0f> 0b 4c 8d 4b 10 48 39 d0 74 12 48 8b 44 24 18 31 ff 48 89 03 48
[ 24.628853] RSP: 0018:ffffc4296cf67d38 EFLAGS: 00010206
[ 24.628876] RAX: ffffd6db9e61a000 RBX: ffff8ae9c9309440 RCX: 0000000000000001
[ 24.628905] RDX: 0000000080000001 RSI: ffff8ae9c0046be8 RDI: 00000000ffffffff
[ 24.628933] RBP: 0000000000002000 R08: ffff8ae9c13699e8 R09: ffffb98000000000
[ 24.628961] R10: ffffd6db9e61a000 R11: 000000003aa1c801 R12: ffff8ae9c9f0d280
[ 24.628989] R13: 0000008000001fff R14: ffffff8000000000 R15: 0000007fffffffff
[ 24.629019] FS: 0000000000000000(0000) GS:ffff8af8bf580000(0000) knlGS:0000000000000000
[ 24.629051] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 24.629074] CR2: 000055916370aa80 CR3: 00000006bf40a000 CR4: 00000000000006e0
[ 24.629103] Call Trace:
[ 24.629128] ? map_kernel_range_noflush+0x27a/0x360
[ 24.629150] ? kmem_cache_alloc_trace+0x340/0x460
[ 24.629172] __get_vm_area_node.isra.54+0xa7/0x150
[ 24.629195] ? fix_size_alloc_test+0x50/0x50 [test_vmalloc]
[ 24.629221] __vmalloc_node_range+0x64/0x230
[ 24.629241] ? test_func+0xdb/0x1f0 [test_vmalloc]
[ 24.629263] ? fix_size_alloc_test+0x50/0x50 [test_vmalloc]
[ 24.629288] __vmalloc_node+0x3b/0x40
[ 24.629305] ? test_func+0xdb/0x1f0 [test_vmalloc]
[ 24.629326] align_shift_alloc_test+0x39/0x50 [test_vmalloc]
[ 24.629350] test_func+0xdb/0x1f0 [test_vmalloc]
[ 24.629372] ? fix_align_alloc_test+0x50/0x50 [test_vmalloc]
[ 24.629396] kthread+0x13d/0x160
[ 24.629413] ? kthread_park+0x80/0x80
[ 24.629431] ret_from_fork+0x22/0x30
<snip>

--
Vlad Rezki