RE: [PATCH v2] RFC: clear 1G pages with streaming stores on x86

From: Elliott, Robert (Persistent Memory)
Date: Wed Jul 25 2018 - 01:02:54 EST

> -----Original Message-----
> From: linux-kernel-owner@xxxxxxxxxxxxxxx <linux-kernel-
> owner@xxxxxxxxxxxxxxx> On Behalf Of Cannon Matthews
> Sent: Tuesday, July 24, 2018 9:37 PM
> Subject: Re: [PATCH v2] RFC: clear 1G pages with streaming stores on
> x86
>
> Reimplement clear_gigantic_page() to clear gigantic pages using the
> non-temporal streaming store instructions that bypass the cache
> (movnti), since an entire 1GiB region will not fit in the cache
> anyway.
>
> Doing an mlock() on a 512GiB 1G-hugetlb region previously would take
> on average 134 seconds, about 260ms/GiB which is quite slow. Using
> `movnti` and optimizing the control flow over the constituent small
> pages, this can be improved roughly by a factor of 3-4x, with the
> 512GiB mlock() taking only 34 seconds on average, or 67ms/GiB.

...
> - Are there any obvious pitfalls or caveats that have not been
> considered?

Note that Kirill attempted something like this in 2012 - see
https://www.spinics.net/lists/linux-mm/msg40575.html

...
> +++ b/arch/x86/lib/clear_gigantic_page.c
> @@ -0,0 +1,29 @@
> +#include <asm/page.h>
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +#define PAGES_BETWEEN_RESCHED 64
> +void clear_gigantic_page(struct page *page,
> + unsigned long addr,

The previous attempt used cacheable stores for the page containing
addr to avoid the otherwise inevitable cache miss when that page is
touched after the clearing completes. This function does not use addr
at all.
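
Something along these lines (just a sketch, untested; it assumes
pages_per_huge_page is a power of two and reuses __clear_page_nt()
and PAGES_BETWEEN_RESCHED from this patch) would keep the faulting
subpage warm:

void clear_gigantic_page(struct page *page, unsigned long addr,
                         unsigned int pages_per_huge_page)
{
        void *dest = page_to_virt(page);
        /* index of the subpage that the faulting access falls in */
        unsigned long target = (addr >> PAGE_SHIFT) &
                               (pages_per_huge_page - 1);
        unsigned int i;

        for (i = 0; i < pages_per_huge_page; i++) {
                if (i == target)
                        clear_page(dest + i * PAGE_SIZE); /* cached stores */
                else
                        __clear_page_nt(dest + i * PAGE_SIZE, PAGE_SIZE);
                if ((i % PAGES_BETWEEN_RESCHED) == 0)
                        cond_resched();
        }
        /* __clear_page_nt requires an sfence barrier */
        wmb();
}

Clearing one subpage per call gives up some batching; the non-target
subpages could still be done in PAGES_BETWEEN_RESCHED-sized runs, the
point is only the special case for the target subpage.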

> + unsigned int pages_per_huge_page)
> +{
> + int i;
> + void *dest = page_to_virt(page);
> + int resched_count = 0;
> +
> + BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> + BUG_ON(!dest);

Are those really possible conditions? Is there a safer fallback
than crashing the whole kernel?
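
If they can't be ruled out, a gentler option (again only a sketch)
would be to WARN and degrade rather than take the whole machine down,
e.g. in place of the two BUG_ON()s:

        if (WARN_ON_ONCE(!dest))
                return;
        if (WARN_ON_ONCE(pages_per_huge_page % PAGES_BETWEEN_RESCHED)) {
                /* odd size: clear one subpage at a time instead */
                for (i = 0; i < pages_per_huge_page; i++) {
                        __clear_page_nt(dest + i * PAGE_SIZE, PAGE_SIZE);
                        cond_resched();
                }
                wmb();
                return;
        }

(Returning with the page untouched in the !dest case isn't great
either, so that one may be better shown to be impossible for pages
coming out of the hugetlb allocator.)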

> +
> + for (i = 0; i < pages_per_huge_page; i += PAGES_BETWEEN_RESCHED) {
> + __clear_page_nt(dest + (i * PAGE_SIZE),
> + PAGES_BETWEEN_RESCHED * PAGE_SIZE);
> + resched_count += cond_resched();
> + }
> + /* __clear_page_nt requrires and `sfence` barrier. */

requires an

...
> diff --git a/arch/x86/lib/clear_page_64.S
...
> +/*
> + * Zero memory using non temporal stores, bypassing the cache.
> + * Requires an `sfence` (wmb()) afterwards.
> + * %rdi - destination.
> + * %rsi - page size. Must be 64 bit aligned.
> +*/
> +ENTRY(__clear_page_nt)
> + leaq (%rdi,%rsi), %rdx
> + xorl %eax, %eax
> + .p2align 4,,10
> + .p2align 3
> +.L2:
> + movnti %rax, (%rdi)
> + addq $8, %rdi

Also consider using the AVX vmovntdq instruction (if available); the
most recent AVX-512 form does 64-byte (cache line) sized stores from
the zmm registers. There's a hefty context switching overhead (e.g.,
304 clocks), but it might be worthwhile for 1 GiB (which
is 16,777,216 cache lines).
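
Roughly what I have in mind (an untested sketch only;
clear_page_nt_avx512() is a made-up name, the destination must be
64-byte aligned, and the length a non-zero multiple of 64 bytes):

#include <asm/cpufeature.h>     /* boot_cpu_has() */
#include <asm/fpu/api.h>        /* kernel_fpu_begin()/kernel_fpu_end() */

static void clear_page_nt_avx512(void *dest, unsigned long bytes)
{
        if (!boot_cpu_has(X86_FEATURE_AVX512F)) {
                __clear_page_nt(dest, bytes);   /* movnti path from this patch */
                return;
        }

        /*
         * The vector registers may only be used between
         * kernel_fpu_begin()/kernel_fpu_end(); that bracket is where
         * the context switching overhead comes from.  The kernel
         * proper is built without AVX, so clobbering zmm0 here is
         * fine.
         */
        kernel_fpu_begin();
        asm volatile("vpxorq   %%zmm0, %%zmm0, %%zmm0\n\t"
                     "1: vmovntdq %%zmm0, (%0)\n\t"
                     "   addq     $64, %0\n\t"
                     "   subq     $64, %1\n\t"
                     "   jnz      1b"
                     : "+r" (dest), "+r" (bytes)
                     : : "memory", "cc");
        kernel_fpu_end();
        wmb();  /* sfence: order the non-temporal stores */
}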

glibc memcpy() makes that choice for transfers larger than 75% of the
L3 cache size divided by the number of cores. (Last I tried, it was
still selecting "rep stosb" for large memset()s, although it has an
AVX-512 function available.)
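For a sense of scale, on a hypothetical part with a 32 MiB L3 shared
by 16 cores, that threshold works out to 0.75 * 32 MiB / 16 = 1.5 MiB
per call.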

Even with that, one CPU core won't saturate the memory bus; multiple
CPU cores (preferably on the same NUMA node as the memory) need to
share the work.
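
To illustrate the shape of that (a sketch only, not something I've
run; clear_chunk, clear_chunk_fn(), clear_gigantic_page_parallel(),
and the fan-out of four workers are all made up here, and it reuses
__clear_page_nt() from this patch):

#include <linux/slab.h>
#include <linux/topology.h>     /* cpumask_of_node() */
#include <linux/workqueue.h>

struct clear_chunk {
        struct work_struct work;
        void *dest;
        unsigned long bytes;
};

static void clear_chunk_fn(struct work_struct *work)
{
        struct clear_chunk *c = container_of(work, struct clear_chunk, work);

        __clear_page_nt(c->dest, c->bytes);
        wmb();          /* sfence before the flush_work() below returns */
}

static void clear_gigantic_page_parallel(struct page *page,
                                         unsigned int pages_per_huge_page)
{
        const struct cpumask *mask = cpumask_of_node(page_to_nid(page));
        unsigned int nr = min_t(unsigned int, 4, cpumask_weight(mask));
        void *dest = page_to_virt(page);
        struct clear_chunk *chunks;
        unsigned int per, i, cpu;

        chunks = nr >= 2 ? kcalloc(nr, sizeof(*chunks), GFP_KERNEL) : NULL;
        if (!chunks) {
                /* single-threaded fallback */
                __clear_page_nt(dest,
                                (unsigned long)pages_per_huge_page * PAGE_SIZE);
                wmb();
                return;
        }

        per = pages_per_huge_page / nr;
        i = 0;
        for_each_cpu(cpu, mask) {
                chunks[i].dest  = dest + (unsigned long)i * per * PAGE_SIZE;
                chunks[i].bytes = (unsigned long)per * PAGE_SIZE;
                if (i == nr - 1)        /* last chunk takes any remainder */
                        chunks[i].bytes = (unsigned long)
                                (pages_per_huge_page - i * per) * PAGE_SIZE;
                INIT_WORK(&chunks[i].work, clear_chunk_fn);
                queue_work_on(cpu, system_unbound_wq, &chunks[i].work);
                if (++i == nr)
                        break;
        }

        for (i = 0; i < nr; i++)
                flush_work(&chunks[i].work);
        kfree(chunks);
}

The unbound workqueue is only a rough way of keeping the workers near
the right node; bound per-CPU work items or dedicated kthreads would
do just as well.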

---
Robert Elliott, HPE Persistent Memory