Re: [PATCHv4 14/14] x86/mm: Offset boot-time paging mode switching cost

From: Ingo Molnar
Date: Thu Aug 17 2017 - 05:22:00 EST



* Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> wrote:

> By this point we have functioning boot-time switching between 4- and
> 5-level paging mode. But the naive approach comes with a cost.
>
> Numbers below are for kernel build, allmodconfig, 5 times.
>
> CONFIG_X86_5LEVEL=n:
>
> Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
>
> 17308719.892691 task-clock:u (msec) # 26.772 CPUs utilized ( +- 0.11% )
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 331,993,164 page-faults:u # 0.019 M/sec ( +- 0.01% )
> 43,614,978,867,455 cycles:u # 2.520 GHz ( +- 0.01% )
> 39,371,534,575,126 stalled-cycles-frontend:u # 90.27% frontend cycles idle ( +- 0.09% )
> 28,363,350,152,428 instructions:u # 0.65 insn per cycle
> # 1.39 stalled cycles per insn ( +- 0.00% )
> 6,316,784,066,413 branches:u # 364.948 M/sec ( +- 0.00% )
> 250,808,144,781 branch-misses:u # 3.97% of all branches ( +- 0.01% )
>
> 646.531974142 seconds time elapsed ( +- 1.15% )
>
> CONFIG_X86_5LEVEL=y:
>
> Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
>
> 17411536.780625 task-clock:u (msec) # 26.426 CPUs utilized ( +- 0.10% )
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 331,868,663 page-faults:u # 0.019 M/sec ( +- 0.01% )
> 43,865,909,056,301 cycles:u # 2.519 GHz ( +- 0.01% )
> 39,740,130,365,581 stalled-cycles-frontend:u # 90.59% frontend cycles idle ( +- 0.05% )
> 28,363,358,997,959 instructions:u # 0.65 insn per cycle
> # 1.40 stalled cycles per insn ( +- 0.00% )
> 6,316,784,937,460 branches:u # 362.793 M/sec ( +- 0.00% )
> 251,531,919,485 branch-misses:u # 3.98% of all branches ( +- 0.00% )
>
> 658.886307752 seconds time elapsed ( +- 0.92% )
>
> The patch tries to fix the performance regression by using
> !cpu_feature_enabled(X86_FEATURE_LA57) instead of p4d_folded in all hot
> code paths. These will statically patch the target code for additional
> performance.
>
> Also, I had to rewrite a number of static inline helpers as macros.
> It was needed to break header dependency loop between cpufeature.h and
> pgtable_types.h.
>
> CONFIG_X86_5LEVEL=y + the patch:
>
> Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
>
> 17381990.268506 task-clock:u (msec) # 26.907 CPUs utilized ( +- 0.19% )
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 331,862,625 page-faults:u # 0.019 M/sec ( +- 0.01% )
> 43,697,726,320,051 cycles:u # 2.514 GHz ( +- 0.03% )
> 39,480,408,690,401 stalled-cycles-frontend:u # 90.35% frontend cycles idle ( +- 0.05% )
> 28,363,394,221,388 instructions:u # 0.65 insn per cycle
> # 1.39 stalled cycles per insn ( +- 0.00% )
> 6,316,794,985,573 branches:u # 363.410 M/sec ( +- 0.00% )
> 251,013,232,547 branch-misses:u # 3.97% of all branches ( +- 0.01% )
>
> 645.991174661 seconds time elapsed ( +- 1.19% )

Ok - these measurements are very nice and address many of my worries about earlier
parts of the series.

Anyway, please split this patch up some more (any of the optimizations could
regress by itself); my renaming suggestions still stand as well.

> @@ -11,6 +11,11 @@
> #undef CONFIG_PARAVIRT_SPINLOCKS
> #undef CONFIG_KASAN
>
> +#ifdef CONFIG_X86_5LEVEL
> +/* cpu_feature_enabled() cannot be used that early */
> +#define p4d_folded __p4d_folded
> +#endif
> +
> #include <linux/linkage.h>
> #include <linux/screen_info.h>
> #include <linux/elf.h>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 077e8b45784c..702a1feb4991 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -274,15 +274,8 @@ return_from_SYSCALL_64:
> * depending on paging mode) in the address.
> */
> #ifdef CONFIG_X86_5LEVEL
> - testl $1, p4d_folded(%rip)
> - jnz 1f
> - shl $(64 - 57), %rcx
> - sar $(64 - 57), %rcx
> - jmp 2f
> -1:
> - shl $(64 - 48), %rcx
> - sar $(64 - 48), %rcx
> -2:
> + ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> + "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57

Ignore my earlier suggestion to use alternatives, you already implemented it!
This is what I get for replying to a patch series in chronological order. ;-)

I suspect the syscall overhead was the main reason for the performance regression.

Thanks,

Ingo