Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter

From: Ingo Molnar
Date: Fri Apr 03 2015 - 10:03:49 EST

Next message: Quentin Casasnovas: "Re: [PATCH] x86/xsave: Robustify and merge macros"
Previous message: Georgi Djakov: "[PATCH] clk: qcom: Fix parent_map translations"
In reply to: Denys Vlasenko: "[PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter"
Next in thread: Denys Vlasenko: "Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

* Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:

> Interrupt entry points are handled with the following code:
> Each 32-byte code block contains seven entry points
>
> ...
> [push][jump 22] // 4 bytes
> [push][jump 18] // 4 bytes
> [push][jump 14] // 4 bytes
> [push][jump 10] // 4 bytes
> [push][jump 6] // 4 bytes
> [push][jump 2] // 4 bytes
> [push][jump common_interrupt][padding] // 8 bytes
>
> [push][jump]
> [push][jump]
> [push][jump]
> [push][jump]
> [push][jump]
> [push][jump]
> [push][jump common_interrupt][padding]
>
> [padding_2]
> common_interrupt:
>
> The first six jumps are short (2-byte insns) jumps to the last
> jump in the block. The last jump usually has to be a longer, 5-byte form.
>
> There are about 30 such blocks.
>
> This change uses the fact that last few of these blocks are close to
> "common_interrupt" label and the last jump there can also be a short one.
>
> This allows several last 32-byte code blocks to contain eight, not seven entry points,
> and to have all these entry points jump directly to common_interrupt,
> eliminating one branch. They look like this:
>
> [push][jump common_interrupt] // 4 bytes
> [push][jump common_interrupt] // 4 bytes
> [push][jump common_interrupt] // 4 bytes
> [push][jump common_interrupt] // 4 bytes
> [push][jump common_interrupt] // 4 bytes
> [push][jump common_interrupt] // 4 bytes
> [push][jump common_interrupt] // 4 bytes
> [push][jump common_interrupt] // 4 bytes
>
> This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before dispatch table
> and before common_interrupt with 32-byte alignment.
>
> The first alignment was completely unnecessary:
> the dispatch table will not benefit from any alignment bigger than 32 bytes
> (any entry point needs at most only 32 bytes of table to be read by the CPU).
>
> The alignment before common_interrupt label affects the size of padding
> (the "[padding_2]" above) and excessive padding reduces the number of blocks
> which can be optimized. In the .config I tested with, this alignment was 64 bytes,
> this means two fewer blocks could be optimized. I believe 32-byte alignment
> should do just fine.
>
> The condition which controls the choice of a direct jump versus short one,
>
> /* Can we reach common_interrupt with a short jump? */
> .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>
> was tested to be a passimistic one: the generated code will not erroneously
> generate a longer than necessary jump even with maximum possible padding
> and with -1 subtracted from the right hand side above.
> This was verified by checking disassembly.
>
> However, with current value of FIRST_SYSTEM_VECTOR it so happens
> that the padding before "common_interrupt" is zero bytes.
>
> Disassembly before After
> ========================================== ==================================
> ... ...
> 840 6a9b push $0xffffffffffffff9b 82c 6a9d push $0xffffffffffffff9d
> 842 eb16 jmp 85a 82e eb30 jmp 860
> 844 6a9a push $0xffffffffffffff9a 830 6a9c push $0xffffffffffffff9c
> 846 eb12 jmp 85a 832 eb2c jmp 860
> 848 6a99 push $0xffffffffffffff99 834 6a9b push $0xffffffffffffff9b
> 84a eb0e jmp 85a 836 eb28 jmp 860
> 84c 6a98 push $0xffffffffffffff98 838 6a9a push $0xffffffffffffff9a
> 84e eb0a jmp 85a 83a eb24 jmp 860
> 850 6a97 push $0xffffffffffffff97 83c 6a99 push $0xffffffffffffff99
> 852 eb06 jmp 85a 83e eb20 jmp 860
> 854 6a96 push $0xffffffffffffff96 840 6a98 push $0xffffffffffffff98
> 856 eb02 jmp 85a 842 eb1c jmp 860
> 858 6a95 push $0xffffffffffffff95 844 6a97 push $0xffffffffffffff97
> 85a eb24 jmp 880 846 eb18 jmp 860
> 85c 0f1f4000 nopl 0x0(%rax) 848 6a96 push $0xffffffffffffff96
> 860 6a94 push $0xffffffffffffff94 84a eb14 jmp 860
> 862 eb0c jmp 870 84c 6a95 push $0xffffffffffffff95
> 864 6a93 push $0xffffffffffffff93 84e eb10 jmp 860
> 866 eb08 jmp 870 850 6a94 push $0xffffffffffffff94
> 868 6a92 push $0xffffffffffffff92 852 eb0c jmp 860
> 86a eb04 jmp 870 854 6a93 push $0xffffffffffffff93
> 86c 6a91 push $0xffffffffffffff91 856 eb08 jmp 860
> 86e eb00 jmp 870 858 6a92 push $0xffffffffffffff92
> 870 eb0e jmp 880 85a eb04 jmp 860
> 872 66666666662e0f data32 data32 data32 data3285c 6a91 push $0xffffffffffffff91
> 879 1f840000000000 nopw %cs:0x0(%rax,%rax,1) 85e eb00 jmp 860
> 880 <common_interrupt>: 860 <common_interrupt>:
>
> Sizes:
> text data bss dec hex filename
> 12578 0 0 12578 3122 entry_64.o.before
> 12546 0 0 12546 3102 entry_64.o
>
> Signed-off-by: Denys Vlasenko <dvlasenk@xxxxxxxxxx>
> CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> CC: Steven Rostedt <rostedt@xxxxxxxxxxx>
> CC: Ingo Molnar <mingo@xxxxxxxxxx>
> CC: Borislav Petkov <bp@xxxxxxxxx>
> CC: "H. Peter Anvin" <hpa@xxxxxxxxx>
> CC: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
> CC: Oleg Nesterov <oleg@xxxxxxxxxx>
> CC: Frederic Weisbecker <fweisbec@xxxxxxxxx>
> CC: Alexei Starovoitov <ast@xxxxxxxxxxxx>
> CC: Will Drewry <wad@xxxxxxxxxxxx>
> CC: Kees Cook <keescook@xxxxxxxxxxxx>
> CC: x86@xxxxxxxxxx
> CC: linux-kernel@xxxxxxxxxxxxxxx
> ---
> arch/x86/kernel/entry_64.S | 61 +++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 49 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index da137f9..2003417 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -537,33 +537,70 @@ END(ret_from_fork)
> * Build the entry stubs and pointer table with some assembler magic.
> * We pack 7 stubs into a single 32-byte chunk, which will fit in a
> * single cache line on all modern x86 implementations.
> + * The last few cachelines pack 8 stubs each.
> */
> +ALIGN_common_interrupt=32
> .section .init.rodata,"a"
> ENTRY(interrupt)
> .section .entry.text
> - .p2align 5
> - .p2align CONFIG_X86_L1_CACHE_SHIFT
> + .align 32
> ENTRY(irq_entries_start)
> INTR_FRAME
> vector=FIRST_EXTERNAL_VECTOR
> -.rept (FIRST_SYSTEM_VECTOR-FIRST_EXTERNAL_VECTOR+6)/7
> - .balign 32
> +.rept 256 /* this number does not need to be exact, just big enough */
> +
> + /*
> + * Block of six "push + short_jump" pairs, 4 bytes each,
> + * and a 2-byte seventh "push", without its jump yet.
> + */
> +need_near_jump=0
> +push_count=0
> .rept 7
> .if vector < FIRST_SYSTEM_VECTOR
> - .if vector <> FIRST_EXTERNAL_VECTOR
> +1: pushq_cfi $(~vector+0x80) /* Note: always in signed byte range */
> + .previous
> + .quad 1b
> + .section .entry.text
> +vector=vector+1
> +push_count=push_count+1
> + .if push_count < 7
> + /* Can we reach common_interrupt with a short jump? */
> + .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
> + jmp common_interrupt /* yes */
> + .else
> + jmp 2f
> +need_near_jump=1
> + .endif
> CFI_ADJUST_CFA_OFFSET -8
> .endif
> + .endif
> + .endr
> +
> + /* If we are close to the end, we can pack in yet another pair */
> + .if need_near_jump == 0
> + .if push_count == 7
> + /* The "short jump" for the seventh "push" */
> + jmp common_interrupt
> + CFI_ADJUST_CFA_OFFSET -8
> + .endif
> + .if vector < FIRST_SYSTEM_VECTOR
> + /* "push + short_jump" pair #8 */
> 1: pushq_cfi $(~vector+0x80) /* Note: always in signed byte range */
> - .if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
> - jmp 2f
> - .endif
> - .previous
> + .previous
> .quad 1b
> - .section .entry.text
> + .section .entry.text
> vector=vector+1
> +push_count=push_count+1
> + jmp common_interrupt
> + CFI_ADJUST_CFA_OFFSET -8
> .endif
> - .endr
> + .else
> + /* Use remaining space for "near jump" for the seventh "push" */
> 2: jmp common_interrupt
> + CFI_ADJUST_CFA_OFFSET -8
> + .align 2
> + .endif
> +
> .endr
> CFI_ENDPROC
> END(irq_entries_start)

So I think this might be easier to read if we went low tech and used
an open-coded 240+ lines assembly file for this, with a few
obvious-purpose helper macros? It's not like it will change all that
often so it only has to be generated once.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Quentin Casasnovas: "Re: [PATCH] x86/xsave: Robustify and merge macros"
Previous message: Georgi Djakov: "[PATCH] clk: qcom: Fix parent_map translations"
In reply to: Denys Vlasenko: "[PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter"
Next in thread: Denys Vlasenko: "Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]