Re: [PATCH 6/6] perf thread-stack: Hide x86 retpolines

From: Adrian Hunter
Date: Thu Jan 10 2019 - 02:53:56 EST


On 9/01/19 5:38 PM, Jiri Olsa wrote:
> On Wed, Jan 09, 2019 at 11:18:35AM +0200, Adrian Hunter wrote:
>> x86 retpoline functions pollute the call graph by showing up everywhere
>> there is an indirect branch, but they do not really mean anything. Make
>> changes so that the default retpoline functions will no longer appear in
>> the call graph. Note this only affects the call graph, since all the
>> original branches are left unchanged.
>>
>> This does not handle function return thunks, nor is there any improvement
>> for the handling of inline thunks or extern thunks.
>>
>> Example:
>>
>> $ cat simple-retpoline.c
>> __attribute__((noinline)) int bar(void)
>> {
>> return -1;
>> }
>>
>> int foo(void)
>> {
>> return bar() + 1;
>> }
>>
>> __attribute__((indirect_branch("thunk"))) int main()
>> {
>> int (*volatile fn)(void) = foo;
>>
>> fn();
>> return fn();
>> }
>> $ gcc -ggdb3 -Wall -Wextra -O2 -o simple-retpoline simple-retpoline.c
>> $ objdump -d simple-retpoline
>> <SNIP>
>> 0000000000001040 <main>:
>> 1040: 48 83 ec 18 sub $0x18,%rsp
>> 1044: 48 8d 05 25 01 00 00 lea 0x125(%rip),%rax # 1170 <foo>
>> 104b: 48 89 44 24 08 mov %rax,0x8(%rsp)
>> 1050: 48 8b 44 24 08 mov 0x8(%rsp),%rax
>> 1055: e8 1f 01 00 00 callq 1179 <__x86_indirect_thunk_rax>
>> 105a: 48 8b 44 24 08 mov 0x8(%rsp),%rax
>> 105f: 48 83 c4 18 add $0x18,%rsp
>> 1063: e9 11 01 00 00 jmpq 1179 <__x86_indirect_thunk_rax>
>> <SNIP>
>> 0000000000001160 <bar>:
>> 1160: b8 ff ff ff ff mov $0xffffffff,%eax
>> 1165: c3 retq
>> <SNIP>
>> 0000000000001170 <foo>:
>> 1170: e8 eb ff ff ff callq 1160 <bar>
>> 1175: 83 c0 01 add $0x1,%eax
>> 1178: c3 retq
>> 0000000000001179 <__x86_indirect_thunk_rax>:
>> 1179: e8 07 00 00 00 callq 1185 <__x86_indirect_thunk_rax+0xc>
>> 117e: f3 90 pause
>> 1180: 0f ae e8 lfence
>> 1183: eb f9 jmp 117e <__x86_indirect_thunk_rax+0x5>
>> 1185: 48 89 04 24 mov %rax,(%rsp)
>> 1189: c3 retq
>> <SNIP>
>> $ perf record -o simple-retpoline.perf.data -e intel_pt/cyc/u ./simple-retpoline
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0,017 MB simple-retpoline.perf.data ]
>> $ perf script -i simple-retpoline.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py simple-retpoline.db branches calls
>> 2019-01-08 14:03:37.851655 Creating database...
>> 2019-01-08 14:03:37.863256 Writing records...
>> 2019-01-08 14:03:38.069750 Adding indexes
>> 2019-01-08 14:03:38.078799 Done
>> $ ~/libexec/perf-core/scripts/python/exported-sql-viewer.py simple-retpoline.db
>>
>> Before:
>>
>> main
>> -> __x86_indirect_thunk_rax
>> -> __x86_indirect_thunk_rax
>> -> foo
>> -> bar
>>
>> After:
>>
>> main
>> -> foo
>> -> bar
>>
>
> hi,
> I still see them in kernel report.. after a while the output is shifted
> too far right out of the screen due to (I guess) this call/ret imbalance
>
>
> [root@krava perf]# ./perf-with-kcore record pt -e intel_pt/cyc/k -a -- sleep 1
> [root@krava perf]# ./perf-with-kcore script pt --call-trace -C 0
> Xorg 2115 [000] 5850.032452321: cbr: 36 freq: 3621 MHz (171%)
> Xorg 2115 [000] 5850.032452351: ([kernel.kallsyms]) __list_add_valid
> Xorg 2115 [000] 5850.032452380: ([kernel.kallsyms]) perf_pmu_enable.part.101
> Xorg 2115 [000] 5850.032452380: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032452415: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032452415: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032452442: ([kernel.kallsyms]) __list_add_valid
> Xorg 2115 [000] 5850.032452457: ([kernel.kallsyms]) rb_next
> Xorg 2115 [000] 5850.032452494: ([kernel.kallsyms]) __x86_indirect_thunk_r12
> Xorg 2115 [000] 5850.032452494: ([kernel.kallsyms]) __x86_indirect_thunk_r12
> Xorg 2115 [000] 5850.032452535: ([kernel.kallsyms]) group_sched_in
> Xorg 2115 [000] 5850.032452535: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032452535: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032452548: ([kernel.kallsyms]) event_sched_in.isra.113
> Xorg 2115 [000] 5850.032452548: ([kernel.kallsyms]) perf_event_set_state.part.70
> Xorg 2115 [000] 5850.032452548: ([kernel.kallsyms]) perf_event_update_time
> Xorg 2115 [000] 5850.032452562: ([kernel.kallsyms]) perf_log_itrace_start
> Xorg 2115 [000] 5850.032452562: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032452562: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032452600: ([kernel.kallsyms]) perf_event_update_userpage
> Xorg 2115 [000] 5850.032452600: ([kernel.kallsyms]) calc_timer_values
> Xorg 2115 [000] 5850.032452600: ([kernel.kallsyms]) sched_clock_cpu
> Xorg 2115 [000] 5850.032452600: ([kernel.kallsyms]) sched_clock
> Xorg 2115 [000] 5850.032452600: ([kernel.kallsyms]) native_sched_clock
> Xorg 2115 [000] 5850.032452793: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032452793: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032452867: ([kernel.kallsyms]) arch_perf_update_userpage
> Xorg 2115 [000] 5850.032452867: ([kernel.kallsyms]) using_native_sched_clock
> Xorg 2115 [000] 5850.032452867: ([kernel.kallsyms]) sched_clock_stable
> Xorg 2115 [000] 5850.032452867: ([kernel.kallsyms]) cyc2ns_read_begin
> Xorg 2115 [000] 5850.032453157: ([kernel.kallsyms]) cyc2ns_read_end
> Xorg 2115 [000] 5850.032453194: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032453194: ([kernel.kallsyms]) __indirect_thunk_start
> Xorg 2115 [000] 5850.032453206: ([kernel.kallsyms]) __list_add_valid
>
>
> jirka
>

These patches are for db export and the context-sensitive call chain report
in the exported-sql-viewer.py python script. The callchains you see above
are done at a lower level, before symbol resolution. If you don't want to
see the thunks you can grep them out. But more work is needed on them too.
I'll add it to the TODO list.