[RFC 0/4] perf record: Implement off-cpu profiling with BPF (v2)

From: Namhyung Kim
Date: Fri May 06 2022 - 16:16:37 EST


Hello,

This is the second version of off-cpu profiling support. Together with
(PMU-based) cpu profiling, it can show a holistic view of the performance
characteristics of your application or system.

Changes in v2)
* change sched_switch argument handling (Andrii)
* use task local storage (Hao)
* fix build error on !BUILD_BPF_SKEL (kernel test robot)
* add documentation regarding fp callstack (Milian)


With BPF, it can aggregate scheduling stats for tasks and/or states of
interest and convert the data into the form of perf sample records.
I chose the bpf-output event, which is a software event meant to be
used by BPF programs, and renamed it to "offcpu-time". So it requires
no change on the perf report side except for setting the sample types
of the bpf-output event.
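
To illustrate the aggregation idea, here is a minimal userspace model in
Python. It is only a sketch and not the BPF code in this series: the event
tuple layout, the function name and the sample data are all hypothetical.
It records when a task is switched out and, when the task is switched back
in, adds the elapsed time to a total keyed by (pid, stack).

```python
from collections import defaultdict


def aggregate_off_cpu(switch_events):
    """Sum off-cpu time per (pid, stack) from time-ordered switch events.

    Each event is (timestamp_ns, prev_pid, prev_stack, next_pid). The layout
    is hypothetical and only models what a sched_switch handler could see.
    """
    switched_out = {}          # pid -> (timestamp_ns, stack) at switch-out
    totals = defaultdict(int)  # (pid, stack) -> accumulated off-cpu ns

    for ts, prev_pid, prev_stack, next_pid in switch_events:
        # The task leaving the CPU starts an off-cpu interval.
        switched_out[prev_pid] = (ts, prev_stack)
        # The task entering the CPU ends its interval, if we saw it start.
        if next_pid in switched_out:
            start, stack = switched_out.pop(next_pid)
            totals[(next_pid, stack)] += ts - start
    return dict(totals)


# Made-up events: two tasks taking turns on one CPU.
events = [
    (100, 1, ("main", "__read"), 2),
    (250, 2, ("main", "__write"), 1),
    (400, 1, ("main", "__read"), 2),
    (900, 2, ("main", "__write"), 1),
]
result = aggregate_off_cpu(events)
```

Each entry in the resulting map corresponds to what would become one
sample: the stack is the callchain and the accumulated time is the period.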

Basically it collects userspace callstacks for tasks, as that is what
users mostly want. Maybe we can add support for kernel stacks, but I'm
afraid that it'd cause more overhead. So the offcpu-time event will
always have callchains regardless of the command line option, and it
enables the children mode in perf report by default.
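
As a rough illustration of what the children mode means for samples with
callchains, the sketch below computes "Children" and "Self" percentages
from a list of (callchain, period) pairs. It is a simplified model, not the
perf report implementation, and the function name and sample data are
made up.

```python
from collections import defaultdict


def self_and_children(samples):
    """Return {symbol: (children_pct, self_pct)}.

    samples is a list of (callchain, period), with the callchain ordered
    from the outermost caller to the leaf function.
    """
    total = sum(period for _, period in samples)
    self_time = defaultdict(int)
    children_time = defaultdict(int)

    for chain, period in samples:
        # Only the leaf function is charged the "self" period.
        self_time[chain[-1]] += period
        # Every function on the chain is charged once in "children".
        for sym in set(chain):
            children_time[sym] += period

    return {
        sym: (100.0 * children_time[sym] / total,
              100.0 * self_time[sym] / total)
        for sym in children_time
    }


# Made-up samples: both chains go through main, but block in different leaves.
samples = [
    (("main", "bench", "__read"), 600),
    (("main", "bench", "__write"), 400),
]
table = self_and_children(samples)
```

In this model a caller such as main ends up with a high children value and
zero self value, while all of the self time lands on the leaf functions
where the task actually blocked.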

It adds the --off-cpu option to perf record, as shown below:

$ sudo perf record -a --off-cpu -- perf bench sched messaging -l 1000
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

Total time: 1.518 [sec]
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 5.313 MB perf.data (53341 samples) ]

Then we can run perf report as usual. The options below are just there
to skip the less important parts.

$ sudo perf report --stdio --call-graph=no --percent-limit=2
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 52K of event 'cycles'
# Event count (approx.): 42522453276
#
# Children      Self  Command          Shared Object     Symbol
# ........  ........  ...............  ................  ..................................
#
     9.58%     9.58%  sched-messaging  [kernel.vmlinux]  [k] audit_filter_rules.constprop.0
     8.46%     8.46%  sched-messaging  [kernel.vmlinux]  [k] audit_filter_syscall
     4.54%     4.54%  sched-messaging  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     2.94%     2.94%  sched-messaging  [kernel.vmlinux]  [k] unix_stream_read_generic
     2.45%     2.45%  sched-messaging  [kernel.vmlinux]  [k] memcg_slab_free_hook


# Samples: 983 of event 'offcpu-time'
# Event count (approx.): 684538813464
#
# Children      Self  Command          Shared Object         Symbol
# ........  ........  ...............  ....................  ..........................
#
    83.86%     0.00%  sched-messaging  libc-2.33.so          [.] __libc_start_main
    83.86%     0.00%  sched-messaging  perf                  [.] cmd_bench
    83.86%     0.00%  sched-messaging  perf                  [.] main
    83.86%     0.00%  sched-messaging  perf                  [.] run_builtin
    83.64%     0.00%  sched-messaging  perf                  [.] bench_sched_messaging
    41.35%    41.35%  sched-messaging  libpthread-2.33.so    [.] __read
    38.88%    38.88%  sched-messaging  libpthread-2.33.so    [.] __write
     3.41%     3.41%  sched-messaging  libc-2.33.so          [.] __poll

The perf bench sched messaging run created 400 processes to send/receive
messages through unix sockets. It spent a large portion of cpu cycles
in the audit filter and in reading/copying the messages, while most of
the offcpu-time was in read and write calls.
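
To put the offcpu-time percentages into absolute terms, the small helper
below converts a share of the event count into seconds. It assumes the
offcpu-time event count is in nanoseconds, which the output above does not
state. The function name is hypothetical.

```python
def off_cpu_seconds(event_count_ns, percent):
    """Convert a percentage of the total event count into seconds.

    Assumes the event count is a sum of off-cpu time in nanoseconds.
    """
    return event_count_ns * (percent / 100.0) / 1e9


# Values copied from the offcpu-time section of the report above.
total_ns = 684538813464
read_seconds = off_cpu_seconds(total_ns, 41.35)
write_seconds = off_cpu_seconds(total_ns, 38.88)
```

Under that assumption the totals are much larger than the 1.518 second run
time, which would be expected because the off-cpu time is summed over all
tasks that were blocked during the run.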

You can get the code from the 'perf/offcpu-v2' branch in my tree at

git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Enjoy! :)

Thanks,
Namhyung


Namhyung Kim (4):
  perf report: Do not extend sample type of bpf-output event
  perf record: Enable off-cpu analysis with BPF
  perf record: Implement basic filtering for off-cpu
  perf record: Handle argument change in sched_switch

 tools/perf/Documentation/perf-record.txt |  10 +
 tools/perf/Makefile.perf                 |   1 +
 tools/perf/builtin-record.c              |  21 ++
 tools/perf/util/Build                    |   1 +
 tools/perf/util/bpf_off_cpu.c            | 298 +++++++++++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c   | 209 ++++++++++++++++
 tools/perf/util/evsel.c                  |   4 +-
 tools/perf/util/off_cpu.h                |  24 ++
 8 files changed, 566 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/util/bpf_off_cpu.c
 create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
 create mode 100644 tools/perf/util/off_cpu.h


base-commit: 33cd6928039c6bf18cf0baec936924d908e6c89b
--
2.36.0.512.ge40c2bad7a-goog