Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)

From: Stephane Eranian
Date: Mon Jan 23 2012 - 05:14:40 EST


Any comments on this patch set?


On Mon, Jan 9, 2012 at 5:49 PM, Stephane Eranian <eranian@xxxxxxxxxx> wrote:
>
> This patchset adds an important and useful new feature to
> perf_events: branch stack sampling. In other words, the
> ability to capture taken branches into each sample.
>
> Statistical sampling of taken branches should not be confused
> with branch tracing: not all branches are necessarily captured.
>
> Sampling taken branches is important for basic block profiling,
> statistical call graphs, and function call counts. Many of those
> measurements can help drive a compiler optimizer.
>
> The branch stack is a software abstraction which sits on top
> of the PMU hardware. As such, it is not available on all
> processors. For now, the patch provides the generic interface
> and the Intel X86 implementation where it leverages the Last
> Branch Record (LBR) feature (from Core2 to SandyBridge).
>
> Branch stack sampling is supported for both per-thread and
> system-wide modes.
>
> It is possible to filter the type and privilege level of branches
> to sample. The target of the branch is used to determine
> the privilege level.
>
> For each branch, the source and destination are captured. On
> some hardware platforms, it may be possible to also extract
> the target prediction and, in that case, it is also exposed
> to end users.
>
> The branch stack can record a variable number of taken
> branches per sample. Those branches are always consecutive
> in time. The number of branches captured depends on the
> filtering and the underlying hardware. On Intel Nehalem
> and later, up to 16 consecutive branches can be captured
> per sample.
>
> Branch sampling is always coupled with an event. It can
> be any PMU event but it can't be a SW or tracepoint event.
>
> Branch sampling is requested by setting a new sample_type
> flag called: PERF_SAMPLE_BRANCH_STACK.
>
> To support branch filtering, we introduce a new field
> to the perf_event_attr struct: branch_sample_type. We chose
> NOT to overload the config1, config2 field because those
> are related to the event encoding. Branch stack is a
> separate feature which is combined with the event.
>
> The branch_sample_type is a bitmask of possible filters.
> The following filters are defined (more can be added):
> - PERF_SAMPLE_BRANCH_ANY     : any control flow change
> - PERF_SAMPLE_BRANCH_USER    : capture branches when target is at user level
> - PERF_SAMPLE_BRANCH_KERNEL  : capture branches when target is at kernel level
> - PERF_SAMPLE_BRANCH_ANY_CALL: capture call branches (incl. syscalls)
> - PERF_SAMPLE_BRANCH_ANY_RET : capture return branches (incl. syscall returns)
> - PERF_SAMPLE_BRANCH_IND_CALL: capture indirect calls
>
> It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
>
> When the privilege level is not specified, the branch stack
> inherits that of the associated event.
>
> Some processors may not offer hardware branch filtering, e.g., Intel
> Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
> X86 implementation in this patchset also provides a SW branch filter
> which works on a best effort basis. It can compensate for the lack
> of LBR filtering. But first and foremost, it helps work around LBR
> filtering errata. The goal is to only capture the type of branches
> requested by the user.
>
> It is possible to combine branch stack sampling with PEBS on Intel
> X86 processors. Depending on the precise_sampling mode, there are
> certain filtering restrictions. When precise_sampling=1, then
> there are no filtering restrictions. When precise_sampling > 1,
> then only ANY|USER|KERNEL filter can be used. This comes from
> the fact that the kernel uses LBR to compensate for the PEBS
> off-by-1 skid on the instruction pointer.
>
> To demonstrate how the perf_event branch stack sampling interface
> works, the patchset also modifies perf record to capture taken
> branches. Similarly perf report is enhanced to display a histogram
> of taken branches.
>
> I would like to thank Roberto Vitillo @ LBL for his work on the perf
> tool for this.
>
> Enough talking, let's take a simple example. Our trivial test program
> goes like this:
>
> void f2(void)
> {}
> void f3(void)
> {}
> void f1(unsigned long n)
> {
>   if (n & 1UL)
>     f2();
>   else
>     f3();
> }
> int main(void)
> {
>   unsigned long i;
>
>   for (i = 0; i < N; i++)
>     f1(i);
>   return 0;
> }
>
> $ perf record -b any branchy
> $ perf report -b
> # Events: 23K cycles
> #
> # Overhead  Source Symbol     Target Symbol
> # ........  ................  ................
>
>     18.13%  [.] f1            [.] main
>     18.10%  [.] main          [.] main
>     18.01%  [.] main          [.] f1
>     15.69%  [.] f1            [.] f1
>      9.11%  [.] f3            [.] f1
>      6.78%  [.] f1            [.] f3
>      6.74%  [.] f1            [.] f2
>      6.71%  [.] f2            [.] f1
>
> Of the total number of branches captured, 18.13% were from f1() -> main().
>
> Let's make this clearer by filtering the user call branches only:
>
> $ perf record -b any_call -e cycles:u branchy
> $ perf report -b
> # Events: 19K cycles
> #
> # Overhead  Source Symbol              Target Symbol
> # ........  .........................  .........................
> #
>     52.50%  [.] main                   [.] f1
>     23.99%  [.] f1                     [.] f3
>     23.48%  [.] f1                     [.] f2
>      0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>      0.01%  [k] _start                 [k] __libc_start_main
>
> Now it is more obvious: 52% of all the captured branches were calls from main() -> f1().
> The rest is split roughly 50/50 between f1() -> f2() and f1() -> f3(), which is expected
> given that f1() dispatches on odd vs. even values of n, and n increases by one each iteration.
>
>
> Here is a kernel example, where we want to sample indirect calls:
> $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10
> $ perf report -b
> #
> # Overhead  Source Symbol               Target Symbol
> # ........  ..........................  ..........................
> #
>     36.36%  [k] __delay                 [k] delay_tsc
>      9.09%  [k] ktime_get               [k] read_tsc
>      9.09%  [k] getnstimeofday          [k] read_tsc
>      9.09%  [k] notifier_call_chain     [k] tick_notify
>      4.55%  [k] cpuidle_idle_call       [k] intel_idle
>      4.55%  [k] cpuidle_idle_call       [k] menu_reflect
>      2.27%  [k] handle_irq              [k] handle_edge_irq
>      2.27%  [k] ack_apic_edge           [k] native_apic_mem_write
>      2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt
>      2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn
>      2.27%  [k] enqueue_task            [k] enqueue_task_rt
>      2.27%  [k] try_to_wake_up          [k] select_task_rq_rt
>      2.27%  [k] do_timer                [k] read_tsc
>
> Due to HW limitations, branch filtering may be approximate on
> Core and Atom processors. It is more accurate on Nehalem and
> Westmere, and best on Sandy Bridge.
>
> In version 2, we updated the patch to tip/master (commit 5734857) and
> incorporated the feedback from v1 concerning the anonymous bitfield
> struct for branch_stack_entry and the handling of i386 ABI binaries
> on a 64-bit host in the instruction decoder for the LBR SW filter.
>
> In version 3, we updated to 3.2.0-tip. The Atom revision
> check has been put into its own patch. We fixed a browser
> issue with perf report. We fixed all the style issues as well.
>
> Signed-off-by: Stephane Eranian <eranian@xxxxxxxxxx>
> ---
>
> Roberto Agostino Vitillo (3):
>  perf: add code to support PERF_SAMPLE_BRANCH_STACK
>  perf: add support for sampling taken branch to perf record
>  perf: add support for taken branch sampling to perf report
>
> Stephane Eranian (10):
>  perf_events: add generic taken branch sampling support (v3)
>  perf_events: add Intel LBR MSR definitions
>  perf_events: add Intel X86 LBR sharing logic
>  perf_events: sync branch stack sampling with X86 precise_sampling
>  perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters
>  perf_events: disable LBR support for older Intel Atom processors
>  perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
>  perf_events: add LBR software filter support for Intel X86
>  perf_events: disable PERF_SAMPLE_BRANCH_* when not supported
>  perf_events: add hook to flush branch_stack on context switch
>
>  arch/alpha/kernel/perf_event.c             |    4 +
>  arch/arm/kernel/perf_event.c               |    4 +
>  arch/mips/kernel/perf_event_mipsxx.c       |    4 +
>  arch/powerpc/kernel/perf_event.c           |    4 +
>  arch/sh/kernel/perf_event.c                |    4 +
>  arch/sparc/kernel/perf_event.c             |    4 +
>  arch/x86/include/asm/msr-index.h           |    7 +
>  arch/x86/kernel/cpu/perf_event.c           |   47 +++-
>  arch/x86/kernel/cpu/perf_event.h           |   19 +
>  arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
>  arch/x86/kernel/cpu/perf_event_intel.c     |  120 +++++--
>  arch/x86/kernel/cpu/perf_event_intel_ds.c  |   22 +-
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |  525 ++++++++++++++++++++++++++--
>  include/linux/perf_event.h                 |   78 ++++-
>  kernel/events/core.c                       |  167 +++++++++
>  kernel/events/hw_breakpoint.c              |    6 +
>  tools/perf/Documentation/perf-record.txt   |   18 +
>  tools/perf/Documentation/perf-report.txt   |    7 +
>  tools/perf/builtin-record.c                |   69 ++++
>  tools/perf/builtin-report.c                |   95 +++++-
>  tools/perf/perf.h                          |   18 +
>  tools/perf/util/annotate.c                 |    2 +-
>  tools/perf/util/event.h                    |    1 +
>  tools/perf/util/evsel.c                    |   14 +
>  tools/perf/util/hist.c                     |   93 ++++-
>  tools/perf/util/hist.h                     |    7 +
>  tools/perf/util/session.c                  |   72 ++++
>  tools/perf/util/session.h                  |    4 +
>  tools/perf/util/sort.c                     |  361 ++++++++++++++-----
>  tools/perf/util/sort.h                     |    5 +
>  tools/perf/util/symbol.h                   |   13 +
>  31 files changed, 1601 insertions(+), 196 deletions(-)