Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

From: Milian Wolff
Date: Thu Nov 08 2018 - 07:42:02 EST


On Wednesday, 7 November 2018 23:41:31 CET Milian Wolff wrote:
> On Tuesday, 6 November 2018 21:24:11 CET Andi Kleen wrote:
> > > Where would I look for the source to change here? So far, I only
> > > concentrated on the userspace side of perf in tools/perf.
> >
> > Kind of similar to
> >
> > a405bad5ad20 perf/x86: Add Haswell specific transaction flag reporting
> > fdfbbd07e91f perf: Add generic transaction flags
> >
> > Report the original (not overwritten) regs->ip and regs->sp
>
> Thanks a lot Andi! With your help, I have managed to find the exact issue
> for my scenario. Turns out, it really is "just" the instruction pointer
> that is wrong. I.e. originally we have IP = 0x7feda32ca68c, but with PEBS
> we correct that to IP = 0x7feda32ca688. The SP register value stays the same
> according to my printk output. Using the original IP value, we can unwind
> correctly since we point to the correct place in the .eh_frame section. The
> PEBS IP points to a different position in the .eh_frame section, which is
> "too early".
>
> That brings up some questions:
>
> - I noticed `perf record --intr-regs`, but the values recorded in the
> perf.data file are always the same. I.e. comparing uregs and iregs, I always
> see the same values printed by `perf script`. This smells like a bug to me,
> but so far I haven't figured out why this happens...

The reason seems to be that perf_event_output only takes one set of registers,
which is then handed down into perf_prepare_sample, where it gets sampled.
Thus, if the sample type has both PERF_SAMPLE_REGS_USER and PERF_SAMPLE_REGS_INTR
set, both will by design store the same values for user-space samples.
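
To illustrate that flow, here is a deliberately simplified model. It is not
kernel code: `toy_regs`, `toy_sample`, `toy_prepare_sample()` and the
`TOY_SAMPLE_REGS_*` bits are hypothetical stand-ins, and only the case of a
user-space sample is modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the kernel's definitions. The two bits
 * model PERF_SAMPLE_REGS_USER and PERF_SAMPLE_REGS_INTR. */
#define TOY_SAMPLE_REGS_USER (1u << 0)
#define TOY_SAMPLE_REGS_INTR (1u << 1)

struct toy_regs {
	uint64_t ip;
	uint64_t sp;
};

struct toy_sample {
	struct toy_regs regs_user;
	struct toy_regs regs_intr;
};

/*
 * Models the current flow for a user-space sample: only a single
 * register set reaches the "prepare" step, so both requested copies
 * are filled from it and can never differ.
 */
static void toy_prepare_sample(struct toy_sample *data, unsigned int sample_type,
			       const struct toy_regs *regs)
{
	assert(data != NULL && regs != NULL);
	memset(data, 0, sizeof(*data));
	if (sample_type & TOY_SAMPLE_REGS_USER)
		data->regs_user = *regs;
	if (sample_type & TOY_SAMPLE_REGS_INTR)
		data->regs_intr = *regs;
}
```

Whatever register set is passed in (here it would be the PEBS-fixed one), the
user and intr copies come out identical, which matches what `perf script`
shows for uregs and iregs.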

Can we change this so that perf_event_output also takes a second set of
registers (iregs) that get sampled for PERF_SAMPLE_REGS_INTR? I'm very new to
real kernel development: what kind of ABI/API stability guarantees exist for
something like "perf_event_output"?
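
A sketch of what that proposal could look like, again in a toy model rather
than the real kernel signature: `toy_prepare_sample2()` and its `iregs`
parameter are hypothetical. The fixed-up register set keeps feeding the user
registers, while the untouched interrupt-time set is stored for the intr
registers, so both IPs would end up in perf.data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SAMPLE_REGS_USER (1u << 0) /* models PERF_SAMPLE_REGS_USER */
#define TOY_SAMPLE_REGS_INTR (1u << 1) /* models PERF_SAMPLE_REGS_INTR */

struct toy_regs {
	uint64_t ip;
	uint64_t sp;
};

struct toy_sample {
	struct toy_regs regs_user;
	struct toy_regs regs_intr;
};

/*
 * Hypothetical variant that receives two register sets: @regs (possibly
 * fixed up, e.g. with the PEBS IP) for the user registers and @iregs
 * (the untouched interrupt-time registers) for the intr registers.
 * Passing iregs == NULL falls back to the old single-set behaviour.
 */
static void toy_prepare_sample2(struct toy_sample *data, unsigned int sample_type,
				const struct toy_regs *regs,
				const struct toy_regs *iregs)
{
	assert(data != NULL && regs != NULL);
	memset(data, 0, sizeof(*data));
	if (sample_type & TOY_SAMPLE_REGS_USER)
		data->regs_user = *regs;
	if (sample_type & TOY_SAMPLE_REGS_INTR)
		data->regs_intr = iregs ? *iregs : *regs;
}
```

Accepting a NULL `iregs` is one way to keep existing callers, which only have
a single register set, unchanged.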

> - Independently, when I add a custom printk manually in `arch/x86/events/
> intel/ds.c` at the end of `setup_pebs_sample_data`, then I'm never seeing
> any differences between SP in iregs/pebs/regs. Shouldn't it also be
> recorded via PEBS? Or is it just chance that I'm never seeing any
> difference in setup_pebs_sample_data between iregs->sp and regs->sp?

The reason here seems to be that, for the `perf record --call-graph dwarf`
setup, the registers stored in "pebs" are essentially the same as iregs. The
difference is the availability of `pebs->real_ip`, which is used on my system
to fix up the IP. SP stays untouched and is thus only truly valid for the
untouched IP (which is currently discarded, see above).

> - Generally, how do we want to handle this bug? If `--intr-regs` actually
> recorded a different IP than the one stored in uregs in the perf.data file,
> then we could use that as a fallback for unwinding when the first attempt
> fails. Or should we always unwind from that IP? How do we mark the "actual"
> frame/IP then, if it differs?
>
> Thanks
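
One possible shape for the fallback idea from the last question, sketched
under the assumption that perf.data carries a distinct intr IP. Everything
here is hypothetical: `toy_unwind_with_fallback()`, the `toy_unwind_fn`
callback and the fake unwinder are not the unwinding API used in tools/perf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_regs {
	uint64_t ip;
	uint64_t sp;
};

/* Hypothetical unwinder interface: returns the number of frames recovered
 * starting from (ip, sp), or 0 if the unwind failed. */
typedef size_t (*toy_unwind_fn)(uint64_t ip, uint64_t sp, const void *ctx);

/*
 * Unwind from the user registers first; only if that yields no frames and
 * the intr registers carry a different IP, retry from the intr registers.
 * *used_intr tells the caller which set produced the stack, so it can still
 * report the user-register IP as the "actual" sampled location.
 */
static size_t toy_unwind_with_fallback(const struct toy_regs *uregs,
				       const struct toy_regs *iregs,
				       toy_unwind_fn unwind, const void *ctx,
				       bool *used_intr)
{
	size_t frames;

	assert(uregs != NULL && unwind != NULL && used_intr != NULL);
	*used_intr = false;
	frames = unwind(uregs->ip, uregs->sp, ctx);
	if (frames == 0 && iregs != NULL && iregs->ip != uregs->ip) {
		frames = unwind(iregs->ip, iregs->sp, ctx);
		*used_intr = frames > 0;
	}
	return frames;
}

/* Stand-in unwinder for demonstration only: succeeds with 3 frames
 * when @ip equals the address pointed to by @ctx, fails otherwise. */
static size_t toy_fake_unwind(uint64_t ip, uint64_t sp, const void *ctx)
{
	(void)sp;
	return ip == *(const uint64_t *)ctx ? 3 : 0;
}
```

Reporting which register set was used, rather than silently swapping the leaf
IP, is one way to keep the "actual" frame/IP distinguishable from the one used
for unwinding.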


--
Milian Wolff | milian.wolff@xxxxxxxx | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts
