[BUG?] perf: dwarf unwind doesn't work correctly on aarch64

From: Masami Hiramatsu
Date: Thu Feb 23 2017 - 02:51:21 EST


Hello,

perf record -g dwarf (and perf report) doesn't show correct callchain
on aarch64. Here is how to reproduce it.

1) I've prepared an debian8 aarch64 VM on qemu-system-aarch64, and
build/install the latest perftools on it.

2) Build attached program as below
# gcc -O0 -ggdb3 -funwind-tables -o main main.c

3) Run perf to record with dwarf.
# perf record -g --call-graph dwarf,1024 -e cpu-clock:u -o /tmp/perf.data -- ./main
^C[ perf record: Woken up 35 times to write data ]
[ perf record: Captured and wrote 8.526 MB /tmp/perf.data (6495 samples) ]


4) Run perf to report the result.

# perf report -i /tmp/perf.data
# To display the perf.data header info, please use --header/--header-only option
#
#
# Total Lost Samples: 0
#
# Samples: 6K of event 'cpu-clock:u'
# Event count (approx.): 1623750000
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ............. ..........................
#
17.21% 17.21% main main [.] func2
|
---func2

17.09% 17.09% main main [.] func1
|
---func1

16.67% 16.67% main main [.] main
|
---main
.....

So, as you can see, the call graph reported each function has been
called from itself. If I report it with fp as below, perf reported
correct callgraph.

3') record it with fp
# perf record -g --call-graph fp -e cpu-clock:u -o /tmp/perf.data -- ./main
^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.397 MB /tmp/perf.data (4160 samples) ]

4') report it

# perf report -i /tmp/perf.data # To display the perf.data header info, please use --header/--header-only option
#
#
# Total Lost Samples: 0
#
# Samples: 4K of event 'cpu-clock:u'
# Event count (approx.): 1040000000
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ............. ...........................
#
99.06% 0.00% main libc-2.19.so [.] __libc_start_main
|
---__libc_start_main
|
--98.94%--main
|
|--80.24%--func0
| |
| |--63.27%--func1
| | |
| | |--47.04%--func2
| | | |
.....


I tried to dump but it seems correct.

# perf report -D -i /tmp/perf.data
[...]
. 0130: c8 05 40 00 00 00 00 00 30 e1 a8 df ff ff 00 00 ..@.....0.......
. 0140: 90 05 40 00 00 00 00 00 [00 04 00 00 00 00 00 00](*) ..@.............
. 0150: 50 e1 a8 df ff ff 00 00 e8 05 40 00 00 00 00 00 P.........@.....
. 0160: 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00 00 ................
. 0170: 70 e1 a8 df ff ff 00 00 08 06 40 00 00 00 00 00 p.........@.....
. 0180: 8c b6 c9 8f ff ff 00 00 00 00 00 00 04 00 00 00 ................
. 0190: 90 e1 a8 df ff ff 00 00 28 06 40 00 00 00 00 00 ........(.@.....
. 01a0: c0 e1 a8 df ff ff 00 00 2c 1d b6 8f 02 00 00 00 ........,.......
. 01b0: b0 e1 a8 df ff ff 00 00 40 06 40 00 00 00 00 00 ........@.@.....
. 01c0: c0 e1 a8 df ff ff 00 00 48 1d b6 8f 01 00 00 00 ........H......
[...]
1207680984048 0xf040 [0x560]: PERF_RECORD_SAMPLE(IP, 0x2): 114/114: 0x400590 per
... FP chain: nr:0
... user regs: mask 0x1ffffffff ABI 64-bit
[...]
.... x29 0xffffdfa8e130
.... lr 0x4005c8
.... sp 0xffffdfa8e130
.... pc 0x400590
... ustack: size 1024, offset 0x148
. data_src: 0x5080021
... thread: main:114

In this entry, ustack should start from offset=0x0148 in event raw data,
I marked it with (*), which is the saved stack size(0x400 = 1024) and
the top of stack has 0xffffdfa8e150 which seems next frame pointer, and
after that there is 0x4005e8, which is next return address.

00000000004005b0 <func3>:
4005b0: a9be7bfd stp x29, x30, [sp,#-32]!
4005b4: 910003fd mov x29, sp
4005b8: b9001fa0 str w0, [x29,#28]
4005bc: b9401fa0 ldr w0, [x29,#28]
4005c0: 11002000 add w0, w0, #0x8
4005c4: 97fffff3 bl 400590 <func4>
4005c8: a8c27bfd ldp x29, x30, [sp],#32
4005cc: d65f03c0 ret

00000000004005d0 <func2>:
4005d0: a9be7bfd stp x29, x30, [sp,#-32]!
4005d4: 910003fd mov x29, sp
4005d8: b9001fa0 str w0, [x29,#28]
4005dc: b9401fa0 ldr w0, [x29,#28]
4005e0: 11001000 add w0, w0, #0x4
4005e4: 97fffff3 bl 4005b0 <func3>
4005e8: a8c27bfd ldp x29, x30, [sp],#32

So, the stack data should be correct.

I guess there is a bug in libunwind on aarch64 or we missed to pass
the stack data to libunwind. (BTW, it works correctly on arm32)

Thank you,

--
Masami Hiramatsu <mhiramat@xxxxxxxxxx>
int buf = 0;

void func(int a)
{
buf += a;
}
void func4(int a)
{
func (a+16);
}
void func3(int a)
{
func4(a+8);
}
void func2(int a)
{
func3(a+4);
}
void func1(int a)
{
func2(a+2);
}
void func0(int a)
{
func1(a+1);
}
void main()
{
do {
func0(1);
} while (buf);
}