x86 entry perf unwinding failure (missing IRET_REGS annotation on stack switch?)

From: Jann Horn
Date: Sun Mar 01 2020 - 01:02:47 EST


Hi!

With the latest kernel from Linus' tree (fb279f4e2386), on a box that
needs KPTI, as soon as I run "sudo <path>/perf top -g", I see this
warning (if I modify the unwinding code to use %px there instead of
%p):

[ 83.716065] WARNING: stack recursion on stack type 4
[ 83.716070] WARNING: can't dereference registers at
fffffe0000002190 for ip
swapgs_restore_regs_and_return_to_usermode+0x79/0x80

So apparently some ORC annotations are a bit off. I decided to write a
little helper script that can interleave "objtool orc dump" and
"objdump -d" output to make it easier for me to understand what's
going on; you can save it to tools/objtool and run it from there:

==============
#!/usr/bin/env python3
import subprocess
import sys
import re

objfile = sys.argv[1]

orc_dump = subprocess.run([sys.path[0]+'/objtool', 'orc', 'dump',
objfile], stdout=subprocess.PIPE).stdout.decode('utf-8')
orc_map = {}
for line in orc_dump.split('\n'):
parts = line.split(': ', 1)
if len(parts) != 2:
continue
orc_map[parts[0]] = parts[1]

objdump = subprocess.run(['objdump', '-d', objfile],
stdout=subprocess.PIPE).stdout.decode('utf-8')
cur_section = None
hex_re = re.compile('^[a-f0-9]+$')
for line in objdump.split('\n'):
if line.startswith('Disassembly of section '):
cur_section = line.split('of section ')[1].split(':')[0]
if line.find(':') != -1:
hex_prefix = line.split(':')[0].lstrip()
if hex_re.match(hex_prefix):
orc_val = orc_map.get(cur_section+'+'+hex_prefix)
if orc_val != None:
print('\033[92m#######' + orc_val + '\033[0m')
print(line)
==============

swapgs_restore_regs_and_return_to_usermode+0x79/0x80 is offset 0xaa8
in the object file. Here's the interleaved assembly and ORC unwind
info (with ORC info appearing above the line it refers to):

==============
0000000000000a00 <common_interrupt>:
#######sp:sp+8 bp:(und) type:iret end:0
a00: 48 83 04 24 80 addq $0xffffffffffffff80,(%rsp)
a05: e8 00 00 00 00 callq a0a <common_interrupt+0xa>
#######sp:(sp+0) bp:(und) type:regs end:0
a0a: e8 00 00 00 00 callq a0f <ret_from_intr>

0000000000000a0f <ret_from_intr>:
a0f: fa cli
a10: 5c pop %rsp
#######sp:sp+0 bp:(und) type:regs end:0
a11: 65 ff 0c 25 00 00 00 decl %gs:0x0
a18: 00
a19: f6 84 24 88 00 00 00 testb $0x3,0x88(%rsp)
a20: 03
a21: 0f 84 88 00 00 00 je aaf <retint_kernel>
a27: 48 89 e7 mov %rsp,%rdi
a2a: e8 00 00 00 00 callq a2f
<swapgs_restore_regs_and_return_to_usermode>

0000000000000a2f <swapgs_restore_regs_and_return_to_usermode>:
a2f: 41 5f pop %r15
#######sp:sp-8 bp:(und) type:regs end:0
a31: 41 5e pop %r14
#######sp:sp-16 bp:(und) type:regs end:0
a33: 41 5d pop %r13
#######sp:sp-24 bp:(und) type:regs end:0
a35: 41 5c pop %r12
#######sp:sp-32 bp:(und) type:regs end:0
a37: 5d pop %rbp
#######sp:sp-40 bp:(und) type:regs end:0
a38: 5b pop %rbx
#######sp:sp-48 bp:(und) type:regs end:0
a39: 41 5b pop %r11
#######sp:sp-56 bp:(und) type:regs end:0
a3b: 41 5a pop %r10
#######sp:sp-64 bp:(und) type:regs end:0
a3d: 41 59 pop %r9
#######sp:sp-72 bp:(und) type:regs end:0
a3f: 41 58 pop %r8
#######sp:sp-80 bp:(und) type:regs end:0
a41: 58 pop %rax
#######sp:sp-88 bp:(und) type:regs end:0
a42: 59 pop %rcx
#######sp:sp-96 bp:(und) type:regs end:0
a43: 5a pop %rdx
#######sp:sp-104 bp:(und) type:regs end:0
a44: 5e pop %rsi
#######sp:sp-112 bp:(und) type:regs end:0
a45: 48 89 e7 mov %rsp,%rdi
a48: 65 48 8b 24 25 00 00 mov %gs:0x0,%rsp
a4f: 00 00
a51: ff 77 30 pushq 0x30(%rdi)
#######sp:sp-104 bp:(und) type:regs end:0
a54: ff 77 28 pushq 0x28(%rdi)
#######sp:sp-96 bp:(und) type:regs end:0
a57: ff 77 20 pushq 0x20(%rdi)
#######sp:sp-88 bp:(und) type:regs end:0
a5a: ff 77 18 pushq 0x18(%rdi)
#######sp:sp-80 bp:(und) type:regs end:0
a5d: ff 77 10 pushq 0x10(%rdi)
#######sp:sp-72 bp:(und) type:regs end:0
a60: ff 37 pushq (%rdi)
#######sp:sp-64 bp:(und) type:regs end:0
a62: 50 push %rax
#######sp:sp-56 bp:(und) type:regs end:0
a63: eb 43 jmp aa8
<swapgs_restore_regs_and_return_to_usermode+0x79>
a65: 0f 20 df mov %cr3,%rdi
a68: eb 34 jmp a9e
<swapgs_restore_regs_and_return_to_usermode+0x6f>
a6a: 48 89 f8 mov %rdi,%rax
a6d: 48 81 e7 ff 07 00 00 and $0x7ff,%rdi
a74: 65 48 0f a3 3c 25 00 bt %rdi,%gs:0x0
a7b: 00 00 00
a7e: 73 0f jae a8f
<swapgs_restore_regs_and_return_to_usermode+0x60>
a80: 65 48 0f b3 3c 25 00 btr %rdi,%gs:0x0
a87: 00 00 00
a8a: 48 89 c7 mov %rax,%rdi
a8d: eb 08 jmp a97
<swapgs_restore_regs_and_return_to_usermode+0x68>
a8f: 48 89 c7 mov %rax,%rdi
a92: 48 0f ba ef 3f bts $0x3f,%rdi
a97: 48 81 cf 00 08 00 00 or $0x800,%rdi
a9e: 48 81 cf 00 10 00 00 or $0x1000,%rdi
aa5: 0f 22 df mov %rdi,%cr3
aa8: 58 pop %rax
#######sp:sp-64 bp:(und) type:regs end:0
aa9: 5f pop %rdi
#######sp:sp-72 bp:(und) type:regs end:0
aaa: 0f 01 f8 swapgs
aad: eb 41 jmp af0 <native_iret>
==============

It looks to me like things go wrong at the point where we switch over
to the trampoline stack? The ORC info claims that we have full user
registers on the trampoline stack (and that we're clobbering them with
our pushes - apparently objtool is not smart enough to realize that
that looks bogus), but at that point we should probably actually use
something like UNWIND_HINT_IRET_REGS, right?


By the way, looking through the rest of the entry stuff, there's some
other funny-looking stuff, too:

============
0000000000000f40 <general_protection>:
#######sp:sp+8 bp:(und) type:iret end:0
f40: 90 nop
#######sp:(und) bp:(und) type:call end:0
f41: 90 nop
f42: 90 nop
#######sp:sp+8 bp:(und) type:iret end:0
f43: e8 a8 01 00 00 callq 10f0 <error_entry>
#######sp:sp+0 bp:(und) type:regs end:0
f48: f6 84 24 88 00 00 00 testb $0x3,0x88(%rsp)
f4f: 03
f50: 74 00 je f52 <general_protection+0x12>
f52: 48 89 e7 mov %rsp,%rdi
f55: 48 8b 74 24 78 mov 0x78(%rsp),%rsi
f5a: 48 c7 44 24 78 ff ff movq $0xffffffffffffffff,0x78(%rsp)
f61: ff ff
f63: e8 00 00 00 00 callq f68 <general_protection+0x28>
f68: e9 73 02 00 00 jmpq 11e0 <error_exit>
#######sp:(und) bp:(und) type:call end:0
f6d: 0f 1f 00 nopl (%rax)
============

So I think on machines without X86_FEATURE_SMAP, trying to unwind from
the two NOPs at f41 and f42 will cause the unwinder to report an
error? Looking at unwind_next_frame(), "sp:(und)" without the "end:1"
marker seems to be reserved for errors.