[RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry

From: Steven Rostedt
Date: Wed Mar 20 2019 - 22:15:40 EST


From: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>

He Zhe reported a crash by enabling trace events and selecting
"userstacktrace" which will read the stack of userspace for every trace
event recorded. Zhe narrowed it down to:

c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")

With the problem config, I was able to also reproduce the error. I
narrowed it down to just having to do the following:

# cd /sys/kernel/tracing
# echo 1 > options/userstacktrace
# echo 1 > events/preemptirq/irq_disable/enable

And sure enough, I triggered a crash. Well, it was systemd crashing
with a bad memory access??

systemd-journal[537]: segfault at ed8cb8 ip 00007f7fffc9fef5 sp 00007ffc4062cb10 error 7

And it would crash similarly each time I tried it, but always at a
different place. After spending the day on this, I finally figured it
out. The bug is happening in entry_64.S right after error_entry.
There's two TRACE_IRQS_OFF in that code path, which if I comment out,
the bug goes away. Then it dawned on me that the crash always happens
when systemd does a normal page fault. We had this bug before, and it
was with the exception trace points.

The issue is that a tracepoint can fault (reading vmalloc or whatever).
And doing a userspace stack trace most definitely will fault. But if we
are coming from a legitimate page fault, the address of that fault (in
the CR2 register) will be lost if we fault before we get to the page
fault handler. That's exactly what is happening.

To solve this, a TRACE_IRQS_OFF_CR2 (and ON for consistency) was added
that saves the CR2 register. A new trace_hardirqs_off_thunk_cr2 is
created that stores the cr2 register, calls the
trace_hardirqs_off_caller, then on return restores the cr2 register if
it changed, before returning.

On my tests this fixes the issue. I just want to know if this is a
legitimate fix or if someone can come up with a better fix?

Note: this also saves the exception context just like the
do_page_fault() function does.

Note2: This only gets enabled when lockdep or irq tracing is enabled,
which is not recommended for production environments.

Link: http://lkml.kernel.org/r/897cf5cf-fc24-8a64-cb28-847f2d2e63d2@xxxxxxxxxxxxx

Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Signed-off-by: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>
---
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7bc105f47d21..7edffec328e2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -292,6 +292,32 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)

syscall_return_slowpath(regs);
}
+
+extern void trace_hardirqs_on_caller(unsigned long caller_addr);
+__visible void trace_hardirqs_on_caller_cr2(unsigned long caller_addr)
+{
+ unsigned long address = read_cr2(); /* Get the faulting address */
+ enum ctx_state prev_state;
+
+ prev_state = exception_enter();
+ trace_hardirqs_on_caller(caller_addr);
+ if (address != read_cr2())
+ write_cr2(address);
+ exception_exit(prev_state);
+}
+
+extern void trace_hardirqs_off_caller(unsigned long caller_addr);
+__visible void trace_hardirqs_off_caller_cr2(unsigned long caller_addr)
+{
+ unsigned long address = read_cr2(); /* Get the faulting address */
+ enum ctx_state prev_state;
+
+ prev_state = exception_enter();
+ trace_hardirqs_off_caller(caller_addr);
+ if (address != read_cr2())
+ write_cr2(address);
+ exception_exit(prev_state);
+}
#endif

#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..73ddf24fed3e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1248,12 +1248,12 @@ ENTRY(error_entry)
* we fix gsbase, and we should do it before enter_from_user_mode
* (which can take locks).
*/
- TRACE_IRQS_OFF
+ TRACE_IRQS_OFF_CR2
CALL_enter_from_user_mode
ret

.Lerror_entry_done:
- TRACE_IRQS_OFF
+ TRACE_IRQS_OFF_CR2
ret

/*
diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
index be36bf4e0957..1300b53b98cb 100644
--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -41,6 +41,8 @@
#ifdef CONFIG_TRACE_IRQFLAGS
THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller,1
THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
+ THUNK trace_hardirqs_on_thunk_cr2,trace_hardirqs_on_caller_cr2,1
+ THUNK trace_hardirqs_off_thunk_cr2,trace_hardirqs_off_caller_cr2,1
#endif

#ifdef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 058e40fed167..dd511742ca2e 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -172,9 +172,13 @@ static inline int arch_irqs_disabled(void)
#ifdef CONFIG_TRACE_IRQFLAGS
# define TRACE_IRQS_ON call trace_hardirqs_on_thunk;
# define TRACE_IRQS_OFF call trace_hardirqs_off_thunk;
+# define TRACE_IRQS_ON_CR2 call trace_hardirqs_on_thunk_cr2;
+# define TRACE_IRQS_OFF_CR2 call trace_hardirqs_off_thunk_cr2;
#else
# define TRACE_IRQS_ON
# define TRACE_IRQS_OFF
+# define TRACE_IRQS_ON_CR2
+# define TRACE_IRQS_OFF_CR2
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
# ifdef CONFIG_X86_64