[PATCH] x86: Have debug/nmi restore the IDT it replaced

From: Steven Rostedt
Date: Thu May 23 2013 - 21:53:31 EST

I've sent this out before, and I'm sending it out again.


It was ignored because I signed it with "Not-yet-signed-off-by". This
time I'm signing it off with a real full signed off by, as this will
help with Seiji's patch set to do the IDT swap to enable tracing of
interrupt vectors.

I've been playing with Seiji's patches:


Which add tracing to the x86 interrupt vectors. But to keep tracing
overhead down to zero when tracing is inactive, it uses a swap of the
IDT to replace the irq vectors with versions that have tracepoints in
them, when tracing is enabled. Even though tracepoints have "nops", we
do this to keep even that overhead eliminated.

These seemed to work, until I ran the following command with trace-cmd:

trace-cmd start -p function_graph -g "smp_trace_apic_timer_interrupt" \
-g "smp_apic_timer_interrupt" -e irq_vectors

What trace-cmd does above, is not only enables the irq_vectors, but also
produces the call graphs of the two interrupt vector functions:
smp_trace_apic_timer_interrupt and smp_apic_timer_interrupt

The result was rather surprising. It showed only
smp_apic_timer_interrupt being called, and not the trace version.
Obviously, the tracepoints for the vectors were also not there.

When starting without the function graph tracer, it worked fine, but I
wanted to see the trace versions being called to be sure, which required
one of the function tracers.

Investigating, I found the problem. It's with the NMI and breakpoint IDT
handling. I wont explain it too much here, but see:

commit f8988175f "x86: Allow nesting of the debug stack IDT setting"
commit 42181186a "x86: Add counter when debug stack is used with
interrupts enabled"
commit 228bdaa95 "x86: Keep current stack in NMI breakpoints"

The basic gist of the above commits is that on NMI or DEBUG trap
entering, it needs to modify the IDT so that the stack pointer doesn't
get reset to the top of the stack again. On boot up, two IDT tables are
created. One for normal operations and one for this NMI/DEBUG case.

The issue is that it just toggles between the two IDT tables. But if
this happens when Seiji's swap had already happened, it replaces the
trace IDT table back to the normal table, and tracing stops, which sorta
defeats the purpose.

To solve this, I've added a 'hold_idt_descr' per cpu variable that
stores the IDT that was loaded and will use that to replace it. If the
DEBUG/NMI came in when the IDT was normal, it would replace it back with
the normal IDT, and if it came in when it was the trace IDT, it would
put back the trace IDT.

I've run a few tests so far on this code, but I need to run more
stressful ones. Meanwhile, until I find any bugs, I'm posting this patch
for your enjoyment. I think I got all the cases, as NMIs causes the
store/restore functions to be re-entrent without any locks.

Signed-off-by: Steven Rostedt <rostedt@xxxxxxxxxxx>

Index: linux-trace.git/arch/x86/kernel/cpu/common.c
--- linux-trace.git.orig/arch/x86/kernel/cpu/common.c
+++ linux-trace.git/arch/x86/kernel/cpu/common.c
@@ -1148,10 +1148,54 @@ int is_debug_stack(unsigned long addr)

static DEFINE_PER_CPU(u32, debug_stack_use_ctr);
+static DEFINE_PER_CPU(struct desc_ptr, hold_idt_descr);
+static DEFINE_PER_CPU(struct desc_ptr, copy_idt_descr);

+ * Debug traps and NMIs will swap the IDT to have the debug
+ * trap not modify the stack (nmi_idt_descr). But as both the
+ * debug and NMIs share this, they need to be re-entrant. A debug
+ * trap could be doing the swap and after it incs the debug_stack_use_ctr,
+ * an NMI could come in. It too would need to do the swap, and it would
+ * need to swap every time.
+ *
+ * But, the IDT can be changed by other users, and this must restore
+ * that as well.
+ *
+ * Luckily, as interrupts are disabled from the set_zero to reset,
+ * we do not need to worry about the IDT changing underneath us
+ * from other users.
+ */
void debug_stack_set_zero(void)
+ /*
+ * Writing to the IDT is atomic. If an NMI were to come
+ * in after we did the compare but before the store_idt(),
+ * it too would see the address == 0 and do the store itself.
+ */
+ if (this_cpu_read(hold_idt_descr.address) == 0)
+ store_idt(this_cpu_ptr(&hold_idt_descr));
+ /*
+ * If an NMI were to come in now, it would not set hold_idt_descr,
+ * but on exit, it would restore the IDT because the counter is
+ * still zero here. Then it would set the .address back to zero too.
+ */
+ /*
+ * NMI's coming in now are not an issue as we set the .address
+ * and also incremented the ctr. But, if an NMI came in before
+ * the counter was incremented, and after the .address was set,
+ * the NMI would set the .address back to zero, and would have
+ * restored the IDT. We need to check if that happened.
+ * If it did, then the .address would be zero here again.
+ */
+ if (unlikely(this_cpu_read(hold_idt_descr.address) == 0))
+ store_idt(this_cpu_ptr(&hold_idt_descr));
+ /* On enter, we always want to use the nmi_idt_descr */
load_idt((const struct desc_ptr *)&nmi_idt_descr);

@@ -1159,8 +1203,39 @@ void debug_stack_reset(void)
if (WARN_ON(!this_cpu_read(debug_stack_use_ctr)))
- if (this_cpu_dec_return(debug_stack_use_ctr) == 0)
- load_idt((const struct desc_ptr *)&idt_descr);
+ if (WARN_ON(!this_cpu_read(hold_idt_descr.address)))
+ return;
+ /*
+ * This is the tricky part. We need to restore the old
+ * IDT to what it was before we entered, but an NMI could
+ * come in at any point, and do the same.
+ *
+ * If the count is 1, then we are the first to enter and
+ * we need to update the IDT. But, we must do that after
+ * we decrement the counter, in which case, if an NMI
+ * comes in, it too will see the 1. To get around this,
+ * we update a copy first.
+ *
+ * The copy will always contain what we want to load the
+ * IDT with.
+ */
+ if (this_cpu_read(debug_stack_use_ctr) == 1)
+ *this_cpu_ptr(&copy_idt_descr) = *this_cpu_ptr(&hold_idt_descr);
+ if (this_cpu_dec_return(debug_stack_use_ctr) == 0) {
+ /*
+ * We use the copy to save to the IDT, as it will always contain
+ * what we want to restore the IDT to.
+ */
+ load_idt(this_cpu_ptr(&copy_idt_descr));
+ /*
+ * Now clear out the hold_idt_descr.address, to let all new
+ * users restore it from the IDT.
+ */
+ this_cpu_write(hold_idt_descr.address, 0);
+ }

#else /* CONFIG_X86_64 */

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/