Re: [PATCH 1/3] x86/entry/64: Refactor IRQ stacks and make them NMI-safe

From: Andy Lutomirski
Date: Fri Jul 24 2015 - 14:03:19 EST


On Fri, Jul 24, 2015 at 3:25 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
> On Thu, Jul 23, 2015 at 11:08:39PM -0700, Andy Lutomirski wrote:
>> To be obviously safe against any local exception, we want a single
>> instruction that will change %rsp and some in-memory flag at the same
>> time. There aren't a whole lot of candidates. Cmpxchg isn't useful
>> (cmpxchg with a memory operand doesn't modify its register operand).
>
> Why would you even need that?
>
> You do LOCK; CMPXCHG on a per_cpu variable and then test ZF? I.e., use
> it as a mutex in asm. With ZF=1, you switch stacks, with ZF=0, you
> busy-wait ...
>
> Or am I missing something?

I'm not worried about other CPUs at all, so the LOCK isn't needed.
I'm worried about an interrupt coming while the in-memory state says
we're on the IRQ stack but RSP isn't pointing at the IRQ stack or vice
versa.

This isn't possible in current kernels because nothing ever switches
to the IRQ stack except IRQs, and those don't happen with IF = 0
(unless Xen does more awful things than I realize), but I want to use
IRQ stacks for int3, and that can happen inside NMI.

An alternative solution would be to never switch to the IRQ stack if
RSP points to the IRQ stack already or if we're already on an IST
stack, but that seems full of corner cases.
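
For concreteness, that alternative boils down to a range check on RSP
before switching. Below is a rough user-space sketch in C, not kernel
code: struct stack_range and both function names are made up for
illustration, and it assumes every stack is a simple (lo, hi] range that
grows down. The corner cases I'm worried about are exactly the ones a
check like this doesn't capture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one stack; stacks grow down from hi. */
struct stack_range {
	uintptr_t lo;
	uintptr_t hi;
};

bool rsp_in_range(uintptr_t rsp, struct stack_range r)
{
	return rsp > r.lo && rsp <= r.hi;
}

/*
 * Switch only if RSP is neither on the IRQ stack nor on any IST stack.
 * The list of IST stacks is supplied by the caller (an assumption of
 * this model, not how the kernel tracks them).
 */
bool should_switch_to_irq_stack(uintptr_t rsp, struct stack_range irq,
				const struct stack_range *ist, size_t n_ist)
{
	size_t i;

	assert(ist != NULL || n_ist == 0);

	if (rsp_in_range(rsp, irq))
		return false;
	for (i = 0; i < n_ist; i++) {
		if (rsp_in_range(rsp, ist[i]))
			return false;
	}
	return true;
}
```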

But wait. Maybe a really simple approach is fine: first increment
irq_count and then switch RSP. That leaves a window with irq_count
marking the IRQ stack as being in use but RSP not pointing to the IRQ
stack.
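
To make the window concrete, here is a minimal user-space model in C of
the two-step sequence. It is only a sketch: apart from irq_count, every
name (struct cpu_model, window_hook, enter_irq_stack_simple, and so on)
is hypothetical, and the "0 means free" convention for irq_count is an
assumption of the model rather than a statement about the kernel. The
hook stands in for an exception arriving between the two steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model only. Apart from irq_count, every name is hypothetical. */
struct cpu_model {
	int irq_count;     /* 0 = IRQ stack free (model convention, an assumption) */
	uintptr_t rsp;     /* modeled stack pointer */
	uintptr_t irq_lo;  /* lowest address of the modeled IRQ stack */
	uintptr_t irq_hi;  /* top of the modeled IRQ stack; stacks grow down */
};

typedef void (*window_hook)(struct cpu_model *cpu);

bool rsp_on_irq_stack(const struct cpu_model *cpu)
{
	return cpu->rsp > cpu->irq_lo && cpu->rsp <= cpu->irq_hi;
}

/* True when the in-memory flag and RSP agree about the IRQ stack. */
bool state_consistent(const struct cpu_model *cpu)
{
	return (cpu->irq_count > 0) == rsp_on_irq_stack(cpu);
}

/* Result of the most recent record_window_state() call. */
bool seen_consistent_in_window = true;

void record_window_state(struct cpu_model *cpu)
{
	seen_consistent_in_window = state_consistent(cpu);
}

/* Returns the old RSP so a caller could restore it on exit. */
uintptr_t enter_irq_stack_simple(struct cpu_model *cpu, window_hook hook)
{
	uintptr_t old_rsp;

	assert(cpu != NULL);
	old_rsp = cpu->rsp;

	cpu->irq_count++;               /* step 1: mark the IRQ stack in use */
	if (hook)
		hook(cpu);              /* an exception could land here */
	if (cpu->irq_count == 1)
		cpu->rsp = cpu->irq_hi; /* step 2: switch RSP (outermost entry only) */
	return old_rsp;
}
```

Calling enter_irq_stack_simple() with record_window_state as the hook
shows the flag and RSP disagreeing inside the window and agreeing again
once step 2 has run.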

What can go wrong? We can get an NMI, an MCE, a breakpoint, or a
vmalloc fault. An NMI is fine. It will switch to the NMI stack. The
IRQ stack is marked as in use, but that doesn't matter -- the NMI
stack has plenty of space. An MCE is fine. It will switch to the IST
stack. It might return using RET, which will push a single extra
word to the kernel stack, but that won't overrun the stack
(unless there's a never-ending stream of MCEs, but we're already
terminally screwed if that happens). A breakpoint will *not* switch
to an IST stack because we're going to get rid of the debug stack, and
it will fail to switch to the IRQ stack, so we need to limit the stack
depth of do_debug. Maybe that's okay. A breakpoint with NMI or MCE
inside is still fine.
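
The case analysis above can be written down as a small decision table.
This is a sketch in C of the reasoning only, assuming the debug stack
has been removed as proposed; the enum and function names are invented
for illustration. The vmalloc fault case isn't spelled out above, so the
sketch leaves it out rather than guess.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the events and stacks discussed above. */
enum event { EV_NMI, EV_MCE, EV_BREAKPOINT };
enum stack { STACK_CURRENT, STACK_IRQ, STACK_NMI, STACK_MCE_IST };

/*
 * Which stack the handler runs on, given whether irq_count already
 * marks the IRQ stack as in use (as it does inside the window).
 */
enum stack stack_for_event(enum event ev, bool irq_stack_in_use)
{
	switch (ev) {
	case EV_NMI:
		return STACK_NMI;      /* always has its own stack */
	case EV_MCE:
		return STACK_MCE_IST;  /* always has its own IST stack */
	case EV_BREAKPOINT:
		/* No debug IST stack: try the IRQ stack, else stay put. */
		return irq_stack_in_use ? STACK_CURRENT : STACK_IRQ;
	}
	assert(!"unknown event");
	return STACK_CURRENT;
}
```

The only row that depends on irq_stack_in_use is the breakpoint, which
is why do_debug's stack depth is the thing to watch.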

So really the only difference between this simple approach (which is
more or less what we do now) and my fancy approach is that a kernel
instruction breakpoint will cause do_debug to run on the initial stack
instead of the IRQ stack.

I'm still tempted to say we should use my overly paranoid atomic
approach for now and optimize later, but I'm fine with spinning a v3,
too.

--Andy