Re: [RFC 0/9] mce recovery for Sandy Bridge server

From: Tony Luck
Date: Tue May 24 2011 - 17:48:37 EST


On Tue, May 24, 2011 at 2:30 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, May 24, 2011 at 2:24 PM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
>>
>> Right, so you can't do things like that from NMI context, but what perf
>> can do is raise a self-IPI and continue from IRQ context (question for
>> the HW folks, can there be cycles between the NMI iret and IRQ assert
>> from whatever context was before the NMI hit?)
>
> Of course there can be - the code where the NMI hit may have
> interrupts disabled.

But the case when I'd want to do the "stop this task" thing is when I
think that I can recover - for memory errors detected while in kernel
code I expect this will only ever be a few special cases:
1) copy to/from user
2) copy page (for copy-on-write fault)
3) ...
and in these cases we don't have interrupts disabled. In fact I have
difficulty imagining a scenario where the kernel trips over a memory
error in interrupt-disabled code that would ever be recoverable.

So my NMI handler can look at the saved pt_regs to see whether
it blasted its way into some interrupt-disabled code and call that
fatal. If it came in while interrupts were enabled, then it could use
Peter's self-IPI thingy.
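
A rough userspace sketch of that check, to make the idea concrete. This is
not kernel code: struct sketch_regs, the mce_action enum and both function
names are made up for illustration, and the struct only stands in for the
saved flags that the real pt_regs carries. The one hard fact assumed is that
the x86 interrupt-enable flag is bit 9 (0x200) of EFLAGS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* x86 EFLAGS interrupt-enable flag (IF) is bit 9. */
#define SKETCH_EFLAGS_IF 0x200UL

/* Hypothetical stand-in for pt_regs: only the saved flags matter here. */
struct sketch_regs {
	unsigned long flags;
};

/* Hypothetical outcomes for a memory error hit in kernel code. */
enum mce_action {
	MCE_ACTION_FATAL,        /* interrupted code had interrupts disabled */
	MCE_ACTION_DEFER_TO_IRQ, /* raise a self-IPI, finish in IRQ context */
};

/* True if the interrupted context was running with interrupts enabled. */
static bool interrupted_irqs_enabled(const struct sketch_regs *regs)
{
	assert(regs != NULL);
	return (regs->flags & SKETCH_EFLAGS_IF) != 0;
}

/* Decide what to do based only on the saved interrupt-enable flag. */
static enum mce_action classify_kernel_mce(const struct sketch_regs *regs)
{
	if (!interrupted_irqs_enabled(regs))
		return MCE_ACTION_FATAL;
	return MCE_ACTION_DEFER_TO_IRQ;
}
```

With saved flags of 0x202 (IF set) this picks the deferred path; with 0x002
(IF clear) it picks fatal. Whether the error is actually one of the
recoverable cases still has to be decided separately.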

-Tony