Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

From: Andy Lutomirski
Date: Mon Nov 17 2014 - 19:55:55 EST


On Mon, Nov 17, 2014 at 4:22 PM, Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>> It could also be interesting to tweak mce_panic to not actually panic
>> the machine but to try to return and stop the test instead. Then real
>> debugging could be possible :)
>
> The lost cpu is *really* lost. Warm reset doesn't fix the machine, I usually
> have to do a full power cycle.

How is it even possible that I did that with a few lines of asm?

Could this be a hardware bug? Is there some condition that causes #MC
delivery to wedge hard enough that even INIT/RESET stops working? Or
possibly some CPU got stuck in SMM -- I have no idea what warm reset
does these days.

My initial attempts to test machine_check in KVM using IPIs are running
into problems, probably because I'm not acking the interrupt. I can
trigger it once, but after that it stops working.

Here's the patch to improve the timeout messages, but given the degree
of wedgedness, I can guess what it'll say:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/paranoid&id=e5cbd9d141bde651ecb20f0b65ad13bcef2468d0

--Andy

>
> -Tony



--
Andy Lutomirski
AMA Capital Management, LLC