Re: kexec on panic

From: Denys Fedoryshchenko
Date: Sat Feb 18 2017 - 03:16:34 EST


On 2017-02-18 09:42, Jon Masters wrote:
> Hi Denys,
>
> On 02/10/2017 03:14 AM, Denys Fedoryshchenko wrote:
>
>> After years of using kexec, and a recent unpleasant experience with modern hardware (supposedly blazing fast to boot) that needs 5-10 minutes just to pass POST,
>> one question came to mind:
>> Is there any way to execute a regular kexec on panic (not the special "panic" one that captures crash data) to reduce reboot time?
>
> Generally, you don't want to do this, because various platform hardware
> might be in non-quiescent states (still doing DMA to random memory, etc.)
> and other nastiness that means you don't want to do more than the minimal
> amount in a kexec on panic (crash). We've seen no end of fun and games
> even with just regular crash dumps while hardware is busily writing to
> memory that it shouldn't be. An IOMMU helps, but isn't a cure-all.
>
> Jon.

Well, I have to try. Sometimes I run into hardware that fails to boot even on a regular kexec, but at a small customer I have an HP server that needs almost 6 minutes to boot,
no hot spare (hard to arrange for many reasons: no spare 10G ports, the cost of hardware, etc.), and some nasty bugs that are not resolved yet. All of this is forcing me to look for a way to reduce reboot time.
If I can find a way to save the backtrace and reboot fast, it will help a lot in debugging kernels with minimal downtime when a bug is reproducible only on a live system.

What I did for now might be insanely wrong, but:
diff -Naur linux-4.9.9-vanilla/kernel/kexec_core.c linux-4.9.9/kernel/kexec_core.c
--- linux-4.9.9-vanilla/kernel/kexec_core.c 2017-02-09 07:08:40.000000000 +0000
+++ linux-4.9.9/kernel/kexec_core.c 2017-02-17 12:54:49.000000000 +0000
@@ -897,6 +897,10 @@
 			machine_crash_shutdown(&fixed_regs);
 			machine_kexec(kexec_crash_image);
 		}
+		if (kexec_image) {
+			machine_shutdown();
+			machine_kexec(kexec_image);
+		}
 		mutex_unlock(&kexec_mutex);
 	}
 }

Then

kexec -l /mnt/flash/kernel --append="intel_idle.max_cstate=0 processor.max_cstate=1"

and
echo c >/proc/sysrq-trigger
worked even on a busy network router, but I'm not sure it will behave the same on a real networking stack crash.
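
Before triggering a test panic, it may be worth confirming which images the kernel actually has loaded: with the patch above, the regular image is only reached when no crash image is loaded, because machine_kexec(kexec_crash_image) does not return. A minimal sketch, assuming the standard sysfs flags /sys/kernel/kexec_loaded and /sys/kernel/kexec_crash_loaded; the function name kexec_state and its optional directory argument are hypothetical and exist only so the check can be run against a fake directory:

```shell
#!/bin/sh
# kexec_state: print which kexec images are currently loaded.
# $1 is a hypothetical override for the sysfs directory (default /sys/kernel).
kexec_state() {
    dir="${1:-/sys/kernel}"
    # Each flag file holds 1 when the image is loaded, 0 otherwise;
    # a missing file (e.g. kexec not built in) is reported as 0.
    regular=$(cat "$dir/kexec_loaded" 2>/dev/null || echo 0)
    crash=$(cat "$dir/kexec_crash_loaded" 2>/dev/null || echo 0)
    echo "regular=$regular crash=$crash"
}
```

After the kexec -l command above, calling kexec_state with no argument should report regular=1. If crash=1 shows up as well, the crash kernel would take precedence on panic and the added branch would never run.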