Re: frequent lockups in 3.18rc4

From: Vivek Goyal
Date: Wed Nov 19 2014 - 11:28:21 EST

Next message: Lee Jones: "Re: [PATCH v3 2/2] mfd: dln2: add support for USB-SPI module"
Previous message: Mika Westerberg: "Re: [PATCH] ACPI: Add _DEP(Operation Region Dependencies) support to fix battery issue on the Asus T100TA"
In reply to: Dave Jones: "Re: frequent lockups in 3.18rc4"
Next in thread: Dave Jones: "Re: frequent lockups in 3.18rc4"

On Wed, Nov 19, 2014 at 10:38:52AM -0500, Dave Jones wrote:
> On Wed, Nov 19, 2014 at 10:03:33AM -0500, Vivek Goyal wrote:
>
> > Not being able to capture the dump I can understand but having wedged
> > the machine so that it does not reboot after dump failure sounds bad.
> > So you could not get machine to boot even after a power cycle? Would
> > you remember what was failing. I am curious to know what did kdump do
> > to make machine unbootable.
>
> Power cycling was fine, because then it booted into the non-kdump kernel.
> The issue was when I caused that kernel to panic, it would just sit there
> wedged, with no indication it even tried to switch to the kdump kernel.

I have seen cases where we fail to boot into the second kernel, and often
the failure happens very early, without any information on the graphics
console. I always have to hook up a serial console to get an idea of what
went wrong that early. It is not an ideal situation, but at the same time
I don't know how to improve it.

I am wondering whether in some cases we panic in the second kernel and
sit there. Probably we should automatically append a kernel command line
option, say "panic=1", so that it reboots itself if the second kernel
panics.
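
A minimal sketch of that idea, not what kdumpctl does today:
add_panic_param is a hypothetical helper that appends "panic=<seconds>" to
a command line string only when no panic= option is already there. On
Fedora the extra kdump kernel options normally live in
KDUMP_COMMANDLINE_APPEND in /etc/sysconfig/kdump; treat that wiring as an
assumption about your setup.

```shell
# Append panic=<seconds> to a kernel command line unless a panic= option
# is already present. add_panic_param is a hypothetical helper name.
add_panic_param() {
    cmdline=$1
    timeout=${2:-1}
    case " $cmdline " in
        # An explicit panic= setting wins; leave the line untouched.
        *" panic="*) printf '%s\n' "$cmdline" ;;
        # Otherwise append it, avoiding a leading space on an empty line.
        *) printf '%s\n' "${cmdline:+$cmdline }panic=$timeout" ;;
    esac
}

# Example options only; substitute the ones your kdump config really uses.
add_panic_param "irqpoll nr_cpus=1 reset_devices"
```

With panic=1 the second kernel reboots one second after a panic instead of
sitting there wedged.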

By any chance, have you enabled "CONFIG_RANDOMIZE_BASE"? If yes, please
disable it, as kexec/kdump currently does not work with it. The second
kernel hangs very early in the boot process, and I had to hook up a
serial console to get the following message on the console.

arch/x86/boot/compressed/misc.c
error("32-bit relocation outside of kernel!\n");

I noticed that error() halts in a while loop after printing the error
message. Maybe there is some way for it to try to reboot instead of
halting in the while loop.
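
For a self-built kernel, a quick way to check for CONFIG_RANDOMIZE_BASE is
to grep the config it was built from. This is only a sketch: kaslr_enabled
is a hypothetical helper, and the default path /boot/config-$(uname -r) is
an assumption (pass the .config from your build tree instead if that is
where it lives).

```shell
# Succeed if the given kernel config file enables CONFIG_RANDOMIZE_BASE.
# kaslr_enabled is a hypothetical helper name.
kaslr_enabled() {
    grep -q '^CONFIG_RANDOMIZE_BASE=y' "$1"
}

# Assumed default location; override with the first script argument.
config=${1:-/boot/config-$(uname -r)}
if [ -r "$config" ] && kaslr_enabled "$config"; then
    echo "CONFIG_RANDOMIZE_BASE is enabled in $config; disable it for kdump"
else
    echo "CONFIG_RANDOMIZE_BASE not found enabled in $config"
fi
```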

>
> > > > Unless there's some magic step missing from the documentation at
> > > > http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
> > > > then I'm not optimistic it'll be useful.
> >
> > I had a quick look at it and it basically looks fine. In fedora ideally
> > it is just two steps process.
> >
> > - Reserve memory using crashkernel. Say crashkernel=160M
> > - systemctl start kdump
> > - Crash the system or wait for it to crash.
> >
> > So despite your bad experience in the past, I would encourage you to
> > give it a try.
>
> 'the past' here, is two weeks ago, on Fedora 21.
>
> But, since then, I've reinstalled that box with Fedora 20 because I didn't
> trust gcc 4.9, and on f20 things are actually even worse.
>
> Right now it doesn't even create the image correctly:
>
> dracut: *** Stripping files done ***
> dracut: *** Store current command line parameters ***
> dracut: *** Creating image file ***
> dracut: *** Creating image file done ***
> kdumpctl: cat: write error: Broken pipe
> kdumpctl: kexec: failed to load kdump kernel
> kdumpctl: Starting kdump: [FAILED]

Hmmm..., can you please enable debugging in kdumpctl using "set -x", then
run "touch /etc/kdump.conf; kdumpctl restart" and send me the debug output.
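
If editing the script is inconvenient, running it under "bash -x" gives
the same trace. The sketch below assumes kdumpctl is a bash script;
run_traced is a hypothetical helper, not part of kexec-tools, and the log
path in the comment is just an example.

```shell
# Run a bash script with xtrace enabled and capture stdout and stderr
# (including the "+ command" trace lines) in a log file.
# run_traced is a hypothetical helper name.
run_traced() {
    script=$1
    log=$2
    shift 2
    bash -x "$script" "$@" >"$log" 2>&1
}

# Intended use, as root (not executed here):
#   touch /etc/kdump.conf
#   run_traced "$(command -v kdumpctl)" /tmp/kdumpctl-debug.log restart
```

Touching /etc/kdump.conf first makes kdumpctl rebuild the initrd, so the
log covers the image creation as well as the kexec load.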

>
> It works if I run a Fedora kernel, but not with a self-built one.
> And there's zero information as to what I'm doing wrong.

I just tested F20 kdump on my box and it worked fine for me.

So for you the second kernel hangs and there is no info on the console?
Is there any possibility of hooking up a serial console, enabling early
printk, and seeing if something shows up there?
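
As a rough sketch, the kernel parameters for that would look like the
output of the helper below. serial_debug_params is a hypothetical name,
and ttyS0 at 115200 baud is an assumption about the hardware; the
console= and earlyprintk=serial options themselves are standard kernel
parameters.

```shell
# Print kernel parameters that send console output and early printk
# messages to a serial port. serial_debug_params is a hypothetical helper;
# the defaults (ttyS0, 115200) are assumptions about the machine.
serial_debug_params() {
    port=${1:-ttyS0}
    baud=${2:-115200}
    printf 'console=%s,%s earlyprintk=serial,%s,%s\n' \
        "$port" "$baud" "$port" "$baud"
}

serial_debug_params
```

These need to end up on the command line of the second (kdump) kernel, not
only on the first kernel's.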

Apart from this, if you run into kdump issues in Fedora, please cc the
kexec Fedora mailing list too, so that we are aware of them.

https://lists.fedoraproject.org/mailman/listinfo/kexec

Thanks
Vivek
