Re: [patch] Re: 2.6.16-rc4-mm1 (bugs and lockups)

From: Andrew Morton
Date: Tue Feb 21 2006 - 20:31:30 EST


Stas Sergeev <stsp@xxxxxxxx> wrote:
>
> Hi.
>
> The history is that -mm kernels do not work for me
> for a few months already. The things started from
> crashing somewhere after starting init, and for the
> last month - no boot at all, just
> "Uncompressing... OK, booting kernel", and silence.
> Early console didn't work too.
> With the latest releases this degraded into an infinite
> stream of the "Unknown interrupt or fault" messages.
> So today my patience ran out and I started to think how
> can I collect at least some info for the bug-report.
> Attached is the patch that allows to gather some valueable
> debug info on the problem by making an early console more
> useable. I can't properly test the patch, as the kernel
> still doesn't boot, so I'll explain it in details in a
> hope someone else can justify the intrusive changes.
>

It's unusual that the failure has been only in -mm, and for so long. That
would indicate that we have a problem in a for-mm-only patch.

And yet your patch applies OK to 2.6.16-rc4, so it's not obvious how this
got in there.

Did you never perform a bisection as per
http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt,
find out which patch in -mm was the offender?

> arch_hooks.h: added prototypes for setup_early_printk()
> and early_printk().
>
> head.S: added "hlt" to the dummy fault handler. This is
> necessary because otherwise the fault retriggers infinitely,
> causing the infinite stream of an "Unknown interrupt or fault"
> messages, which scrolls away the usefull info. I don't know
> if this is a safe change.
>
> setup.c: killed wrong setup_early_printk() prototype.
> Moved setup_early_printk() a bit earlier, as it was not
> "early enough" to cover the bug I was fighting with.
>
> early_printk.c: made it to start printing from the bottom
> of the screen, otherwise the messages interfere with the
> ones of the boot-loader, so you can't read them.
>
> main.c: moved smp_prepare_boot_cpu() call earlier. This
> was necessary because otherwise printk() can't print
> It checks cpu_online(), which returns false. This change
> is consistent with the UP case, where's the boot CPU is
> "online" from the very beginning, AFAICS. But again, I am
> not entirely sure whether this is safe.
>

They all sound like good stuff - I'll take a look, thanks.

> OK, so with that patch I was hoping to collect some debug
> info. It turned out though, that the main.c change also
> fixes the problem itself. The lockup was happening in an
> __alloc_bootmem_core(): the "if (!size)" check was succeeding,
> and the BUG() was triggering. After my main.c change this no
> longer happens, but I don't know where the problem was.
>
> I still can't boot the kernel because of this
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0602.2/1244.html
> but at least I know that with the attached patch, the boot
> process goes much further.
>

Sorry, this was hot-fixed. See
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc4/2.6.16-rc4-mm1/hot-fixes/.
You'll want revert-register-sysfs-device-for-lp-devices.patch. May as
well apply the others, too..
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/