Re: [patch] x86_64: fix boot hang caused by CALGARY_IOMMU_ENABLED_BY_DEFAULT

From: Ingo Molnar
Date: Thu Dec 21 2006 - 05:39:55 EST



* Muli Ben-Yehuda <muli@xxxxxxxxxx> wrote:

> > > it would be used, period. You may disagree, but fundamentally I
> > > think the mainline kernel should be fairly experimental, which
> > > means enabling new code by default.
> >
> > that's a totally wrong attitude - the mainline kernel is /not/
> > experimental. A distro might or might not enable the new option, but
> > we just dont enable experimental platform support code via "default
> > y"...
>
> I disagree, it seems to me most "experimental platform support code"
> is simply enabled because it doesn't even have a CONFIG option (c.f.,
> recent genirq and IO-APIC breakage on x86-64). With regards to this
> specific option, you might even say that not defaulting to 'y' here
> would be a regression in behaviour against previous released kernels,
> which used Calgary if it was compiled in, no questions asked. So at
> least in that sense, instructing the user to select y if unsure and
> default y are appropriate.

i still disagree. We should be /a lot/ more careful in adding patches
that change platform behavior.

i happen to run a distro kernel, and in fact i'm maintaining the -rt yum
repository which offers a distro kernel and which also tracks mainline
very closely.

i agree with you that Calgary support, if it was enabled (which it is in
my config), would run through the calgary_detect() function. The
breakage was not caused by the default-y patch, it was caused by the
other patch preceding it:

commit b34e90b8f0f30151349134f87b5dc6ef75a5218c
Date: Thu Dec 7 02:14:06 2006 +0100

[PATCH] Calgary: use BIOS supplied BBARs and topology information

I think in the future it would be better to annotate the introduction of
new, widely used codepaths via KERN_DEBUG printouts, something along the
lines of:

printk(KERN_DEBUG "calgary: running new EBDA code.\n");
...
printk(KERN_DEBUG "calgary: done.\n");

That way "debug ignore_loglevel console=tty" bootopions would have
uncovered the source of this hang. Just like i can use the
initcall_debug boot option to figure out where the bootup hangs.

but i still /strongly/ disagree with your attitude that mainline is
'experimental' and hence there's nothing to see here, move over.

> > The other problem is that the changelog entry says that it's off by
> > default, while in reality the new option switched this code on for
> > my box, and broke it.
>
> Sorry about that (both the wrong changelog entry and the fact that it
> broke your box).

there's really no need to apologize, i probably broke your box a few
orders of magnitude more times than you broke mine ;)

Nevertheless my point is that we /have/ to be more supportive of early
adopter distro kernels (Dave Jones says that the same bug has hit Fedora
rawhide too), and have to be doubly careful about anything that goes
into code that is run by /everyone/. Especially if it's in a hard to
debug place like early bootup code.

This incident i think shows that bisection and testing in -mm doesnt
always cut it (bisection-testing is quite laborous with a large distro
kernel), and i think we could do more technological measures in this
area to lower the bar of entry for users to help us narrow down such
bugs. Such as a KERN_DEBUG policy for annotating new code in the bootup
path.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/