Re: [PATCH RFC] x86: check for and defend against BIOS memory corruption

From: Jeremy Fitzhardinge
Date: Thu Sep 04 2008 - 19:04:54 EST


Hugh Dickins wrote:
> Well.
>
> Thanks for the prod, and I'm certainly remiss for not following
> up sooner. But I'm really not at all keen on such a patch going
> into mainline myself.
>

I have my original patch, with your changes as a followup patch, sitting
in my queue. I was planning on sending it in the next day or so, and on
adding the memory size parameter too.

> It's an interesting experiment, and I'd be happy to see such a patch
> (adjusted to make sure output goes to kerneloops.org) spending a little
> while in Fedora Rawhide (who'd be the right contact for that?).
>

Dave Jones? He used to do it, at least. I guess a WARN_ON() would get
picked up by kerneloops.

> But so far as mainline goes, I share Alan Cox's opinion that we should
> not be chopping pages out of every x86 user's memory, just because a
> couple of machines with faulty BIOSes have been observed.
>

I think that's a worthwhile cost for -rc. We can fix it up (i.e., make it
a real config option, defaulting to off) for release once we're happy that
we understand the scope of the problem.

> Particularly now it's evident that the 64kB "limit" is no more than a
> reflection of where the directmap pagetable changes have caught such
> corruption.
>

We should definitely make the kernel parameter set the banned memory
size so we can experiment with different cutoffs.
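
To make the idea concrete, here is a rough user-space sketch, not the
actual patch. It assumes the parameter takes a size with a K/M/G suffix
(the way the kernel's memparse() does) and that the reserved region is
zeroed at boot and rescanned later. parse_size() and scan_region() are
hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for memparse(): "64k" -> 65536, "1M" -> 1048576. */
static size_t parse_size(const char *s)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		break;
	default:
		break;
	}
	return (size_t)v;
}

/*
 * Scan a region that was zeroed earlier and count the words that are
 * no longer zero. Each bad word is reported on stderr.
 */
static size_t scan_region(const unsigned long *region, size_t bytes)
{
	size_t i, bad = 0;

	for (i = 0; i < bytes / sizeof(*region); i++) {
		if (region[i] != 0) {
			fprintf(stderr, "corrupt word at offset %zu: %#lx\n",
				i * sizeof(*region), region[i]);
			bad++;
		}
	}
	return bad;
}
```

In the kernel the report would presumably go through WARN_ON() or
similar rather than fprintf(), so that kerneloops can pick it up.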

> If lots more such corruptions are reported, of course I would change
> my position; but those bad directmap PMD crashes are themselves quite
> recognizable now we know to look out for them.
>
> I would prefer you both to use the minimal memmap= solutions for now;
> but others may disagree.
>

The fact that we're seeing this problem in two completely different
systems with different BIOSes and everything else makes me worried that
this is quite widespread. It's only thanks to the persistence and
diligence of our bug reporters that we managed to work out that they're
the same problem. How many other people are getting strange crashes and
haven't managed to correlate them with any particular BIOS interaction?
Or just happen to be corrupting memory we don't care about right now,
but are only a small code change or link order change away from disaster?

J