Re: Hardware Error Kernel Mini-Summit

From: Andi Kleen
Date: Tue Jun 15 2010 - 15:35:30 EST


> But there are bugs. And correcting them is so prohibitively
> expensive that I don't even want to think about it. And when

Something is wrong in your setup then.

> the BIOS messes up, it's the device driver writers who have to
> magically work around the problems.

In this case you would need the equivalent of a system-specific
DMI table in some device driver.

Do you see how this does not fly? How should a device
driver know more about the system than the BIOS?

And if you can load some specific table into the device
driver, why can't you simply update the BIOS too?

Well, you can supply your own table if you're a power user,
but most users are not power users. So it's not an option
as a default.

Or could you imagine a standard server getting installed and
popping up a desktop window asking "please enter the DIMM mappings
by hand"? That simply doesn't make any sense.


>
> Could we come up with some plan that doesn't involve
> trusting to the goodwill (and competence) of BIOS writers?

The problem is that the information is nowhere else.
If the BIOS doesn't know it, Linux certainly doesn't know it either.

On the other hand, if Linux uses this information there is certainly
an angle to get at least server vendors to fix their stuff
(and non-servers do not matter for memory errors because they
run in non-ECC mode anyway).

It's certainly in the server vendors' own interest to supply correct
information here anyway. If they don't, it will cost them in
unnecessary memory replacements.

BTW, on the systems I have access to, DMI seems to be largely
correct these days. I guess your system is an unlucky exception.

Maybe your BIOS people will do something useful next generation.
Make sure to report it to them, and if they don't fix it, make fun of them.


> but maybe there could be some way to apply the same principle? Maybe
> some way of loading modules with parameters or configuring your setup
> from sysfs?

Having a DMI override is no problem at all. ACPI uses this all the time
for example.
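
To make the idea concrete, here is a rough userspace sketch of what
such an override could look like: a per-system table that corrects
DIMM labels and falls back to whatever the firmware reported. This is
only an illustration, not how ACPI or any kernel code does it. The
OVERRIDES table, the vendor/product/label strings and the helper names
are all made up; the only real thing assumed is that sys_vendor and
product_name are readable under /sys/class/dmi/id.

```python
from typing import Dict, Optional, Tuple

# Hypothetical override table, keyed by (sys_vendor, product_name).
# Each entry maps a firmware-reported DIMM label to a corrected one.
# All strings here are invented for illustration.
OVERRIDES: Dict[Tuple[Optional[str], Optional[str]], Dict[str, str]] = {
    ("ExampleVendor", "ExampleServer 1000"): {
        "DIMM_A1": "CPU0_DIMM_1",
        "DIMM_A2": "CPU0_DIMM_2",
    },
}


def read_dmi_id(field: str, base: str = "/sys/class/dmi/id") -> Optional[str]:
    """Return one DMI identification string from sysfs, or None if unreadable."""
    try:
        with open(f"{base}/{field}", encoding="utf-8") as f:
            return f.read().strip()
    except OSError:
        return None


def resolve_label(vendor: Optional[str], product: Optional[str],
                  firmware_label: str) -> str:
    """Use the override for this system if one exists, else trust the firmware."""
    table = OVERRIDES.get((vendor, product), {})
    return table.get(firmware_label, firmware_label)


if __name__ == "__main__":
    vendor = read_dmi_id("sys_vendor")
    product = read_dmi_id("product_name")
    print(resolve_label(vendor, product, "DIMM_A1"))
```

The point of the fallback is that the default stays "trust the BIOS";
the override only kicks in for systems known to be wrong.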

No need at all to speak a foreign language for this, even if it's your
mother tongue.

-Andi
--
ak@xxxxxxxxxxxxxxx -- Speaking for myself only.