Re: Hardware Error Kernel Mini-Summit

From: Andi Kleen
Date: Mon Jun 14 2010 - 16:21:19 EST


On Mon, Jun 14, 2010 at 09:47:33PM +0200, Nils Carlson wrote:
>> Also the biggest problem is still that EDAC doesn't
>> give you any silk screen labels, so unless you
>> have motherboard schematics the layout it presents
>> is fairly useless -- you still don't know which DIMM
>> to exchange. So in theory EDAC looks great, but in practice ...
>>
> I do have motherboard schematics, or rather, we build our own
> boards. But the point is valid, a lot of people don't make their own

Just supply correct DMI tables then?

> hardware. On the other hand, the people who do use this part of
> EDAC perhaps aren't your typical home computer users?

Most users do not build their own boards and do not have
schematics. And that's not just home computer users.

Anyways, I think the important thing is that by default you get
something useful (including silk screen labels) without doing
any special configuration steps.

Right now DMI is the only sane option for this that I can see.
EDAC doesn't do it because it has no silk screen labels.

And yes if someone is a power user they could still override
that. Just by default it has to do something reasonable.
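
To illustrate what "useful by default" could look like, here is a rough
sketch that pulls the silk screen labels out of DMI memory device records
(type 17), using the "Locator" field as printed by dmidecode. The sample
text and its values are made up for illustration, and the helper names
are hypothetical; it is not how any existing tool does it.

```python
# Illustrative text in the style of `dmidecode -t memory` output.
# The handles, sizes and locator strings are invented for this sketch.
SAMPLE = """\
Handle 0x0040, DMI type 17, 34 bytes
Memory Device
    Size: 4096 MB
    Locator: DIMM_A1
    Bank Locator: NODE 0 CHANNEL 0 DIMM 0

Handle 0x0042, DMI type 17, 34 bytes
Memory Device
    Size: No Module Installed
    Locator: DIMM_A2
    Bank Locator: NODE 0 CHANNEL 0 DIMM 1
"""


def memory_devices(text):
    """Return one dict of 'Key: value' fields per Memory Device record."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            # Start of a new type 17 record.
            current = {}
            devices.append(current)
        elif not stripped or stripped.startswith("Handle "):
            # A blank line or a new handle ends the current record.
            current = None
        elif current is not None and ": " in stripped:
            key, value = stripped.split(": ", 1)
            current[key] = value
    return devices


def populated_labels(text):
    """Silk screen labels (Locator) of slots that actually hold a DIMM."""
    return [
        dev["Locator"]
        for dev in memory_devices(text)
        if "Locator" in dev and dev.get("Size") != "No Module Installed"
    ]
```

Of course this only helps if the board vendor filled in the tables
correctly.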

>
> This is true, and this is the way things are going on
> our end as well. I guess that would mean
> So you wouldn't go to the EDAC sysfs directory
> to find everything to do with the same piece of hardware
> anymore, but would have to go to n different
> directories looking for all the pieces? I don't really
> like that...

Let me try to understand that.

You want to inject errors on a random computer you don't
know anything about? Do you do that frequently? Why
are you doing this?

Obviously there needs to be a way to identify which
hardware an error injector belongs to.

>
>> Anyways the old EDAC drivers for this are not going
>> away, you can still use them. The interesting
>> question though is how to properly define the interface
>> for new hardware.
>
> But all new hardware will look the way the hardware
> designers want it to, so our interface will be a moving
> target? Maybe it's time to let hardware makers provide

You can define relatively abstract interfaces.

It's just that EDAC is not it. They may not be perfectly
future proof (after all, who knows what the memories of quantum
computers or whatever will look like), but hopefully
at least reasonably forward looking.

e.g. for memory layout imho a reasonable way
is to just define it as

DIMM (if you need below that look at a log)
   \-------- silk screen label (most important attribute!)
   |
   abstract path. This can be an arbitrary string. e.g. MC0/Ch1/DIMM0
   |              Or MC0/BOB0/Ch1/DIMM3
   |              Parsers don't need to know any details about it.
   |
socket

You can even represent that as a flat data structure,
no need to really map the abstract path to directories
(that just makes parsers difficult to write -- most sysfs
parsers traditionally have trouble with varying directories)
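
A minimal sketch of such a flat structure, one record per DIMM with the
abstract path kept as an opaque string. The record type, field names and
label values are hypothetical, just to show that a consumer never has to
split or interpret the path:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimmRecord:
    label: str   # silk screen label, the attribute users care about most
    path: str    # abstract path, opaque to parsers
    socket: int  # socket the DIMM hangs off


# Example records; the labels and socket numbers are invented.
RECORDS = [
    DimmRecord(label="DIMM_A1", path="MC0/Ch1/DIMM0", socket=0),
    DimmRecord(label="DIMM_B3", path="MC0/BOB0/Ch1/DIMM3", socket=1),
]


def label_for_path(records, path):
    """Map an abstract path from an error report to a silk screen label.

    The path is only compared for equality, never parsed, so a new
    topology level (like BOB0 above) needs no parser changes.
    """
    for record in records:
        if record.path == path:
            return record.label
    return None
```

An error report that names "MC0/BOB0/Ch1/DIMM3" then resolves to the
label with a plain lookup, whatever the hardware topology looks like.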



> a board specification with device tree and memory
> layout? (Pure speculation)

That's DMI on x86!

Well it's not perfect, but also not too bad.


> There is a use-case. A lot has to do with how different patrol
> scrub rates work, some just go through memory at a constant
> speed (MB/s), others vary according to load. The thing is,
> different applications want their memory scrubbed within
> different time frames, and as the amount of memory on boards

What's the theory behind varying scrub rates?
I would be interested in more details.

> Patrol scrubbing is normally used because it discovers errors
> faster in seldom accessed memory allowing a DIMM with
> too many errors to be replaced faster. Some applications

Yes, but why do you want to vary the rate?
Normally it should just depend on memory size and expected
error rate (that is, the more memory, the faster you scrub)
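
As a back-of-the-envelope sketch of that dependency, assuming a constant
rate scrubber and a target time for one full pass over memory (the
function name and the example numbers are mine, not from any driver):

```python
MIB = 1024 * 1024


def required_scrub_rate_mib_s(memory_bytes, full_pass_seconds):
    """Constant scrub rate (MiB/s) needed to cover all memory once
    within full_pass_seconds."""
    if full_pass_seconds <= 0:
        raise ValueError("full_pass_seconds must be positive")
    return memory_bytes / MIB / full_pass_seconds


# Example: one full pass per day over 64 GiB versus 128 GiB.
day = 24 * 3600
rate_64g = required_scrub_rate_mib_s(64 * 1024 * MIB, day)
rate_128g = required_scrub_rate_mib_s(128 * 1024 * MIB, day)
```

With the pass time held fixed the rate scales linearly with memory size,
so doubling the memory doubles the required rate.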

> like to use demand scrubbing as well, and some consider
> it to increase memory latency too much.

That sounds odd -- if you have so many errors that you worry
about that, you definitely have other problems?
Is this based on some benchmarking?

-Andi

--
ak@xxxxxxxxxxxxxxx -- Speaking for myself only.