Re: [PATCH -v10 0/4] Lock-less list

From: Tim Hockin
Date: Fri Jan 21 2011 - 12:39:41 EST


OOh ohh, can I jump in? As another of "those Google guys" who has
been dealing with Linux's lack of solutions here for years....

On Thu, Jan 20, 2011 at 2:53 PM, Mike Waychison <mikew@xxxxxxxxxx> wrote:
> On 01/20/11 05:06, Ingo Molnar wrote:
>>
>> * huang ying <huang.ying.caritas@xxxxxxxxx> wrote:
>>
>>> On Thu, Jan 20, 2011 at 8:14 PM, Ingo Molnar <mingo@xxxxxxx> wrote:
>>>>
>>>> * huang ying <huang.ying.caritas@xxxxxxxxx> wrote:
>>>>
>>>>>> But will all that stuff be accepted? Please stop sending
>>>>>> infrastructure bits and
>>>>>> focus on your larger RAS picture, once you have consensus on that from
>>>>>> all
>>>>>> parties involved, then, and only then, does it make sense to submit
>>>>>> everything,
>>>>>> including infrastructure.
>>>>>
>>>>> I am not sending hardware error reporting infrastructure. As far as I
>>>>> know, Linus
>>>>> and Andrew suggest to use printk for hardware error reporting. And
>>>>> now, I just
>>>>> try to write APEI driver and reporting hardware error with printk. Is
>>>>> it
>>>>> acceptable? Do you have some other idea about hardware error
>>>>> reporting?
>>>>
>>>> Erm, how could you possibly have missed the perf based RAS daemon work
>>>> of Boris,
>>>> which we've pointed out about half a dozen times already?
>>>
>>> Even if there is some other hardware error reporting infrastructure
>>> such as perf based, I think we still need printk too. After all, as
>>> Linus pointed out, printk is the most popular error reporting
>>> mechanism so far. Do you think so?
>>
>> Of course, that's why the upstream EDAC code uses printk too. In fact it
>> does all
>> sorts of in-kernel decoding to make the printk output more useful - the
>> /dev/mcelog
>> method of pushing all decoding to user-space is fundamentally flawed.

EDAC is fundamentally flawed and we don't use it any more. It strips
off so much information that we can't actually figure out what
happened to the level we want. We do it in userspace now.

> Geez, I don't know how to approach this proposition in a concise way :(
> Processing machine checks in-kernel is just as flawed as relying on
> /dev/mcelog alone IMO. I agree with you that relying on /dev/mcelog to get
> all of our error data out is flawed, but so is relying on an in-kernel
> "abstraction" of the data exposed from the hardware.
>
>
> There are many different ways a system can fail such that an MCE isn't
> received and processed by the kernel. Sometimes the error is just too fatal
> to do anything useful. Errors like a NB buffer CRC error, a bus syncflood,
> or a cache hierarchy ECC error that was incorrectly propagated up through to
> the L1 (which may only have parity checking) can cause the kernel to fall
> over as the CPU is either cut off from the rest of the world or too confused
> to get anything right.
>
> Getting at this information is still very worthwhile however, and I'm
> guessing that this is what the APEI bits are meant to be doing. You'll be
> seeing patches for Google firmware drivers that provide functionality along
> the same vein in the coming days (I'm still busy whitewashing and
> documenting them).
>
> It's also very ignorant to assume that the kernel knows everything about the
> system and is capable of decoding errors to the satisfaction of userland.
> As Duncan Laurie pointed out (https://lkml.org/lkml/2011/1/11/390) we care
> about not only the physical address, but which stick and which dimm *chip*
> on the stick is having problems. In-kernel abstractions break down due to
> the following:

This. Andi was trying to use DMI tables to decode physical address to
DIMMs, but I'll tell you this: I have yet to see a platform that has
THAT MUCH information in the DMI tables and have it be *correct*.

>
>  * The kernel couldn't possibly know how my i2c busses are set up and how the
> SPD EEPROMs are related to the physical memory abstraction that the BIOS
> sets up for me. I don't know of any standard way to have the BIOS expose
> this sort of information to the operating system. This sort of layout
> changes between motherboard spins quite frequently as well, so good luck
> mapping it yourself in any generic way.
>
>  * The kernel couldn't know how to map SPD JEDEC Manufacturer ID, Model
> part number and revision to anything useful about the chips themselves.
>
>  * The kernel also couldn't know how to communicate with the AMBs in a
> meaningful way (if present).
>
>
> At the end of the day, the only things I really care about are:
>
>  * I don't care if the kernel pre-processes the data it gets from the
> hardware when there is an error. For most users, burping something out to
> the logs in decoded form is generally useful. It isn't for us.
>  * Don't ever put the kernel in a position where it will spam the logs and
> wedge the system -- even if the hardware is wonky.

I'll add to this - sometimes 100 MCEs/second is acceptable. The
kernel needs to not flake out under that.

>  * Don't dummy the data such that I can't do the same calculations with
> better visibility from userland.

This. We do extensive analysis of data in userland.

>  * Don't ever enforce a reactive policy that can't be changed from
> userland.
>  * I don't care whether the data comes from netlink, /dev/mcelog,
> whiz-bang-sysfs uevent, or thingamaboo perfevents doohickie: as long as I
> get events that are both atomic+consistent and the ABI is maintained.

I've been asking for hardware events forever. I seem to recall a
proposal from IBM at OLS 2002 or 2003 where this was discussed. I
wanted it then, and I still want it. But I don't just want MCEs. Why
can I not use the same channel to get PCI errors or SATA errors or
EDAC (non-MCE) errors?

I don't care what the channel is, so long as I can rate-limit
(/dev/mcelog is pretty good at that) events and the events I read
contain full details about what happened.
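
To make "rate-limit" concrete, here is a minimal token-bucket sketch in
plain C. It is only an illustration of the kind of policy I mean, not how
/dev/mcelog or any kernel channel actually does it. Every name in it
(event_limiter, limiter_init, limiter_allow) is hypothetical, and the
caller is assumed to supply a monotonic timestamp in nanoseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical token-bucket limiter; not an existing kernel interface. */
struct event_limiter {
	uint64_t rate_per_sec;	/* sustained events allowed per second */
	uint64_t burst;		/* maximum tokens that may accumulate */
	uint64_t tokens;	/* tokens currently available */
	uint64_t last_ns;	/* time up to which tokens have been credited */
	uint64_t dropped;	/* events suppressed so far */
};

static void limiter_init(struct event_limiter *l, uint64_t rate_per_sec,
			 uint64_t burst, uint64_t now_ns)
{
	assert(rate_per_sec > 0);	/* a zero rate would divide by zero below */
	l->rate_per_sec = rate_per_sec;
	l->burst = burst;
	l->tokens = burst;		/* start with a full bucket */
	l->last_ns = now_ns;
	l->dropped = 0;
}

/* Returns true if the event may be delivered, false if it is suppressed. */
static bool limiter_allow(struct event_limiter *l, uint64_t now_ns)
{
	uint64_t elapsed = now_ns - l->last_ns;
	/* Whole tokens earned since last_ns (sketch: ignores overflow for
	 * very long idle periods combined with very high rates). */
	uint64_t refill = elapsed * l->rate_per_sec / NSEC_PER_SEC;

	if (refill > 0) {
		if (refill >= l->burst - l->tokens)
			l->tokens = l->burst;	/* cap at the burst size */
		else
			l->tokens += refill;
		/* Advance only by the time that paid for whole tokens, so
		 * fractional credit is not lost. */
		l->last_ns += refill * NSEC_PER_SEC / l->rate_per_sec;
	}

	if (l->tokens > 0) {
		l->tokens--;
		return true;
	}
	l->dropped++;	/* keep a count so the reader knows what it missed */
	return false;
}
```

With rate_per_sec = 100 the limiter passes the 100 MCEs/second case
mentioned above, and the dropped counter lets userland see how much was
suppressed instead of silently losing it.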

> I've CCed Robert who owns our userland bits as he may have something to add.
>
> That said, I'd love to have generic NMI-safe data-passing for improved
> debuggability, regardless of this conflated bickering about RAS
> infrastructure :)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>