Re: [PATCH -v10 0/4] Lock-less list

From: Borislav Petkov
Date: Thu Jan 20 2011 - 08:37:04 EST


+ Tony.

On Thu, Jan 20, 2011 at 02:06:25PM +0100, Ingo Molnar wrote:
>
> * huang ying <huang.ying.caritas@xxxxxxxxx> wrote:
>
> > On Thu, Jan 20, 2011 at 8:14 PM, Ingo Molnar <mingo@xxxxxxx> wrote:
> > >
> > > * huang ying <huang.ying.caritas@xxxxxxxxx> wrote:
> > >
> > >> > But will all that stuff be accepted? Please stop sending infrastructure bits and
> > >> > focus on your larger RAS picture, once you have consensus on that from all
> > >> > parties involved, then, and only then, does it make sense to submit everything,
> > >> > including infrastructure.
> > >>
> > >> I am not sending hardware error reporting infrastructure. As far as I know, Linus
> > >> and Andrew suggest to use printk for hardware error reporting. And now, I just
> > >> try to write APEI driver and reporting hardware error with printk. Is it
> > >> acceptable? Do you have some other idea about hardware error reporting?
> > >
> > > Erm, how could you possibly have missed the perf based RAS daemon work of Boris,
> > > which we've pointed out about half a dozen times already?
> >
> > Even if there is some other hardware error reporting infrastructure
> > such as perf based, I think we still need printk too. After all, as
> > Linus pointed out, printk is the most popular error reporting
> > mechanism so far. Do you think so?
>
> Of course, that's why the upstream EDAC code uses printk too. In fact it does all
> sorts of in-kernel decoding to make the printk output more useful - the /dev/mcelog
> method of pushing all decoding to user-space is fundamentally flawed.

True story. And yet Google folk still do that, unfortunately:
https://lkml.org/lkml/2011/1/10/419

I think printk should be used in most cases: the home user runs Linux
on his machine, it freezes, and to catch the MCA info he simply collects
the serial console output or, with a persistent storage device in place,
reboots and then reads out the exact decoded error.

In the big data center, printk might not be that useful anymore and we
might want to have structured error log data - still decoded, mind you,
and properly formatted, but sent to userspace over perf and then over
the network, or collected by a userspace daemon making policy decisions
and evaluating error trends. This is, I think, a much saner approach than
collecting hardware info from every machine and then using it to decode
the errors. We still need a bunch of work in that direction though.

> So yes, printk is the primary output channel and having a readable printk output
> pretty much overrides any other concern.
>
> But that is not what you are doing. I get the impression that you are using printk
> as an _excuse_ to not have to work with the RAS people and run some parallel
> framework so that you do not have to work with them or listen to them. It is rather
> counter-productive. Working together is useful.

So yeah, let me reiterate what Andrew and Ingo said: I don't want to
discuss the merits of all those cool lockless software thingies that
could replace this and that and would be cool if someone used them. I'm
only interested if they can help in a real-world use case - otherwise
it's just a programming exercise.

Yeah, this is all IMHO, of course :).

--
Regards/Gruss,
Boris.