RE: [PATCH] RAS: Add a tracepoint for reporting memory controller events

From: Luck, Tony
Date: Fri Jun 01 2012 - 14:22:58 EST


> This is why I'm advocating the userspace - you can implement almost
> anything there - we only need the kernel to be as thin and as fast when
> reporting those errors so that we can have the most reliable and full
> info as possible. The kernel's job is only to report as many errors
> as it possibly can so that userspace can create a good picture of the
> situation.

I'm with you on this. Userspace is the right place to analyze and set
policy for actions.

But we need to make sure that user space can actually run. That's the
motivation behind the CMCI disable patches. Since Intel broadcasts CMCI
to all CPUs on a socket, a CMCI storm on a single-socket machine will
stop any user code from running.

I'd make one small change to what you said:

The kernel's job is to report enough error information that user space
can make an accurate assessment of the source of the error.

I.e. "enough" is less than "as many errors as it possibly can".

-Tony