Re: [PATCH] tmp patch to fix hotplug issue in CMCI storm

From: Thomas Gleixner
Date: Fri Jun 15 2012 - 05:56:03 EST


On Fri, 15 Jun 2012, Chen Gong wrote:
> On 2012/6/14 22:07, Thomas Gleixner wrote:
> > On Thu, 14 Jun 2012, Chen Gong wrote:
> > > this patch is based on tip tree and previous 5 patches.
> >
> > You really don't need all this complexity to handle that. The main
> > thing is that you clear the storm state and adjust the storm counter
> > when the cpu goes offline (in case the state is ACTIVE).
> >
> > When it comes online again then you can simply let it restart cmci. If
> > the storm on this cpu (or node) still exists then it will notice and
> > everything falls in place.
>
> I have tested several different scenarios. If the storm on this cpu
> still exists, it triggers the CMCI and broadcasts it to the sibling
> CPUs, which means the counter *cmci_storm_on_cpus* will increase beyond
> the upper limit. E.g. on a 2-socket SandyBridge-EP system (one socket
> has 8 cores and 16 threads), inject one error on one socket and you can
> watch *cmci_storm_on_cpus* = 16 because of the CMCI broadcast. During
> this time, offline and online one CPU on this socket: first
> *cmci_storm_on_cpus* = 15 because of the offline and ACTIVE status, and
> then *cmci_storm_on_cpus* = 31 because CMCI is activated again by the
> online. That's why I have to disable CMCI during the whole
> online/offline until the CMCI storm has subsided. Frankly, the logic is
> a little bit complex, so I wrote many comments to avoid forgetting it
> after some time :-)

This does not make any sense at all.

What you are saying is that even if CPU0 runs cmci_clear(), the CMCI
raised on CPU1 will cause the CMCI vector to be triggered on CPU0.

So how does the whole storm machinery work in the following case:

CPU0                            CPU1
cmci incoming                   cmci incoming
storm detected                  no storm detected yet
  cmci_clear()
  switch to poll

                                cmci raised

So according to your explanation that would cause the cmci vector to
be broadcast to CPU0 as well. Now that would cause the counter to
get a bogus increment, right?

So instead of hacking insane crap into the code, we simply have to do
the obvious Right Thing:

Index: linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -119,6 +119,9 @@ static bool cmci_storm_detect(void)
 	unsigned long ts = __this_cpu_read(cmci_time_stamp);
 	unsigned long now = jiffies;
 
+	if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
+		return true;
+
 	if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
 		cnt++;
 	} else {

That will prevent damage under all circumstances, cpu hotplug
included.
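
To see why the early return is sufficient, here is a rough user-space
model of the detection path. It is only a sketch, not the kernel code:
struct cpu_model, storm_detect(), deliver(), STORM_INTERVAL and
STORM_THRESHOLD are hypothetical names and values made up for
illustration, and "now" is a plain counter standing in for jiffies.

```c
#include <assert.h>	/* for callers that want to check the results */
#include <stdbool.h>

/* Hypothetical model; state names mirror the patch, values are assumed. */
enum storm_state { CMCI_STORM_NONE, CMCI_STORM_ACTIVE };

#define STORM_INTERVAL	100UL	/* assumed window, in model ticks */
#define STORM_THRESHOLD	5U	/* assumed interrupts per window */

struct cpu_model {
	enum storm_state state;
	unsigned long time_stamp;
	unsigned int cnt;
};

static int storm_on_cpus;	/* stands in for cmci_storm_on_cpus */

/* Returns true when the interrupt is to be treated as part of a storm. */
static bool storm_detect(struct cpu_model *cpu, unsigned long now, bool guarded)
{
	/* The guard from the patch: a CPU already in storm mode never
	 * re-enters the accounting below, so it cannot count twice. */
	if (guarded && cpu->state != CMCI_STORM_NONE)
		return true;

	if (now - cpu->time_stamp <= STORM_INTERVAL) {
		cpu->cnt++;
	} else {
		cpu->cnt = 1;
		cpu->time_stamp = now;
	}

	if (cpu->cnt <= STORM_THRESHOLD)
		return false;

	cpu->state = CMCI_STORM_ACTIVE;
	storm_on_cpus++;	/* the bogus increment happens here */
	return true;
}

/* Deliver n back-to-back interrupts to one CPU; return the global counter. */
static int deliver(unsigned int n, bool guarded)
{
	struct cpu_model cpu = { CMCI_STORM_NONE, 0, 0 };
	unsigned int i;

	storm_on_cpus = 0;
	for (i = 0; i < n; i++)
		storm_detect(&cpu, i + 1, guarded);
	return storm_on_cpus;
}
```

In this model deliver(10, true) leaves the counter at 1, while
deliver(10, false) pushes it to 5, because every interrupt that arrives
after the storm was detected increments it again. That is the same kind
of overshoot as the 31 reported above, just with smaller numbers.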

But that's too simple and comprehensible I fear.

Thanks,

tglx