Re: [PATCH v2 1/2] mce: acpi/apei: Honour Firmware First for MCA banks listed in APEI HEST CMC

From: Naveen N. Rao
Date: Fri Jun 21 2013 - 05:32:27 EST


On 06/21/2013 02:06 PM, Borislav Petkov wrote:
> On Fri, Jun 21, 2013 at 01:16:50PM +0530, Naveen N. Rao wrote:
>> Yes, but I'm afraid this won't work either - mce_banks_owned is
>> cleared during cpu offline. This is necessary since a cmci
>> rediscover is triggered on cpu offline, so that if this bank is
>> shared across cores, a different cpu can claim ownership of this
>> bank.
>
> What for? Sounds strange to me.

Look at section "15.5.1 CMCI Local APIC Interface" from Intel SDM Vol. 3, and the subsequent section on "System Software Recommendation for Managing CMCI and Machine Check Resources":
"For example, if a corrected bit error in a cache shared by two logical processors caused a CMCI, the interrupt will be delivered to both logical processors sharing that microarchitectural sub-system."

In other words, some of the MC banks are shared across logical cpus in a core and some across all cores in a package. During initialization, the first cpu in a core ends up owning most of the banks specific to the core/package. When this cpu is offlined, we would want the second cpu in that core to discover and enable CMCI for those MC banks which it shares with the first cpu.

As an example, consider a hypothetical single-core Intel processor with Hyperthreading. On init, let's say the first cpu ends up owning banks 1, 2, 3 and 4; and the second cpu ends up owning banks 1 and 2. This would mean that MC banks 1 and 2 are "hyperthread"-specific, while banks 3 and 4 are shared. Now, if we offline the first cpu, it disables CMCI on all 4 banks. However, banks 3 and 4 are shared. So, if we now do a cmci rediscovery, the second cpu will see that banks 3 and 4 don't have CMCI enabled and will then claim ownership of those so that we can continue to receive and process CMCIs from those subsystems.

Makes sense now?


Thanks,
Naveen

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/