Re: 2.6.32-rc8: amd64_edac slub error

From: Borislav Petkov
Date: Tue Dec 01 2009 - 10:17:03 EST


> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 2367: DRAM MEM-CTL PCI Bus ID: 0000:00:18.2
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 2369: Misc device PCI Bus ID: 0000:00:18.3
> calling alsa_pcm_init+0x0/0x71 [snd_pcm] @ 1402
> initcall alsa_pcm_init+0x0/0x71 [snd_pcm] returned 0 after 17 usecs
> EDAC amd64: ECC is enabled by BIOS.
> get_cpus_on_this_dct_cpumask: nid: 0, cpu: 0
> get_cpus_on_this_dct_cpumask: nid: 0, cpu: 2
> amd64_nb_mce_bank_enabled_on_node: weight: 2
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 2776: core: 0, MCG_CTL: 0x1f, NB MSR is enabled
> EDAC DEBUG: in drivers/edac/amd64_edac.c, line at 2776: core: 2, MCG_CTL: 0x0, NB MSR is disabled
> =============================================================================
> BUG kmalloc-16: Redzone overwritten
> -----------------------------------------------------------------------------

Hmm, I think I know what happens. This machine has non-contiguous
core enumeration on a node (e.g. 0,2 on node 0 instead of 0,1) but
rdmsr_on_cpus assumes contiguous numbering. Therefore we write outside
of the allocated msrs array, hence the redzone overwrite. Here's a
simple fix that should take care of it. Please apply it on top of the
debugging patch and capture the output again so that we can verify it.
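
To illustrate the sizing problem, here is a minimal user-space sketch. It
assumes each CPU's result lands in slot (cpu - first_cpu), which is what the
allocation size in the patch below implies. The toy_* names and the 64-bit
mask type are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* Hypothetical stand-in for a kernel cpumask: bit i set means CPU i is in the mask. */
typedef uint64_t toy_cpumask;

#define TOY_NR_CPUS 64

/* Number of CPUs in the mask; the old allocation used this as the array length. */
static int toy_weight(toy_cpumask mask)
{
	int i, n = 0;

	for (i = 0; i < TOY_NR_CPUS; i++)
		if (mask & ((toy_cpumask)1 << i))
			n++;
	return n;
}

/* Lowest CPU number in the mask, or -1 if the mask is empty. */
static int toy_first(toy_cpumask mask)
{
	int i;

	for (i = 0; i < TOY_NR_CPUS; i++)
		if (mask & ((toy_cpumask)1 << i))
			return i;
	return -1;
}

/* Highest CPU number in the mask, or -1 if the mask is empty. */
static int toy_last(toy_cpumask mask)
{
	int i, last = -1;

	for (i = 0; i < TOY_NR_CPUS; i++)
		if (mask & ((toy_cpumask)1 << i))
			last = i;
	return last;
}

/* Array length that covers every slot when CPU c uses slot (c - first). */
static int toy_interval_len(toy_cpumask mask)
{
	int first = toy_first(mask);

	if (first < 0)
		return 0;
	return toy_last(mask) - first + 1;
}
```

For the mask from the log above (CPUs 0 and 2, i.e. 0x5), toy_weight() gives 2
while CPU 2 would use slot 2, so a two-entry array is one entry short.
toy_interval_len() gives 3, which covers it.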

I'll fix this properly when I get back and then maybe even backport it
depending on the intrusiveness of the changes.

Thanks.

---
drivers/edac/amd64_edac.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 139bc14..c013261 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2750,7 +2750,8 @@ static bool amd64_nb_mce_bank_enabled_on_node(int nid)
{
cpumask_t mask;
struct msr *msrs;
- int cpu, nbe, idx = 0;
+ int cpu, nbe, i, idx = 0;
+ int first_cpu, last_cpu = 0;
bool ret = false;

cpumask_clear(&mask);
@@ -2759,7 +2760,17 @@ static bool amd64_nb_mce_bank_enabled_on_node(int nid)

pr_err("%s: weight: %d\n", __func__, cpumask_weight(&mask));

- msrs = kzalloc(sizeof(struct msr) * cpumask_weight(&mask), GFP_KERNEL);
+ /*
+ * calc. cores interval when non-contiguous core enumeration
+ */
+ first_cpu = cpumask_first(&mask);
+
+ for (i = first_cpu; i < nr_cpu_ids; i++)
+ if (cpumask_test_cpu(i, &mask))
+ last_cpu = i;
+
+ msrs = kzalloc(sizeof(struct msr) * (last_cpu - first_cpu + 1),
+ GFP_KERNEL);
if (!msrs) {
amd64_printk(KERN_WARNING, "%s: error allocating msrs\n",
__func__);
--
1.6.4.3

--
Regards/Gruss,
Boris.

Operating | Advanced Micro Devices GmbH
System | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
(OSRC) | Registergericht München, HRB Nr. 43632
