Re: [PATCH] x86: Prevent oops with >16 memory controllers

From: Borislav Petkov
Date: Mon Feb 16 2015 - 06:40:53 EST


On Sat, Feb 14, 2015 at 11:18:40AM +0800, Daniel J Blueman wrote:
> When ECC interrupts occur on memory controllers after EDAC_MAX_MCS (16), the

I knew this artificial limit would come back to bite us someday :-\

> kernel fatally dereferences unallocated structures [1]; this occurs on at
> least NumaConnect systems.
>
> Minimally fix by checking if a memory controller info structure is allocated;
> candidate for stable.
>
> Signed-off-by: Daniel J Blueman <daniel@xxxxxxxxxxxxx>
>
> -- [1]
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
> IP: [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
> PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
> Oops: 0000 [#2] SMP
> Modules linked in:
> CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G D 3.19.0 #1

CPU 224?! What node is that? :)

> ---
> drivers/edac/amd64_edac.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 17638d7..baccc0e 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -2175,7 +2175,7 @@ static void __log_bus_error(struct mem_ctl_info *mci, struct err_info *err,
> static inline void decode_bus_error(int node_id, struct mce *m)
> {
> struct mem_ctl_info *mci = mcis[node_id];
> - struct amd64_pvt *pvt = mci->pvt_info;
> + struct amd64_pvt *pvt;
> u8 ecc_type = (m->status >> 45) & 0x3;
> u8 xec = XEC(m->status, 0x1f);
> u16 ec = EC(m->status);
> @@ -2190,6 +2190,11 @@ static inline void decode_bus_error(int node_id, struct mce *m)
> if (xec && xec != F10_NBSL_EXT_ERR_ECC)
> return;
>
> + /* Unable to decode on memory controllers after EDAC_MAX_MCS, as no mci is allocated */
> + if (!mci)
> + return;
> + pvt = mci->pvt_info;

Hmm, so we have all the facilities to fix that properly, IINM:
edac_mc_find(), add_mc_to_global_list() and so on.

Would looking through the list of the memory controllers help instead,
i.e. if you do:

static inline void decode_bus_error(int node_id, struct mce *m)
{
	struct mem_ctl_info *mci = edac_mc_find(node_id);
	if (!mci)
		return;

?

Then we can get rid of that local mcis dumbness and do it properly...
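
To make the idea concrete, here is a minimal userspace sketch, not kernel
code and not a tested patch. It models a lookup that walks a list of
registered controllers by index instead of indexing a fixed-size array, so
a node with no registered controller simply yields NULL. All names in it
(struct mc_model, mc_model_add(), mc_model_find(), decode_model()) are
hypothetical stand-ins; the real edac_mc_find() and struct mem_ctl_info
differ in detail.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct mem_ctl_info. */
struct mc_model {
	int mc_idx;		/* controller (node) index */
	void *pvt_info;		/* driver-private data */
	struct mc_model *next;	/* link in the model's global list */
};

/* Head of the model's global controller list (assumption: singly linked). */
static struct mc_model *mc_list;

/* Register a controller under idx; returns NULL if allocation fails. */
static struct mc_model *mc_model_add(int idx, void *pvt)
{
	struct mc_model *mci = malloc(sizeof(*mci));

	if (!mci)
		return NULL;

	mci->mc_idx = idx;
	mci->pvt_info = pvt;
	mci->next = mc_list;
	mc_list = mci;
	return mci;
}

/* Walk the list; return the controller registered under idx, or NULL. */
static struct mc_model *mc_model_find(int idx)
{
	struct mc_model *mci;

	for (mci = mc_list; mci; mci = mci->next)
		if (mci->mc_idx == idx)
			return mci;

	return NULL;
}

/* Decode path: returns 1 if a controller was found, 0 if there is none. */
static int decode_model(int node_id)
{
	struct mc_model *mci = mc_model_find(node_id);

	if (!mci)
		return 0;	/* nothing registered: bail out before any dereference */

	/* From here on it is safe to use mci->pvt_info. */
	return 1;
}
```

With this shape there is no array bound to outgrow: decode_model(20)
returns 0 until something calls mc_model_add(20, ...), and 1 afterwards.
The cost is a linear walk per lookup.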

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.