Re: [PATCH v2 3/3] ie31200_edac: Introduce the driver

From: Borislav Petkov
Date: Fri Jul 04 2014 - 10:13:41 EST


On Thu, Jun 26, 2014 at 09:58:35PM +0000, Jason Baron wrote:
> Add 'ie31200_edac' driver for the E3-1200 series of Intel DRAM controllers.
> Driver is based on the following E3-1200 specs:
>
> http://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200-family-vol-2-datasheet.html
> http://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-vol-2-datasheet.html
>
> I've tested this on bad memory hardware, and observed correlating bad reads
> and uncorrected memory errors as reported by the driver.
>
> Tested against:
>
> CPU E3-1270 v3 @ 3.50GHz : 8086:0c08 (haswell)
> CPU E3-1270 V2 @ 3.50GHz : 8086:0158 (ivy bridge)
> CPU E31270 @ 3.40GHz : 8086:0108 (sandy bridge)
>
> Signed-off-by: Jason Baron <jbaron@xxxxxxxxxx>

Applied albeit with some remarks, see below.

> diff --git a/drivers/edac/ie31200_edac.c b/drivers/edac/ie31200_edac.c
> new file mode 100644
> index 0000000..feb9edb
> --- /dev/null
> +++ b/drivers/edac/ie31200_edac.c
> @@ -0,0 +1,542 @@
> +/*
> + * Intel E3-1200
> + * Copyright (C) 2014 Jason Baron <jbaron@xxxxxxxxxx>
> + *
> + * Support for the E3-1200 processor family. Heavily based on previous
> + * Intel EDAC drivers.
> + *
> + * Since the DRAM controller is on the cpu chip, we can use its PCI device
> + * id to identify these processors.
> + *
> + * PCI DRAM controller device ids (Taken from The PCI ID Repository - http://pci-ids.ucw.cz/)
> + *
> + * 0108: Xeon E3-1200 Processor Family DRAM Controller
> + * 010c: Xeon E3-1200/2nd Generation Core Processor Family DRAM Controller
> + * 0150: Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller
> + * 0158: Xeon E3-1200 v2/Ivy Bridge DRAM Controller
> + * 015c: Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller
> + * 0c04: Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller
> + * 0c08: Xeon E3-1200 v3 Processor DRAM Controller
> + *
> + * Based on Intel specification:
> + * http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e3-1200v3-vol-2-datasheet.pdf
> + * http://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200-family-vol-2-datasheet.html
> + *
> + * According to the above datasheet (p.16):
> + * "
> + * 6. Software must not access B0/D0/F0 32-bit memory-mapped registers with
> + * requests that cross a DW boundary.
> + * "
> + *
> + * Thus, we make use of the explicit: lo_hi_readq(), which breaks the readq into
> + * 2 readl() calls. This restriction may be lifted in subsequent chip releases,
> + * but lo_hi_readq() ensures that we are safe across all e3-1200 processors.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/pci.h>
> +#include <linux/pci_ids.h>
> +#include <linux/edac.h>
> +
> +#include <asm-generic/io-64-nonatomic-lo-hi.h>
> +#include "edac_core.h"
> +
> +#define IE31200_REVISION "1.0"
> +#define EDAC_MOD_STR "ie31200_edac"
> +
> +#define ie31200_printk(level, fmt, arg...) \
> + edac_printk(level, "ie31200", fmt, ##arg)
> +
> +#define PCI_DEVICE_ID_INTEL_IE31200_HB_1 0x0108
> +#define PCI_DEVICE_ID_INTEL_IE31200_HB_2 0x010c
> +#define PCI_DEVICE_ID_INTEL_IE31200_HB_3 0x0150
> +#define PCI_DEVICE_ID_INTEL_IE31200_HB_4 0x0158
> +#define PCI_DEVICE_ID_INTEL_IE31200_HB_5 0x015c
> +#define PCI_DEVICE_ID_INTEL_IE31200_HB_6 0x0c04
> +#define PCI_DEVICE_ID_INTEL_IE31200_HB_7 0x0c08
> +
> +#define IE31200_DIMMS 4
> +#define IE31200_RANKS 8
> +#define IE31200_RANKS_PER_CHANNEL 4
> +#define IE31200_DIMMS_PER_CHANNEL 2
> +#define IE31200_CHANNELS 2
> +
> +/* Intel IE31200 register addresses - device 0 function 0 - DRAM Controller */
> +#define IE31200_MCHBAR_LOW 0x48
> +#define IE31200_MCHBAR_HIGH 0x4c
> +#define IE31200_MCHBAR_MASK GENMASK_ULL(38, 15)
> +#define IE31200_MMR_WINDOW_SIZE (1 << 15)
> +
> +/*
> + * Error Status Register (16b)
> + *
> + * 15 reserved
> + * 14 Isochronous TBWRR Run Behind FIFO Full
> + * (ITCV)
> + * 13 Isochronous TBWRR Run Behind FIFO Put
> + * (ITSTV)
> + * 12 reserved
> + * 11 MCH Thermal Sensor Event
> + * for SMI/SCI/SERR (GTSE)
> + * 10 reserved
> + * 9 LOCK to non-DRAM Memory Flag (LCKF)
> + * 8 reserved
> + * 7 DRAM Throttle Flag (DTF)
> + * 6:2 reserved
> + * 1 Multi-bit DRAM ECC Error Flag (DMERR)
> + * 0 Single-bit DRAM ECC Error Flag (DSERR)
> + */
> +#define IE31200_ERRSTS 0xc8
> +#define IE31200_ERRSTS_UE BIT(1)
> +#define IE31200_ERRSTS_CE BIT(0)
> +#define IE31200_ERRSTS_BITS (IE31200_ERRSTS_UE | IE31200_ERRSTS_CE)
> +
> +/*
> + * Channel 0 ECC Error Log (64b)
> + *
> + * 63:48 Error Column Address (ERRCOL)
> + * 47:32 Error Row Address (ERRROW)
> + * 31:29 Error Bank Address (ERRBANK)
> + * 28:27 Error Rank Address (ERRRANK)
> + * 26:24 reserved
> + * 23:16 Error Syndrome (ERRSYND)
> + * 15: 2 reserved
> + * 1 Multiple Bit Error Status (MERRSTS)
> + * 0 Correctable Error Status (CERRSTS)
> + */
> +#define IE31200_C0ECCERRLOG 0x40c8
> +#define IE31200_C1ECCERRLOG 0x44c8
> +#define IE31200_ECCERRLOG_CE BIT(0)
> +#define IE31200_ECCERRLOG_UE BIT(1)
> +#define IE31200_ECCERRLOG_RANK_BITS GENMASK_ULL(28, 27)
> +#define IE31200_ECCERRLOG_RANK_SHIFT 27
> +#define IE31200_ECCERRLOG_SYNDROME_BITS GENMASK_ULL(23, 16)
> +#define IE31200_ECCERRLOG_SYNDROME_SHIFT 16
> +
> +#define IE31200_ECCERRLOG_SYNDROME(log) \
> + ((log & IE31200_ECCERRLOG_SYNDROME_BITS) >> \
> + IE31200_ECCERRLOG_SYNDROME_SHIFT)
> +
> +#define IE31200_CAPID0 0xe4
> +#define IE31200_CAPID0_PDCD BIT(4)
> +#define IE31200_CAPID0_DDPCD BIT(6)
> +#define IE31200_CAPID0_ECC BIT(1)
> +
> +#define IE31200_MAD_DIMM_0_OFFSET 0x5004
> +#define IE31200_MAD_DIMM_SIZE GENMASK_ULL(7, 0)
> +#define IE31200_MAD_DIMM_A_RANK BIT(17)
> +#define IE31200_MAD_DIMM_A_WIDTH BIT(19)
> +
> +#define IE31200_PAGES(n) \
> + (n << (28 - PAGE_SHIFT))

I realigned all those defines, see attached final patch.

...

> +static int ie31200_probe1(struct pci_dev *pdev, int dev_idx)
> +{
> + int rc;
> + int i, j;
> + struct mem_ctl_info *mci = NULL;
> + struct edac_mc_layer layers[2];
> + struct dimm_data dimm_info[IE31200_CHANNELS][IE31200_DIMMS_PER_CHANNEL];
> + void __iomem *window;
> + struct ie31200_priv *priv;
> + u32 addr_decode;
> +
> + edac_dbg(0, "MC:\n");
> +
> + if (!ecc_capable(pdev)) {
> + ie31200_printk(KERN_INFO, "No ECC support\n");
> + return -ENODEV;
> + }
> +
> + window = ie31200_map_mchbar(pdev);
> + if (!window)
> + return -ENODEV;
> +
> + /* populate DIMM info */
> + for (i = 0; i < IE31200_CHANNELS; i++) {
> + addr_decode = readl(window + IE31200_MAD_DIMM_0_OFFSET +
> + (i * 4));
> + edac_dbg(0, "addr_decode: 0x%x\n", addr_decode);
> + for (j = 0; j < IE31200_DIMMS_PER_CHANNEL; j++) {
> + dimm_info[i][j].size = (addr_decode >> (j * 8)) &
> + IE31200_MAD_DIMM_SIZE;
> + dimm_info[i][j].dual_rank = (addr_decode &
> + (IE31200_MAD_DIMM_A_RANK << j)) ? 1 : 0;
> + dimm_info[i][j].x16_width = (addr_decode &
> + (IE31200_MAD_DIMM_A_WIDTH << j)) ? 1 : 0;
> + edac_dbg(0, "size: 0x%x, rank: %d, width: %d\n",
> + dimm_info[i][j].size,
> + dimm_info[i][j].dual_rank,
> + dimm_info[i][j].x16_width);
> + }
> + }
> +
> + nr_channels = how_many_channels(pdev);
> +
> + layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
> + layers[0].size = IE31200_DIMMS;
> + layers[0].is_virt_csrow = true;
> + layers[1].type = EDAC_MC_LAYER_CHANNEL;
> + layers[1].size = nr_channels;
> + layers[1].is_virt_csrow = false;
> + mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers,
> + sizeof(struct ie31200_priv));
> +
> + rc = -ENOMEM;
> + if (!mci)
> + goto fail_unmap;

Just an idea:

You can move the edac_mc_alloc() allocation before mapping of the mchbar
so that in the failure path you can save yourself all the work you've
done in populating dimm info above needlessly.

But you can do that in a follow-on patch along with the MAINTAINERS one.
You could use the for-next branch here as a base:

git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--