Re: [PATCH 0/6] Add a per-dimm structure

From: Borislav Petkov
Date: Sun Mar 11 2012 - 07:35:04 EST


On Fri, Mar 09, 2012 at 04:46:53PM -0300, Mauro Carvalho Chehab wrote:

[..]

> > Right, what I mean is that the rank?/ already contains some info:
> >
> > rank0/
> > |-- dimm_dev_type
> > |-- dimm_edac_mode
> > |-- dimm_label
> > |-- dimm_location
> > |-- dimm_mem_type
> > `-- dimm_size
> >
> > Now, we do the CE/UE error counting on a per-rank granularity anyway, so
> > the most natural way to have that is to add those counts to the ranks:
> >
> > rank0/
> > |-- dimm_dev_type
> > |-- dimm_edac_mode
> > |-- dimm_label
> > |-- dimm_location
> > |-- dimm_mem_type
> > |-- CE
> > |-- UE
> > `-- dimm_size
> >
> > And this has to be _very_ easy to do without any adding additional
> > sysfs nodes with ugly names to /sys/devices/system/edac etc. This is
> > even better grouping than the mc?/-based hierarchy I suggested above,
> > actually.
>
> Agreed. Yeah, it is easy to add CE/UE there. I actually implemented it
> on one of my internal patches, but there's an issue:
>
> The typical case for UE is to report errors by cacheline (128 bits), and
> not by DIMM. This happens on all FB-DIMM memory controllers, and also on
> several CS-based ones.
>
> For example, this is how (currently) the amd64_handle_ue() handles an
> Uncorrected Error:
>
> error_address_to_page_and_offset(sys_addr, &page, &offset);
> edac_mc_handle_ue(log_mci, page, offset, csrow, EDAC_MOD_STR);
>
> There's no channel info there.

Right, this looks like a largely untested path which has been that way
since forever. But, since UEs generally cause the machine to syncflood
and warm reset (now, at least), I don't think it makes a whole lot of
sense to even have such a counter - if we did, it would either say 0 or
1 :).

So, I'd suggest the UE counter to be optional and to let the driver
decide whether it wants it or not.

[..]

> One alternative would simply to remove all those intermediate
> counters, letting userspace to count the errors via perf (provided
> that we have a proper location field).

Yes, that would be where we want to go eventually because I too don't
see any reason for those counters. Besides, they don't decay over time,
for example, say you have a DIMM which experiences a temporary failure
and generates k CEs. Then, the source of that error disappears and the
DIMM works fine for months.

Now, when you look at the counters, you'll still see k CEs in one of its
ranks which doesn't tell you when those errors happened and what their
rate was, etc.

So, I'm fine with dropping those counters since they don't give you the
flexibility of a userspace tool and they don't work properly anyway.

HOWEVER, I don't know who uses them still so probably a deprecation
warning is in order here...


>
> Regards,
> Mauro
>

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/