RE: [PATCH] EDAC/versal: Report PFN and page offset for DDR errors

From: Datta, Shubhrajyoti

Date: Mon Apr 27 2026 - 02:28:16 EST


Public

> -----Original Message-----
> From: Srivatsa S. Bhat <srivatsa@xxxxxxxxxxxxx>
> Sent: Wednesday, April 22, 2026 9:20 AM
> To: Datta, Shubhrajyoti <shubhrajyoti.datta@xxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-edac@xxxxxxxxxxxxxxx; git (AMD-Xilinx)
> <git@xxxxxxx>; ptsm@xxxxxxxxxxxxxxxxxxx; shubhrajyoti.datta@xxxxxxxxx;
> Borislav Petkov <bp@xxxxxxxxx>; Tony Luck <tony.luck@xxxxxxxxx>
> Subject: Re: [PATCH] EDAC/versal: Report PFN and page offset for DDR errors
>
> Caution: This message originated from an External Source. Use proper caution
> when opening attachments, clicking links, or responding.
>
>
> On Wed, Apr 15, 2026 at 11:32:39AM +0530, Shubhrajyoti Datta wrote:
> > Currently, DDRMC correctable and uncorrectable error events are
> > reported to EDAC with page frame number (pfn) and offset set to zero.
> > This information is not useful to locate the address for memory errors.
> >
> > Compute the physical address from the error information and extract
> > the page frame number and offset before calling edac_mc_handle_error().
> > This provides the actual memory location information to the userspace.
> >
> > Fixes: 6f15b178cd63 ("EDAC/versal: Add a Xilinx Versal memory
> > controller driver")
> > Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xxxxxxx>
> > ---
> >
> > drivers/edac/versal_edac.c | 36 +++++++++++++++++-------------------
> > 1 file changed, 17 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/edac/versal_edac.c b/drivers/edac/versal_edac.c
> > index 5a43b5d43ca2..18045f96610e 100644
> > --- a/drivers/edac/versal_edac.c
> > +++ b/drivers/edac/versal_edac.c
> > @@ -414,34 +414,32 @@ static unsigned long convert_to_physical(struct
> > edac_priv *priv, union ecc_error static void handle_error(struct
> > mem_ctl_info *mci, struct ecc_status *stat) {
>
> [...]
>
> > if (stat->error_type == XDDR_ERR_TYPE_CE) {
>
> [...]
>
> > + } else if (stat->error_type == XDDR_ERR_TYPE_UE) {
>
> [...]
> > + } else {
> > + return;
>
> I like the cleanup contributed by this patch (in terms of reducing code
> duplication) in addition to the actual fix. However, this patch also introduces a
> subtle behavior change - the existing code calls
> memset() to clear out the ecc_status struct unconditionally, but this patch
> doesn't call memset if the error type is not CE or UE (i.e., in the early return
> path).
>
> Was this change intentional? Wouldn't it potentially cause stale data to be left
> over in the ecc_status struct, affecting future reuse?

Hi Srivatsa,
Thanks for the review.

The early return in handle_error is intentionally safe. get_error_info is always called before handle_error:

if (get_error_info(priv))
return;
handle_error(mci, &priv->stat);

get_error_info already guards against non-CE/UE events:

if (!eccr0_ceval && !eccr1_ceval && !eccr0_ueval && !eccr1_ueval)
return 1;

So handle_error is only ever reached when at least one CE or UE error is present, making the else { return; } branch unreachable in practice. The removed memset in that path was overprotective dead code.

For future events, get_error_info always reloads the data fresh before handle_error runs, so there is no risk of stale data.

I can add a comment above handle_error noting this precondition if that would help future readers.

Regards,
Shubhrajyoti


>
> [...]
>
> > +
> > memset(stat, 0, sizeof(*stat));
> > }
> >
>
> If the expectation is to actually clear it out unconditionally, it would be great to
> document it in the comments (if not done already).
>
> Thank you!
>
> Regards,
> Srivatsa
> Microsoft Linux Systems Group