Re: [patch] Remove limit on MCA recoveries

From: Russ Anderson
Date: Mon Jan 17 2005 - 14:28:41 EST


Matthias Fouquet-Lapar wrote:
> Keith Owens wrote:
> > Russ Anderson <rja@xxxxxxx> wrote:
> > >The MCA recovery driver saves the addresses of memory errors
> > >in an array. The array has 32 entries. The effect is
> > >that after 32 recoveries, the driver stops recovering.
> > >
> > >This patch removes the page_isolate array. Since the array
> > >was only used to see if the page is already marked reserved,
> > >check the reserved bit instead of the array.
> >
> > lkcd dumps kernel pages marked reserved, so lkcd will try to process
> > isolated pages. We will eventually need to add a new page flag to mark
> > faulty pages.
>
> Probably any other dump mechanism should be aware of bad HW pages as well,
> so we might be better off to add a flag right away. While we are at it I
> would propose to have actually two flags :
>
> - hard error (which will cause a MCA and should be skipped when taking
> a system dump)
> - soft error (page has encountered SBE, so we might want to avoid future
> allocation, but it can be dumped without causing an MCA)

Yes, there should be page->flags to indicate hard and soft memory errors,
such as PG_hard_error and PG_soft_error. The dump code could look at those
flags.

include/linux/page-flags.h has PG_error for indicating I/O errors, which
is close but not quite what we need, given the way it is used.

CCing linux-kernel since the flags are not ia64 specific.

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@xxxxxxx
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/