Re: [RFC 0/9] mce recovery for Sandy Bridge server

From: Ingo Molnar
Date: Wed May 25 2011 - 09:44:51 EST

Next message: Michal Marek: "Re: [PATCH 4/6] kbuild: Move firmware images to a separate RPM"
Previous message: Bruce Stenning: "RE: sata_mv port lockup on hotplug (kernel 2.6.38.2)"
In reply to: Joerg Roedel: "Re: [RFC 0/9] mce recovery for Sandy Bridge server"
Next in thread: Tony Luck: "Re: [RFC 0/9] mce recovery for Sandy Bridge server"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

* Luck, Tony <tony.luck@xxxxxxxxx> wrote:

> Other points noted - I'll go look at the previous discussion threads
> you gave me links to.
>
> I do want to comment on this point:
> > Creating a callback there would be a good place to do the TIF_MCE work and also
> > to extract any events that got queued by other NMIs. Note that more events
> > might be queued by further NMIs while we are processing the MCE path - while
> > with the task->mce_error_pfn hack we are limited to a single pending event only
> > and subsequent NMIs will overwrite this value!
>
> I wasn't very happy with task->mce_error_pfn either - but being
> overwritten is not one of its flaws. The task that stumbled on the
> error must not be run until the error is dealt with - any other
> NMIs for other errors must be happening to other tasks (who have
> their own task->mce_error_pfn).

If that task is running right now you might not have much of a chance
but to be prepared for a repeat NMI the moment NMIs are enabled again
after the IRET.

So repeat errors are not just

But yeah, this is not the worst ugliness of it indeed.

> > A happy side effect is that the TIF_MCE_NOTIFY hack could go away
> > as well.
>
> We need some way to stop the task that found the error dead in its
> tracks - if it tripped over a data error, then running it will just
> trip over the same error again. If it had a memory error during an
> instruction fetch we have no place to return to.

Well, the primary thing TIF_MCE_NOTIFY does is a roundabout way to
iterate through repeat calls to memory_failure(), with all pfns that
got buffered so far.

We already have a generic facility to do such things at
return-to-userspace: _TIF_USER_RETURN_NOTIFY.

Btw., the SIGKILL logic is probably overcomplicated: when it's clear
that user-space can not recover why not do a do_exit() and be done
with it? As long as it's called from a syscall level codepath and no
locks/resources are held do_exit() can be called.

> So can we talk about this part for a while before returning to the
> "how to report this" discussion?
>
> So here's the situation - we are in the NMI handler when we find
> from looking at the machine check bank registers that we have a
> recoverable error. We know the physical address, and we know the
> task (which might have been in user or kernel context). I can
> package that information into a perf/event ... but then how can I
> mark the current task as not-fit-for-execution?

Well, you'd queue up a user return notifier, process the ring-buffer
at that end (as a ring-buffer consumer), call policy action like
memory_failure() which does its MM specific checks to see whether the
task is recoverable or not.

The goal is that we wouldnt lose any existing hwpoison functionality
over what hwpoison does today - we'd *gain* functionality:

- through the use of enumerated events we would give access to
*other* users of the same event source as well, not just hwpoison.

- we'd also eventually remove the mcelog ring-buffer mechanism
itself.

For this i'd suggest to take a look at Frederic's recent patches on
lkml which decouple the events code some more:

Subject: [PATCH v2] hw_breakpoint: Let the user choose not to build it (and perf too)

Those could be compounded with more decoupling of perf events: for
example right now 'struct perf_event' is pretty big, because it
carries hw PMU details as well. But those details are not needed for
non-PMU events so splitting those from struct perf_event would
streamline this code some more.

Likewise the kernel/events/core.c ring-buffer bits could be split out
into kernel/events/ring-buffer.c - i think Frederic might be working
on this one already.

More advanced steps in the MCE area would be to replace the
'severity' logic (which is really a poor-man's filter) with event
filters and panic the box (or do other immediate action) if the
filter matches.

This again would not lose any functionality that the severity bits
give us today, we'd gain functionality:

- the conditions in filter expressions are pretty flexible so we
could do more than the current stack of severity bitmask ops. For
example a system could be configured to ignore the first N
non-fatal messages but panic the box if more than 10 errors were
generated. If there's a "message_seq_nr" field available in the
TRACE_EVENT() then this would be a simple "message_seq_nr >= 10"
filter condition. With the current severity code this would have
to be coded in, the ABI extended, etc. etc.

Although initially it would probably want to match what the severity
filters do today and install all the default policy.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Michal Marek: "Re: [PATCH 4/6] kbuild: Move firmware images to a separate RPM"
Previous message: Bruce Stenning: "RE: sata_mv port lockup on hotplug (kernel 2.6.38.2)"
In reply to: Joerg Roedel: "Re: [RFC 0/9] mce recovery for Sandy Bridge server"
Next in thread: Tony Luck: "Re: [RFC 0/9] mce recovery for Sandy Bridge server"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]