Re: [patch 1/2] x86_64 page fault NMI-safe

From: Mathieu Desnoyers
Date: Wed Jul 14 2010 - 13:06:26 EST


* Linus Torvalds (torvalds@xxxxxxxxxxxxxxxxxxxx) wrote:
> On Wed, Jul 14, 2010 at 8:49 AM, Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:

(I was quoting Peter Anvin below) ;)

> >> I think you're vastly overestimating what is sane to do from an NMI
> >> context.  It is utterly and totally insane to assume vmalloc is available
> >> in NMI.
>
> I agree that NMI handlers shouldn't touch vmalloc space. But now that
> percpu data is mapped through the VM, I do agree that other CPU's may
> potentially need to touch that data, and an interrupt (including an
> NMI) might be the first to create the mapping.
>
[...]
> So please just document the sequence that actually needs the page
> table setup for the NMI/percpu case.
>
> This patch (1/2) doesn't look horrible per se. I have no problems with
> it. I just want to understand why it is needed.

The problem originally addressed by this patch is the case where an NMI handler
tries to access vmalloc'd per-cpu data, which goes as follows:

- One CPU does a fork(), which copies the basic kernel mappings.
- Perf allocates percpu memory for buffer control data structures.
  This mapping does not get copied.
- Tracing is activated.
- switch_to() to the newly forked process, which missed the new percpu
  allocation.
- We take an NMI, which touches the vmalloc'd percpu memory in the Perf tracing
  handler, therefore leading to a page fault in NMI context. Here, we might be
  in the middle of switch_to(), where ->current might not be in sync with the
  current cr3 register.
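
To make the sequence above concrete, here is a rough user-space model of it.
This is not kernel code: the toy_* names, the slot count and the flat
"page table" are all made up for illustration, and the model ignores
everything that makes the real fault path hard in NMI context (->current,
cr3, iret). It only shows why a table copied at fork() time misses a later
mapping and has to take a fault to pick it up.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_KERNEL_SLOTS 8        /* hypothetical number of kernel entries */

/* Hypothetical stand-in for the kernel part of a top-level page table. */
struct toy_pgd {
	bool present[TOY_KERNEL_SLOTS];
};

/* Reference table, standing in for the kernel's master mappings. */
static struct toy_pgd toy_reference;

/* fork(): the child gets a snapshot of what is mapped right now. */
void toy_fork(struct toy_pgd *child)
{
	*child = toy_reference;
}

/* A vmalloc-backed allocation only updates the reference table. */
void toy_vmalloc_map(int slot)
{
	assert(slot >= 0 && slot < TOY_KERNEL_SLOTS);
	toy_reference.present[slot] = true;
}

/*
 * Touch a slot through a given process table. Returns true if the access
 * had to go through the (simulated) fault path, which copies the missing
 * entry from the reference table.
 */
bool toy_access(struct toy_pgd *pgd, int slot)
{
	assert(slot >= 0 && slot < TOY_KERNEL_SLOTS);
	if (pgd->present[slot])
		return false;               /* already mapped, no fault */

	/* Lazy sync only makes sense if the reference table has the entry. */
	assert(toy_reference.present[slot]);
	pgd->present[slot] = true;
	return true;                        /* this is the fault the NMI hits */
}
```

In this model, a child forked before toy_vmalloc_map() faults on its first
access to the new slot and not on later ones. In the real case, that first
access is the one that can happen from the NMI handler.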

The three choices I am aware of for handling this are:
1) supporting page faults in NMI context, which implies removing the ->current
dependency and supporting an iret-less return path.
2) duplicating the percpu alloc API with a variant that maps to kmalloc.
3) using vmalloc_sync_all() after creating the mapping (only works for x86_64,
not x86_32).

Choice 3 seems like a no-go on x86_32, and choice 2 seems like a last resort
(it involves API duplication and reserving a fixed amount of per-cpu memory at
boot). Hence the proposal of choice 1.
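
For comparison, the idea behind choice 3 can be sketched with the same kind
of toy user-space model. Again, this is not kernel code and does not
reproduce what vmalloc_sync_all() really does: the sync_* names and sizes are
assumptions made up for illustration. The point is only that pushing the new
entry into every existing table right after creating the mapping means no
later access, from NMI or anywhere else, needs to fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SYNC_SLOTS      8   /* hypothetical number of kernel entries */
#define SYNC_MAX_TABLES 4   /* hypothetical cap on process tables */

struct sync_pgd {
	bool present[SYNC_SLOTS];
};

struct sync_world {
	struct sync_pgd reference;               /* master mappings */
	struct sync_pgd tables[SYNC_MAX_TABLES]; /* one per process */
	size_t nr_tables;
};

/* fork(): register a new table holding a snapshot of the reference. */
struct sync_pgd *sync_fork(struct sync_world *w)
{
	assert(w->nr_tables < SYNC_MAX_TABLES);
	struct sync_pgd *child = &w->tables[w->nr_tables++];
	*child = w->reference;
	return child;
}

/* Create a new mapping in the reference table only. */
void sync_map(struct sync_world *w, int slot)
{
	assert(slot >= 0 && slot < SYNC_SLOTS);
	w->reference.present[slot] = true;
}

/* Eagerly propagate every reference entry to all existing tables. */
void sync_all(struct sync_world *w)
{
	for (size_t t = 0; t < w->nr_tables; t++)
		for (int s = 0; s < SYNC_SLOTS; s++)
			if (w->reference.present[s])
				w->tables[t].present[s] = true;
}

/* Would touching this slot through this table need the fault path? */
bool sync_would_fault(const struct sync_pgd *pgd, int slot)
{
	assert(slot >= 0 && slot < SYNC_SLOTS);
	return !pgd->present[slot];
}
```

In the model, calling sync_all() right after sync_map() leaves no table that
would fault on the new slot, at the cost of walking every table at
allocation time.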

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com