Re: NMI between switch_mm and switch_to

From: Ingo Molnar
Date: Mon Aug 03 2009 - 06:43:18 EST



* Paul Mackerras <paulus@xxxxxxxxx> wrote:

> Ingo Molnar writes:
>
> > * Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> >
> > > On Tue, 2009-07-28 at 14:49 +1000, Paul Mackerras wrote:
> > >
> > > > Ben H. suggested there might be a problem if we get a PMU
> > > > interrupt and try to do a stack trace of userspace in the
> > > > interval between when we call switch_mm() from
> > > > sched.c:context_switch() and when we call switch_to(). If we
> > > > get an NMI in that interval and do a stack trace of userspace,
> > > > we'll see the registers of the old task but when we peek at user
> > > > addresses we'll see the memory image for the new task, so the
> > > > stack trace we get will be completely bogus.
> > > >
> > > > Is this in fact also a problem on x86, or is there some subtle
> > > > reason why it can't happen there?
> > >
> > > I can't spot one, maybe Ingo can when he's back :-)
> > >
> > > So I think this is very good spotting from Ben.
> >
> > Yeah.
> >
> > > We could use preempt notifiers (or put in our own hooks) to
> > > disable callchains during the context switch I suppose.
> >
> > I think we should only disable user call-chains - the
> > in-kernel call-chain is still reliable.
> >
> > Also, I think we don't need preempt notifiers, we can use a simple
> > check like this:
> >
> > 	if (current->mm &&
> > 	    cpu_isset(smp_processor_id(), current->mm->cpu_vm_mask)) {
>
> On x86, do you clear the current processor's bit in cpu_vm_mask
> when you switch the MMU away from a task? We don't on powerpc,
> which would render the above test incorrect. (But then we don't
> actually have the problem on powerpc since interrupts get
> hard-disabled in switch_mm and stay hard-disabled until they get
> soft-enabled.)

This is what x86 does in arch/x86/include/asm/mmu_context.h:

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
			     struct task_struct *tsk)
{
	unsigned cpu = smp_processor_id();

	if (likely(prev != next)) {
		/* stop flush ipis for the previous mm */
		cpu_clear(cpu, prev->cpu_vm_mask);
#ifdef CONFIG_SMP
		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
		percpu_write(cpu_tlbstate.active_mm, next);
#endif
		cpu_set(cpu, next->cpu_vm_mask);

		/* Re-load page tables */
		load_cr3(next->pgd);

		/*
		 * load the LDT, if the LDT is different:
		 */
		if (unlikely(prev->context.ldt != next->context.ldt))
			load_LDT_nolock(&next->context);
	}
#ifdef CONFIG_SMP
	else {
		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
		BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);

		if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) {
			/* We were in lazy tlb mode and leave_mm disabled
			 * tlb flush IPI delivery. We must reload CR3
			 * to make sure to use no freed page tables.
			 */
			load_cr3(next->pgd);
			load_LDT_nolock(&next->context);
		}
	}
#endif
}

which suggests to me that cpu_vm_mask is precise on x86: switch_mm()
clears this CPU's bit in the old mm and sets it in the new mm.
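
To make the consequence for the proposed check concrete, here is a rough
userspace model of the window between switch_mm() and switch_to(). It is
only a sketch and not kernel code: struct mm_model, struct task_model,
current_task, THIS_CPU and all the model_* helpers are hypothetical
stand-ins, and only the mask updates of the prev != next path are
modelled (no CR3, LDT or lazy-TLB handling).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define THIS_CPU 0	/* hypothetical stand-in for smp_processor_id() */

/* Hypothetical stand-in for mm_struct: one bit per CPU. */
struct mm_model {
	unsigned long cpu_vm_mask;
};

/* Hypothetical stand-in for task_struct; mm is NULL for kernel threads. */
struct task_model {
	struct mm_model *mm;
};

/* Hypothetical stand-in for 'current'. */
static struct task_model *current_task;

/* Models only the mask updates done on the prev != next path. */
static void model_switch_mm(struct mm_model *prev, struct mm_model *next)
{
	if (prev != next) {
		prev->cpu_vm_mask &= ~(1UL << THIS_CPU);
		next->cpu_vm_mask |= 1UL << THIS_CPU;
		/* the real switch_mm() reloads CR3 at this point */
	}
}

/* The proposed check: does current's mm own this CPU's page tables? */
static bool model_user_callchain_ok(void)
{
	assert(current_task != NULL);
	return current_task->mm != NULL &&
	       (current_task->mm->cpu_vm_mask & (1UL << THIS_CPU)) != 0;
}

/* Records the check result before, inside and after the window. */
static void model_context_switch_window(int result[3])
{
	struct mm_model mm_a = { 1UL << THIS_CPU };
	struct mm_model mm_b = { 0 };
	struct task_model a = { &mm_a };
	struct task_model b = { &mm_b };

	current_task = &a;
	result[0] = model_user_callchain_ok();	/* task A running */

	model_switch_mm(&mm_a, &mm_b);		/* an NMI could hit here */
	result[1] = model_user_callchain_ok();	/* current is still A */

	current_task = &b;			/* switch_to() has completed */
	result[2] = model_user_callchain_ok();
}
```

In this model the check passes before the switch, fails inside the window
(current is still the old task, but its mm no longer has this CPU's bit
set), and passes again once current points at the new task. That is the
behaviour we would want for skipping user call-chains, provided the mask
really is maintained this way.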

Ingo