Re: irq lock inversion

From: Ingo Molnar
Date: Fri Nov 06 2009 - 02:59:23 EST



* Tejun Heo <tj@xxxxxxxxxx> wrote:

> Ingo Molnar wrote:
> >>> This warning is bogus -- sched_init() is being called very early with IRQs
> >>> disabled, and the irqsave/restore code paths in pcpu_alloc() are only for early
> >>> init. The path can never be called from irq context once the early init
> >>> finishes. Rationale for this is explained in changelog of the commit mentioned
> >>> above.
> >>>
> >>> This problem can be encountered generally in any other early code running
> >>> with IRQs off and using irqsave/irqrestore.
> >>>
> >>> Reported-by: Yinghai Lu <yhlu.kernel@xxxxxxxxx>
> >>> Signed-off-by: Jiri Kosina <jkosina@xxxxxxx>
> >> Looks good to me. Ingo, what do you think?
> >
> > Ugh, this explanation is _BOGUS_. As i said, taking a lock with irqs
> > disabled does _NOT_ mark a lock as 'irq safe' - if it did, we'd have
> > false positives left and right.
> >
> > Read the lockdep message please, consider all the backtraces it prints,
> > it says something different.
>
> Ah... okay, the pcpu_free() path is correctly marking the lock
> irqsafe. I assumed this was caused by recent pcpu_alloc() change.
> Sorry about that. The lock inversion problem has always been there,
> it just never showed up because no one has used an allocation map that
> large, I suppose.
>
> So, the correct fix would be either 1. push irqsafeness down to
> vmalloc locks or 2. the rather ugly unlock-lock dancing in
> pcpu_extend_area_map() I posted earlier. For 2.6.32, I guess we'll
> have to go with #2. For longer term, we'll probably have to do #1 as
> it's required to implement atomic percpu allocations too.
>
> I'll try to reproduce the problem here and verify the previous locking
> dance patch.

I haven't looked deeply but at first sight i'm not 100% sure that even
the lock dance hack is safe - doesn't vfree() do TLB flushes, which must
be done with irqs enabled in general? If yes, then the whole notion of
using the allocator from irqs-off sections is wrong and the flags
save/restore is misguided (or at least incomplete).
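
For reference, the "lock dance" pattern under discussion looks roughly
like the sketch below. This is a userspace analogy only, not the actual
pcpu_extend_area_map() patch: a pthread mutex stands in for the irq-safe
spinlock, calloc()/free() stand in for the vmalloc-backed allocation and
vfree(), and struct area_map and extend_area_map() are hypothetical
names. The point it illustrates is that the lock is dropped around the
allocation and around the free, and the map size is re-checked after
every re-lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the allocator's area map (not the real pcpu structures). */
struct area_map {
    pthread_mutex_t lock;  /* stands in for the irq-safe spinlock */
    int *map;              /* current map storage */
    size_t alloc;          /* number of slots allocated in map */
};

/*
 * Must be called with m->lock held; returns with m->lock held.
 * Returns 0 once the map has at least `needed` slots, -1 if allocation fails.
 */
static int extend_area_map(struct area_map *m, size_t needed)
{
    while (m->alloc < needed) {
        size_t new_alloc = m->alloc ? m->alloc * 2 : 16;
        int *new_map, *stale;

        while (new_alloc < needed)
            new_alloc *= 2;

        /* Dance step 1: drop the lock around the (possibly blocking) allocation. */
        pthread_mutex_unlock(&m->lock);
        new_map = calloc(new_alloc, sizeof(*new_map));
        pthread_mutex_lock(&m->lock);
        if (!new_map)
            return -1;

        /* State may have changed while unlocked, so re-check before installing. */
        if (new_alloc > m->alloc) {
            if (m->alloc)
                memcpy(new_map, m->map, m->alloc * sizeof(*new_map));
            stale = m->map;
            m->map = new_map;
            m->alloc = new_alloc;
        } else {
            stale = new_map;  /* someone else already extended far enough */
        }

        /* Dance step 2: the free also happens with the lock dropped. */
        pthread_mutex_unlock(&m->lock);
        free(stale);
        pthread_mutex_lock(&m->lock);
    }

    assert(m->alloc >= needed);
    return 0;
}
```

A caller would take m->lock, call extend_area_map(m, n), and use m->map
only after it returns 0, still under the lock. Note that the sketch says
nothing about whether the free is legal in the caller's context, which
is exactly the concern raised above.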

So the real problem right now i think is the use of the pcpu allocator
from within a BH section (and from irqs-off sections) - that usage
should be eliminated from .32, or the allocator should be fixed. (The
latter looks non-trivial: vmalloc/vfree was never really intended to be
used in irq-atomic contexts.)

Ingo