Re: PATCH: Race in 2.6.0-test2 timer code

From: Andrea Arcangeli (andrea@suse.de)
Date: Wed Jul 30 2003 - 18:56:07 EST


On Wed, Jul 30, 2003 at 06:43:18PM -0500, linas@austin.ibm.com wrote:
> Hi,
>
> OK, I finally spent some time studying the timer code and am starting
> to grasp it. I think I finally understand the bug. Andrea's timer->lock
> patch will fix it for 2.4 and everybody says it can't happen in 2.6 so
> ok.
>
>
> On Thu, Jul 31, 2003 at 12:17:17AM +0200, Andrea Arcangeli wrote:
> > So the best fix would be to nuke the run_all_timers thing from 2.4 too.
>
> Yes.
>
> >and to keep the timer->lock everywhere to make run_all_timers safe.
>
> Or do that instead, yes.
>
> > In short the stack traces I described today were all right but only for
> > 2.4, and not for 2.6.
>
> I see the bug in 2.4.
>
> On Wed, Jul 30, 2003 at 12:31:23PM +0200, Ingo Molnar wrote:
> >
> > On Wed, 30 Jul 2003, Andrea Arcangeli wrote:
> >
> > >
> > > cpu0 cpu1
> > > ------------ --------------------
> > >
> > > do_setitimer
> > > it_real_fn
> > > del_timer_sync add_timer -> crash
> >
> > would you mind to elaborate the precise race? I cannot see how the above
> > sequence could crash on current 2.6:
>
> I don't know enough about how 2.6 works to say,
> but here's more detail on what happened in 2.4:
>
>
> cpu 0 cpu 3
> ------- ---------
> previously, timer->base = cpu 3 base
> (its task-struct->real_timer)
> bh_action() {
> __run_timers() { sys_setitimer() {
> it_real_fn() // called as fn() del_timer_sync() {
> add_timer() { spinlock (cpu3 base)
> spinlock (cpu0 base)
> timer->base = cpu0 base detach_timer (timer)
> list_add(&timer->list)
> timer->list.next = NULL;
> and now timer is vectored in but can't be unlinked.
>
> And so either of Andrea's fixes should fix this race.

exactly. If you want to nuke the run_all_timers from the 2.4 backport
feel free, then we could drop the additional locking from
add_timer/del_timer* and leave it only in mod_timer like 2.6 does, that
will avoid cacheline bouncing in the small window when the local timer
irq run on top of an irq handler. In the meantime the kernel is already
solid (again ;).

2.6 will kernel crash like we did in 2.4 only if it calls
add_timer_on from timer context (of course with a cpuid != than the
smp_processor_id()), that would be fixed by the timer->lock everywhere
that we've in 2.4 right now. (but there's no add_timer_on in 2.4
anyways)

BTW, it's reassuring it wasn't lack of 2.6 stress testing, I apologize
for claiming that.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Jul 31 2003 - 22:00:48 EST