Re: PATCH: Race in 2.6.0-test2 timer code

From: linas@austin.ibm.com
Date: Wed Jul 30 2003 - 15:05:39 EST


On Wed, Jul 30, 2003 at 07:57:32AM +0200, Ingo Molnar wrote:
>
> and i dont think Linas' patch is correct either - how can the timer base
> change under us? We are holding the timer spinlock.

I deduced the need for the patch by looking at the locking code immediately
above it, which takes pains to get old_base and new_base correctly locked.
My patch was to handle the case of old_base being NULL, which seemed to be
unhandled.

Comments & disclaimers:
-- I don't quite have the 'big picture' for the timer code yet,
   so please excuse any thinko's below.
-- I don't know under what cases timer->base can be NULL, but clearly
   there are checks for it being NULL, ergo...
-- I then see a race where timer->base is NULL, and so old_base is NULL,
   then some other CPU sets timer->base, then we get the lock
   on new_base, and blandly assume timer->base is NULL, which it no longer
   is.

A bit more graphically ...

timer->base = NULL; /* starting condition */

cpu 1 cpu 2
-------- ---------
mod_timer() {

old_base = timer->base;
if (old_base && ) { /* not taken */
}
else
. spin_lock(&cpu2_base->lock);
. timer->base = cpu2_base;
. spin_unlock(&cpu2_base->lock);
.
spin_lock(&new_base->lock);
                                                                                
/* Uhh oh, timer->base is not null,
   in fact its cpu2 base, and we aren't
   checking for this before we clobber it.
 */

I don't see what prevents the above from happening. And if the
above really can't happen, then I also don't understand why the fancy
old_base, new-base locking code is required at all.

> What i'd propose is the attached (tested, against 2.6.0-test2) patch

Due to technical difficulties at this end, I'm having trouble testing
any patches. The guy who knows how to run the test case to reproduce
this is out sick, its a big hairy testcase with all sorts of stuff
going on, and I can't make it go. It can take 8+ hours to hit it.
I'll need a few days ...

The other machine that hits it is a 6-way smp powerpc running nothing
but 6 copies of setiathome, but it takes 48+ hours to reproduce there,
so testing is even slower there ...

--linas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Jul 31 2003 - 22:00:47 EST