Re: PATCH: Race in 2.6.0-test2 timer code

From: linas@austin.ibm.com
Date: Wed Jul 30 2003 - 16:18:10 EST


On Wed, Jul 30, 2003 at 02:34:58PM +0200, Andrea Arcangeli wrote:
>
> so if it was really the itimer, it had to be on ppc s390 or ia64, not on
> x86. I never reproduced this myself. I will ask more info on bugzilla,
> because I thought it was x86 but maybe it wasn't. As said in the
> previous email, only non x86 archs can run the timer irq on a cpu
> different than the one where it was inserted.

I think I started the thread, so let me backtrack. I've got several
ppc64 boxes running 2.4.21. They hang, they don't crash. Sometimes
they're pingable, but that's all. The little yellow button is wired
up to KDB, so I can break in and poke around and see what's happening.

Here's what I find:

__run_timers() is stuck in a infinite loop, because there's a
timer on the base. However, this timer has timer->list.net == NULL,
and so it can never be removed. (As a side effect, other CPU's get
stuck in various places, sitting in spinlocks, etc. which is why the
machine goes unresponsive) So the question becomes "how did this
pointer get trashed?"

I thought I saw an "obvious race" in the 2.4 code, and it looked
identical to the 2.6 code, and thus the email went out. Now, I'm
not so sure; I'm confused.

However, given that its timer->list.net that is getting clobbered,
then locking all occurrances of list_add and list_del should cure the
problem. I'm guessing that Andrea's patch to add a timer->lock
will protect all of the list_add's, list_del's that matter.

Given that I don't yet understand how timer->list.net got clobbered,
I can't yet say 'yes/no this fixes it'.

--linas

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Jul 31 2003 - 22:00:47 EST