Re: regression introduced by - timers: fix itimer/many thread hang

From: Oleg Nesterov
Date: Thu Nov 06 2008 - 07:00:22 EST


> Begin forwarded message:
>
> On Tue, 2008-10-28 at 14:38 -0400, Doug Chapman wrote:
> > On Mon, 2008-10-27 at 11:39 -0700, Frank Mayhar wrote:
> > > On Wed, 2008-10-22 at 13:03 -0400, Doug Chapman wrote:
> > > > Unable to handle kernel paging request at virtual address
> > > > 94949494949494a4
> > >
> > > I take it this can be read as an uninitialized (or cleared) pointer?
> > >
> > > It certainly looks like this is a race in thread (process?) teardown. I
> > > don't have hardware on which to reproduce this but _looks_ like another
> > > thread has gotten in and torn down the process while we've been busy.
> >
> > I finally managed to get kdump working and caught this in the act. I
> > still need to dig into this more but I think these 2 threads will show
> > us the race condition. Note that this is a slightly hacked kernel in
> > that I removed "static" from a few functions to better see what was
> > going on but no real functional changes when compared to a recent (day
> > old or so) git pull from Linus's tree.
>
> After digging through this a bit, I've concluded that it's probably a
> race between process reap and the dequeue_entity() call to update_curr()
> combined with a side effect of the slab debug stuff. The
> account_group_exec_runtime() routine (like the rest of these routines)
> checks tsk->signal and tsk->signal->cputime.totals for NULL to make sure
> they're still valid. It looks like at this point tsk->signal is valid
> (since the tsk->signal->cputime dereference succeeded) but
> tsk->signal->cputime.totals is invalid. That can't happen unless the
> process is being reaped,

Frank, currently I don't have the source code which I can look at,
so I am probably wrong... But just in case, perhaps we can do

- account_group_exec_runtime(...);
+ if (lock_task_sighand(...)) {
+ account_group_exec_runtime(...);
+ unlock_task_sighand();
+ }

?

Once we take ->siglock the task can't be reaped, and ->signal becomes
stable and != NULL.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/