Re: [PATCH]: Re: kernel 2.6.9-rc1-mm4 oops

From: Kirill Korotaev
Date: Mon Sep 13 2004 - 04:06:56 EST

Next message: Sierota (o2.pl/tlen.pl): "Re: [BUG] kernel BUG at fs/jbd/checkpoint.c:646!"
Previous message: Andrew Morton: "Re: [BUG] kernel BUG at fs/jbd/checkpoint.c:646!"
In reply to: Ingo Molnar: "Re: [PATCH]: Re: kernel 2.6.9-rc1-mm4 oops"
Next in thread: Ingo Molnar: "Re: [PATCH]: Re: kernel 2.6.9-rc1-mm4 oops"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Ingo Molnar wrote:

* Kirill Korotaev <dev@xxxxx> wrote:

This patch removes sighand checks from the next_thread(), since they
are incorrect and has nothing to do with the next_thread() function.
So they could trigger BUG() when there were no actually bug at all.

the problem is, generally it is not valid to have a thread on the thread
list that has no ->sighand structure. This is what happens when we exit
a task:

write_lock_irq(&tasklist_lock);
...
__exit_sighand(p);
__unhash_process(p);

the BUG() is useful for all the code that uses next_thread() - you can
only do a safe next_thread() iteration if you've locked ->sighand.

1. I don't see spin_lock() on p->sighand->siglock in do_task_stat() before calling next_thread(). And the check inside next_thread() permits only one of the locks to be taken:

if (!spin_is_locked(&p->sighand->siglock) &&
!rwlock_is_locked(&tasklist_lock))

which is probably wrong, since tasklist_lock is always required!

2. I think the idea of checking sighand is quite obscure.
Probably it would be better to call pid_alive() for check at such places in proc, isn't it?

3. And yes, now I agree that this check and BUG() prevented next_thread() from walking through the deleted list_head in tsk->pid_list.

But I would propose to reorganize these checks in next_thread() to something like this:

if (!rwlock_is_locked(&tasklist_lock) || p->pids[PIDTYPE_TGID].nr == 0)
BUG();

the last check ensures that we are still hashed and this check is more straithforward for understanding, agree?

4. If we do checks this way, then we can found strange proc numeric results. Suppose, you have read the do_task_stat() and it iterated through the threads and summed the times in this loop with next_thread(). Next, the thread dies and you can read the results w/o this loop and threads times, only this thread stats. Looks a bit invalid. Don't you think so?
Maybe we should return an error?

so i believe your fix papers over the real bug which is the use of an
almost-dead task for thread iterations. Since we've already done
__unhash_process() not doing the BUG introduces a more subtle bug: the
use of the stale PID pointers! So i believe the right fix is the one
below, which (under the safety of read_lock(tasklock)) checks for the
availability of task->sighand - and skips the thread iterations if so.

You patch is correct. Though I would like to hear on my previous notes about pid_alive().

Kirill

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Sierota (o2.pl/tlen.pl): "Re: [BUG] kernel BUG at fs/jbd/checkpoint.c:646!"
Previous message: Andrew Morton: "Re: [BUG] kernel BUG at fs/jbd/checkpoint.c:646!"
In reply to: Ingo Molnar: "Re: [PATCH]: Re: kernel 2.6.9-rc1-mm4 oops"
Next in thread: Ingo Molnar: "Re: [PATCH]: Re: kernel 2.6.9-rc1-mm4 oops"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]