PThreads, signals, futex and SMP

From: Guilhem Lavaux
Date: Sat Jan 01 2005 - 11:49:36 EST


Hi,

I am one of the developer of kaffe, a GPL implementation of the Java Virtual Machine, and we encountered some problems on Linux/2.6/SMP. I am currently running 2.6.8.1 Mandrake Kernel. I can try whether the problem is reproduceable on a vanilla kernel but I don't think this part has been touched.

Here is the problem: we have a regression test which is quite intense in thread creation/destruction, each thread can start the garbage collector. The garbage collector needs to stop all running thread to be able to walk the heap/stack. For this, it uses a particular signal which is sent to all threads. The signal handler calls sigwait to stop the thread. 50% of the time everything is fine but from time to time, kaffe has a deadlock. It appears that it always happen when we are in the following configuration:

Thread 1
------------

sigwait
<signal handler>
futex syscall
pthread_mutex_unlock

Thread 2
------------
futex syscall
pthread_mutex_lock
Garbage Collector thread

Now if we look at the futex code in the linux kernel, we see this:

static int futex_wait(unsigned long uaddr, int val, unsigned long time)
{
DECLARE_WAITQUEUE(wait, current);
int ret, curval;
struct futex_q q;

down_read(&current->mm->mmap_sem);

the kernel then prepares the wait queue and unlock mmap_sem.


Concerning the mutex_unlock part we have this:

static int futex_wake(unsigned long uaddr, int nr_wake)
{
union futex_key key;
struct futex_hash_bucket *bh;
struct list_head *head;
struct futex_q *this, *next;
int ret;

down_read(&current->mm->mmap_sem);

and the kernel iterates the semaphores and wakes up all threads.


What may happen if the signal handler is called after down_read in futex_wake ? My guess is that we are not able to call futex_wait because the application will deadlock because the first thread is frozen by a sigwait.

So either we have a limitation of the kernel either a bug if the analysis is correct.
The only point is that I am not sure whether a signal is allowed to interrupt a syscall just in the middle of futex_wake. If this is not possible there may be a bug in our application somewhere else.

Regards,

Guilhem Lavaux.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/