wake_up in a race with wait_event_timeout

From: Mark Krag
Date: Wed Sep 16 2015 - 11:10:38 EST


Is the current top of the main branch free of the problem described below? In other words: is wait_event_timeout safe against race conditions?

Looking at the Git history, several improvements were made to wait_event_timeout & Co. (include/linux)
to avoid race conditions, some of them even before 3.10.
However, a problem I observe with my setup suggests that a race condition is still possible, at least in 3.10.

Short description of the situation:
SoC with two cores. One core is under Linux control, the other one is not. Kernel 3.10 is used.
A Linux kernel component (an audio device driver) sends a message packet to the other core using a proprietary interface.
Before sending, the driver forces the condition to “not met”.
It then goes to sleep in wait_event_timeout while waiting for the confirmation packet.
#define TIMEOUT_MS 1000

atomic_set(&this_inst.state, 1);
ret = _send_pkt(….);
if (ret < 0) {
    pr_err("%s: … failed \n", __func__);
    ret = -EINVAL;
    goto fail_cmd;
}
ret = wait_event_timeout(this_inst.wait[index],
                         (atomic_read(&this_inst.state) == 0),
                         msecs_to_jiffies(TIMEOUT_MS));
if (!ret) {
    pr_err("%s: wait_event timeout\n", __func__);
    ret = -EINVAL;
    goto fail_cmd;
}

The confirmation packet handler (possibly running in interrupt context, I don’t know exactly) sets the condition to “met”
and calls wake_up for the waiting kernel thread.
atomic_set(&this_inst.state, 0);
wake_up(&this_inst.wait[data->token]);
End of the short description.

Problem:
In some cases wait_event_timeout runs into the timeout instead of the thread being woken up.
However, there are also cases where the waiting kernel thread is woken up before the timeout elapses;
this applies to the transmission of other message packets. Problems occur only when sending
a few types of packets (two are known so far).
End of Problem.

Let’s have a look at the wait_event_timeout implementation.
After resolving all the necessary macros, the section relevant to the problem above looks as shown below (pseudo code).
if(condition met)
  break;
// …. few checks here
expire = __ret + jiffies;
setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
__mod_timer(&timer, expire, false, TIMER_NOT_PINNED);
schedule();

So there are some lines of code between the last condition check and schedule().
As I understand it, a wake_up has a good chance of hitting this window and being lost.
Can you confirm or refute this conclusion, please?
This seems to be what happens in the case under analysis.
Am I on the right track in my search for the root cause?

Several optimizations have been made to wait.h in the tree since 3.10.
However, none of them seems to address the problem described here.
How good are the chances that the lost wake-up problem gets fixed
by upgrading the kernel from 3.10 to the current one?