Re: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events

From: Jason Baron
Date: Fri May 01 2020 - 18:11:00 EST




On 5/1/20 5:02 PM, Roman Penyaev wrote:
> Hi Jason,
>
> That is indeed a nice catch.
> It seems we need an smp_rmb() between list_empty_careful(&rp->rdllist) and
> READ_ONCE(ep->ovflist) in ep_events_available(), don't we?
>

Hi Roman,

Good point. Even if we order those reads, it's still racy: the read of the
ready list could come after it has been cleared, and the read of the
overflow list could likewise come after it has been cleared.

So I'm afraid we instead need something like this to make sure they are
read together:

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d6ba0e5..31c5f14 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1899,7 +1899,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			break;
 		}
 
+		write_lock_irq(&ep->lock);
 		eavail = ep_events_available(ep);
+		write_unlock_irq(&ep->lock);
 		if (eavail)
 			break;
 		if (signal_pending(current)) {


Thanks,

-Jason



> Other than that:
>
> Reviewed-by: Roman Penyaev <rpenyaev@xxxxxxx>
>
> --
> Roman
>
> On 2020-05-01 21:15, Jason Baron wrote:
>> Now that the ep_events_available() check is done in a lockless way, and
>> we no longer perform wakeups from ep_scan_ready_list(), we need to ensure
>> that either ep->rdllist has items or the overflow list is active. Prior to:
>> commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested
>> epoll"), we did wake_up(&ep->wq) after manipulating the ep->rdllist and the
>> overflow list. Thus, any waiters would observe the correct state. However,
>> with that wake_up() now removed we need to be more careful to ensure that
>> condition.
>>
>> Here's an example of what could go wrong:
>>
>> We have two epoll fds: epfd1 and epfd2. epfd1 is added to epfd2, and epfd2
>> is added to a socket: epfd1->epfd2->socket. Thread a is doing epoll_wait()
>> on epfd1, and thread b is doing epoll_wait() on epfd2. Then:
>>
>> 1) data comes in on socket
>>
>> ep_poll_callback() wakes up threads a and b
>>
>> 2) thread a runs
>>
>> ep_poll()
>>  ep_scan_ready_list()
>>   ep_send_events_proc()
>>    ep_item_poll()
>>     ep_scan_ready_list()
>>      list_splice_init(&ep->rdllist, &txlist);
>>
>> 3) now thread b is running
>>
>> ep_poll()
>>  ep_events_available()
>>   returns false
>>  schedule_hrtimeout_range()
>>
>> Thus, thread b has now scheduled and missed the wakeup.
>>
>> Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
>> Signed-off-by: Jason Baron <jbaron@xxxxxxxxxx>
>> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
>> Cc: Heiher <r@xxxxxx>
>> Cc: Roman Penyaev <rpenyaev@xxxxxxx>
>> Cc: Khazhismel Kumykov <khazhy@xxxxxxxxxx>
>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> Cc: Davidlohr Bueso <dbueso@xxxxxxx>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> ---
>>  fs/eventpoll.c | 23 +++++++++++++++++------
>>  1 file changed, 17 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index aba03ee749f8..4af2d020f548 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -704,8 +704,14 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>  	 * in a lockless way.
>>  	 */
>>  	write_lock_irq(&ep->lock);
>> -	list_splice_init(&ep->rdllist, &txlist);
>>  	WRITE_ONCE(ep->ovflist, NULL);
>> +	/*
>> +	 * In ep_poll() we use ep_events_available() in a lockless way to decide
>> +	 * if events are available. So we need to preserve that either
>> +	 * ep->ovflist != EP_UNACTIVE_PTR or there are events on the ep->rdllist.
>> +	 */
>> +	smp_wmb();
>> +	list_splice_init(&ep->rdllist, &txlist);
>>  	write_unlock_irq(&ep->lock);
>>
>>  	/*
>> @@ -737,16 +743,21 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>  		}
>>  	}
>>  	/*
>> +	 * Quickly re-inject items left on "txlist".
>> +	 */
>> +	list_splice(&txlist, &ep->rdllist);
>> +	/*
>> +	 * In ep_poll() we use ep_events_available() in a lockless way to decide
>> +	 * if events are available. So we need to preserve that either
>> +	 * ep->ovflist != EP_UNACTIVE_PTR or there are events on the ep->rdllist.
>> +	 */
>> +	smp_wmb();
>> +	/*
>>  	 * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
>>  	 * releasing the lock, events will be queued in the normal way inside
>>  	 * ep->rdllist.
>>  	 */
>>  	WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>> -
>> -	/*
>> -	 * Quickly re-inject items left on "txlist".
>> -	 */
>> -	list_splice(&txlist, &ep->rdllist);
>>  	__pm_relax(ep->ws);
>>  	write_unlock_irq(&ep->lock);
>