Re: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events

From: Jason Baron
Date: Mon May 04 2020 - 01:01:00 EST




On 5/4/20 12:29 AM, Jason Baron wrote:
>
>
> On 5/3/20 6:24 AM, Roman Penyaev wrote:
>> On 2020-05-02 00:09, Jason Baron wrote:
>>> On 5/1/20 5:02 PM, Roman Penyaev wrote:
>>>> Hi Jason,
>>>>
>>>> That is indeed a nice catch.
>>>> Seems we need smp_rmb() pair between list_empty_careful(&rp->rdllist) and
>>>> READ_ONCE(ep->ovflist) for ep_events_available(), do we?
>>>>
>>>
>>> Hi Roman,
>>>
>>> Good point, even if we order those reads its still racy, since the
>>> read of the ready list could come after its been cleared and the
>>> read of the overflow could again come after its been cleared.
>>
>> You mean the second chunk? True. Sigh.
>>
>>> So I'm afraid we might need instead something like this to make
>>> sure they are read together:
>>
>> No, impossible, I can't believe in that :) We can't give up.
>>
>> All we need is to keep a mark, that ep->rdllist is not empty,
>> even we've just spliced it. ep_poll_callback() always takes
>> the ->ovflist path, if ->ovflist is not EP_UNACTIVE_PTR, but
>> ep_events_available() does not need to observe ->ovflist at
>> all, just a ->rdllist.
>>
>> Take a look at that, do I miss something? :
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index aba03ee749f8..a8770f9a917e 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -376,8 +376,7 @@ static void ep_nested_calls_init(struct nested_calls *ncalls)
>> Â */
>> Âstatic inline int ep_events_available(struct eventpoll *ep)
>> Â{
>> -ÂÂÂÂÂÂ return !list_empty_careful(&ep->rdllist) ||
>> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
>> +ÂÂÂÂÂÂ return !list_empty_careful(&ep->rdllist);
>> Â}
>>
>> Â#ifdef CONFIG_NET_RX_BUSY_POLL
>> @@ -683,7 +682,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> Â{
>> ÂÂÂÂÂÂÂ __poll_t res;
>> ÂÂÂÂÂÂÂ struct epitem *epi, *nepi;
>> -ÂÂÂÂÂÂ LIST_HEAD(txlist);
>> +ÂÂÂÂÂÂ LIST_HEAD(rdllist);
>> +ÂÂÂÂÂÂ LIST_HEAD(ovflist);
>>
>> ÂÂÂÂÂÂÂ lockdep_assert_irqs_enabled();
>>
>> @@ -704,14 +704,22 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> ÂÂÂÂÂÂÂÂ * in a lockless way.
>> ÂÂÂÂÂÂÂÂ */
>> ÂÂÂÂÂÂÂ write_lock_irq(&ep->lock);
>> -ÂÂÂÂÂÂ list_splice_init(&ep->rdllist, &txlist);
>> +ÂÂÂÂÂÂ /*
>> +ÂÂÂÂÂÂÂ * We do not call list_splice_init() because for lockless
>> +ÂÂÂÂÂÂÂ * ep_events_available() ->rdllist is still "not empty".
>> +ÂÂÂÂÂÂÂ * Otherwise the feature that there is something left in
>> +ÂÂÂÂÂÂÂ * the list can be lost which causes missed wakeup.
>> +ÂÂÂÂÂÂÂ */
>> +ÂÂÂÂÂÂ list_splice(&ep->rdllist, &rdllist);
>> +ÂÂÂÂÂÂ /*
>> +ÂÂÂÂÂÂÂ * If ->rdllist was empty we should pretend it was not,
>> +ÂÂÂÂÂÂÂ * because after the unlock ->ovflist comes into play,
>> +ÂÂÂÂÂÂÂ * which is invisible for lockless ep_events_available().
>> +ÂÂÂÂÂÂÂ */
>> +ÂÂÂÂÂÂ ep->rdllist.next = LIST_POISON1;
>> ÂÂÂÂÂÂÂ WRITE_ONCE(ep->ovflist, NULL);
>> ÂÂÂÂÂÂÂ write_unlock_irq(&ep->lock);
>>
>> ÂÂÂÂÂÂÂ /*
>> ÂÂÂÂÂÂÂÂ * Now call the callback function.
>> ÂÂÂÂÂÂÂÂ */
>> -ÂÂÂÂÂÂ res = (*sproc)(ep, &txlist, priv);
>> +ÂÂÂÂÂÂ res = (*sproc)(ep, &rdllist, priv);
>>
>> ÂÂÂÂÂÂÂ write_lock_irq(&ep->lock);
>> ÂÂÂÂÂÂÂ /*
>> @@ -724,7 +732,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ /*
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * We need to check if the item is already in the list.
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * During the "sproc" callback execution time, items are
>> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * queued into ->ovflist but the "txlist" might already
>> +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * queued into ->ovflist but the "rdllist" might already
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * contain them, and the list_splice() below takes care of them.
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ */
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ if (!ep_is_linked(epi)) {
>> @@ -732,7 +740,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * ->ovflist is LIFO, so we have to reverse it in order
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * to keep in FIFO.
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ */
>> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ list_add(&epi->rdllink, &ep->rdllist);
>> +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ list_add(&epi->rdllink, &ovflist);
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ep_pm_stay_awake(epi);
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ }
>> ÂÂÂÂÂÂÂ }
>> @@ -743,10 +751,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> ÂÂÂÂÂÂÂÂ */
>> ÂÂÂÂÂÂÂ WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>>
>> -ÂÂÂÂÂÂ /*
>> -ÂÂÂÂÂÂÂ * Quickly re-inject items left on "txlist".
>> -ÂÂÂÂÂÂÂ */
>> -ÂÂÂÂÂÂ list_splice(&txlist, &ep->rdllist);
>> +ÂÂÂÂÂÂ /* Events from ->ovflist happened later, thus splice to the tail */
>> +ÂÂÂÂÂÂ list_splice_tail(&ovflist, &rdllist);
>> +ÂÂÂÂÂÂ /* Just replace list */
>> +ÂÂÂÂÂÂ list_replace(&rdllist, &ep->rdllist);
>> +
>> ÂÂÂÂÂÂÂ __pm_relax(ep->ws);
>> ÂÂÂÂÂÂÂ write_unlock_irq(&ep->lock);
>>
>> @@ -1763,13 +1772,13 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * Trigger mode, we need to insert back inside
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * the ready list, so that the next call to
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * epoll_wait() will check again the events
>> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * availability. At this point, no one can insert
>> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * into ep->rdllist besides us. The epoll_ctl()
>> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * callers are locked out by
>> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * ep_scan_ready_list() holding "mtx" and the
>> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * poll callback will queue them in ep->ovflist.
>> +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * availability. What we do here is simply
>> +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * return the epi to the same position where
>> +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * it was, the ep_scan_ready_list() will
>> +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * re-inject the leftovers to the ->rdllist
>> +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * under the proper lock.
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ */
>> -ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ list_add_tail(&epi->rdllink, &ep->rdllist);
>> +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ list_add_tail(&epi->rdllink, &tmp->rdllink);
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ep_pm_stay_awake(epi);
>> ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ }
>> ÂÂÂÂÂÂÂ }
>>
>>
>> --
>> Roman
>>
>
>
> Hi Roman,
>
> I think this misses an important case - the initial ep_poll_callback()
> may queue to the overflow list. In this case ep_poll has no visibility
> into the event since its only checking ep->rdllist.

Ok, my mistake - I see it sets: ep->rdllist.next = LIST_POISON1; for that
case. Ok I think this approach makes sense then.

Thanks,

-Jason