Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode

From: Heiher
Date: Wed Oct 09 2019 - 02:04:19 EST


Hi,

On Tue, Oct 8, 2019 at 3:10 AM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
>
> On 2019-10-07 20:43, Jason Baron wrote:
> > On 10/7/19 2:30 PM, Roman Penyaev wrote:
> >> On 2019-10-07 18:42, Jason Baron wrote:
> >>> On 10/7/19 6:54 AM, Roman Penyaev wrote:
> >>>> On 2019-10-03 18:13, Jason Baron wrote:
> >>>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
> >>>>>> On 2019-09-28 04:29, Andrew Morton wrote:
> >>>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@xxxxxx> wrote:
> >>>>>>>
> >>>>>>>> From: Heiher <r@xxxxxx>
> >>>>>>>>
> >>>>>>>> Take the case where we have:
> >>>>>>>>
> >>>>>>>> t0
> >>>>>>>> | (ew)
> >>>>>>>> e0
> >>>>>>>> | (et)
> >>>>>>>> e1
> >>>>>>>> | (lt)
> >>>>>>>> s0
> >>>>>>>>
> >>>>>>>> t0: thread 0
> >>>>>>>> e0: epoll fd 0
> >>>>>>>> e1: epoll fd 1
> >>>>>>>> s0: socket fd 0
> >>>>>>>> ew: epoll_wait
> >>>>>>>> et: edge-trigger
> >>>>>>>> lt: level-trigger
> >>>>>>>>
> >>>>>>>> We only need to wakeup nested epoll fds if something has been
> >>>>>>>> queued
> >>>>>>>> to the
> >>>>>>>> overflow list, since the ep_poll() traverses the rdllist during
> >>>>>>>> recursive poll
> >>>>>>>> and thus events on the overflow list may not be visible yet.
> >>>>>>>>
> >>>>>>>> Test code:
> >>>>>>>
> >>>>>>> Look sane to me. Do you have any performance testing results
> >>>>>>> which
> >>>>>>> show a benefit?
> >>>>>>>
> >>>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
> >>>>>>> Roman, would you please have time to review this?
> >>>>>>
> >>>>>> So here is my observation: current patch does not fix the
> >>>>>> described
> >>>>>> problem (double wakeup) for the case, when new event comes exactly
> >>>>>> to the ->ovflist. According to the patch this is the desired
> >>>>>> intention:
> >>>>>>
> >>>>>> /*
> >>>>>> * We only need to wakeup nested epoll fds if something has
> >>>>>> been
> >>>>>> queued
> >>>>>> * to the overflow list, since the ep_poll() traverses the
> >>>>>> rdllist
> >>>>>> * during recursive poll and thus events on the overflow list
> >>>>>> may
> >>>>>> not be
> >>>>>> * visible yet.
> >>>>>> */
> >>>>>> if (nepi != NULL)
> >>>>>> pwake++;
> >>>>>>
> >>>>>> ....
> >>>>>>
> >>>>>> if (pwake == 2)
> >>>>>> ep_poll_safewake(&ep->poll_wait);
> >>>>>>
> >>>>>>
> >>>>>> but this actually means that we repeat the same behavior (double
> >>>>>> wakeup)
> >>>>>> but only for the case, when event comes to the ->ovflist.
> >>>>>>
> >>>>>> How to reproduce? Can be easily done (ok, not so easy but it is
> >>>>>> possible
> >>>>>> to try): to the given userspace test we need to add one more
> >>>>>> socket
> >>>>>> and
> >>>>>> immediately fire the event:
> >>>>>>
> >>>>>> e.events = EPOLLIN;
> >>>>>> if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
> >>>>>> goto out;
> >>>>>>
> >>>>>> /*
> >>>>>> * Signal any fd to let epoll_wait() to call
> >>>>>> ep_scan_ready_list()
> >>>>>> * in order to "catch" it there and add new event to
> >>>>>> ->ovflist.
> >>>>>> */
> >>>>>> if (write(s2fd[1], "w", 1) != 1)
> >>>>>> goto out;
> >>>>>>
> >>>>>> That is done in order to let the following epoll_wait() call to
> >>>>>> invoke
> >>>>>> ep_scan_ready_list(), where we can "catch" and insert new event
> >>>>>> exactly
> >>>>>> to the ->ovflist. In order to insert event exactly in the correct
> >>>>>> list
> >>>>>> I introduce artificial delay.
> >>>>>>
> >>>>>> Modified test and kernel patch is below. Here is the output of
> >>>>>> the
> >>>>>> testing tool with some debug lines from kernel:
> >>>>>>
> >>>>>> # ~/devel/test/edge-bug
> >>>>>> [ 59.263178] ### sleep 2
> >>>>>> >> write to sock
> >>>>>> [ 61.318243] ### done sleep
> >>>>>> [ 61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait);
> >>>>>> events_in_rdllist=1, events_in_ovflist=1
> >>>>>> [ 61.321204] ### sleep 2
> >>>>>> [ 63.398325] ### done sleep
> >>>>>> error: What?! Again?!
> >>>>>>
> >>>>>> First epoll_wait() call (ep_scan_ready_list()) observes 2 events
> >>>>>> (see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
> >>>>>> wanted to achieve, so eventually ep_poll_safewake() is called
> >>>>>> again
> >>>>>> which leads to double wakeup.
> >>>>>>
> >>>>>> In my opinion current patch as it is should be dropped, it does
> >>>>>> not
> >>>>>> fix the described problem but just hides it.
> >>>>>>
> >>>>>> --
> >>>>
> >>>> Hi Jason,
> >>>>
> >>>>>
> >>>>> Yes, there are 2 wakeups in the test case you describe, but if the
> >>>>> second event (write to s1fd) gets queued after the first call to
> >>>>> epoll_wait(), we are going to get 2 wakeups anyways.
> >>>>
> >>>> Yes, exactly, for this reason I print out the number of events
> >>>> observed
> >>>> on first wait, there should be 1 (rdllist) and 1 (ovflist),
> >>>> otherwise
> >>>> this is another case, when second event comes exactly after first
> >>>> wait, which is legitimate wakeup.
> >>>>
> >>>>> So yes, there may
> >>>>> be a slightly bigger window with this patch for 2 wakeups, but its
> >>>>> small
> >>>>> and I tried to be conservative with the patch - I'd rather get an
> >>>>> occasional 2nd wakeup then miss one. Trying to debug missing
> >>>>> wakeups
> >>>>> isn't always fun...
> >>>>>
> >>>>> That said, the reason for propagating events that end up on the
> >>>>> overflow
> >>>>> list was to prevent the race of the wakee not seeing events because
> >>>>> they
> >>>>> were still on the overflow list. In the testcase, imagine if there
> >>>>> was a
> >>>>> thread doing epoll_wait() on efd[0], and then a write happends on
> >>>>> s1fd.
> >>>>> I thought it was possible then that a 2nd thread doing epoll_wait()
> >>>>> on
> >>>>> efd[1], wakes up, checks efd[0] and sees no events because they are
> >>>>> still potentially on the overflow list. However, I think that case
> >>>>> is
> >>>>> not possible because the thread doing epoll_wait() on efd[0] is
> >>>>> going to
> >>>>> have the ep->mtx, and thus when the thread wakes up on efd[1], its
> >>>>> going
> >>>>> to have to be ordered because its also grabbing the ep->mtx
> >>>>> associated
> >>>>> with efd[0].
> >>>>>
> >>>>> So I think its safe to do the following if we want to go further
> >>>>> than
> >>>>> the proposed patch, which is what you suggested earlier in the
> >>>>> thread
> >>>>> (minus keeping the wakeup on ep->wq).
> >>>>
> >>>> Then I do not understand why we need to keep ep->wq wakeup?
> >>>> @wq and @poll_wait are almost the same with only one difference:
> >>>> @wq is used when you sleep on it inside epoll_wait() and the other
> >>>> is used when you nest epoll fd inside epoll fd. Either you wake
> >>>> both up either you don't this at all.
> >>>>
> >>>> ep_poll_callback() does wakeup explicitly, ep_insert() and
> >>>> ep_modify()
> >>>> do wakeup explicitly, so what are the cases when we need to do
> >>>> wakeups
> >>>> from ep_scan_ready_list()?
> >>>
> >>> Hi Roman,
> >>>
> >>> So the reason I was saying not to drop the ep->wq wakeup was that I
> >>> was
> >>> thinking about a usecase where you have multi-threads say thread A
> >>> and
> >>> thread B which are doing epoll_wait() on the same epfd. Now, the
> >>> threads
> >>> both call epoll_wait() and are added as exclusive to ep->wq. Now a
> >>> bunch
> >>> of events happen and thread A is woken up. However, thread A may only
> >>> process a subset of the events due to its 'maxevents' parameter. In
> >>> that
> >>> case, I was thinking that the wakeup on ep->wq might be helpful,
> >>> because
> >>> in the absence of subsequent events, thread B can now start
> >>> processing
> >>> the rest, instead of waiting for the next event to be queued.
> >>>
> >>> However, I was thinking about the state of things before:
> >>> 86c0517 fs/epoll: deal with wait_queue only once
> >>>
> >>> Before that patch, thread A would have been removed from eq->wq
> >>> before
> >>> the wakeup call, thus waking up thread B. However, now that thread A
> >>> stays on the queue during the call to ep_send_events(), I believe the
> >>> wakeup would only target thread A, which doesn't help since its
> >>> already
> >>> checking for events. So given the state of things I think you are
> >>> right
> >>> in that its not needed. However, I wonder if not removing from the
> >>> ep->wq affects the multi-threaded case I described. Its been around
> >>> since 5.0, so probably not, but it would be a more subtle performance
> >>> difference.
> >>
> >> Now I understand what you mean. You want to prevent "idling" of
> >> events,
> >> while thread A is busy with the small portion of events (portion is
> >> equal
> >> to maxevents). On next iteration thread A will pick up the rest, no
> >> doubts,
> >> but would be nice to give a chance to thread B immediately to deal
> >> with the
> >> rest. Ok, makes sense.
> >
> > Exactly, I don't believe its racy as is - but it seems like it would be
> > good to wakeup other threads that may be waiting. That said, this logic
> > was removed as I pointed out. So I'm not sure we need to tie this
> > change
> > to the current one - but it may be a nice addition.
> >
> >>
> >> But what if to make this wakeup explicit if we have more events to
> >> process?
> >> (nothing is tested, just a guess)
> >>
> >> @@ -255,6 +255,7 @@ struct ep_pqueue {
> >> struct ep_send_events_data {
> >> int maxevents;
> >> struct epoll_event __user *events;
> >> + bool have_more;
> >> int res;
> >> };
> >> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct
> >> eventpoll *ep, struct list_head *head
> >> }
> >>
> >> static int ep_send_events(struct eventpoll *ep,
> >> - struct epoll_event __user *events, int
> >> maxevents)
> >> + struct epoll_event __user *events, int
> >> maxevents,
> >> + bool *have_more)
> >> {
> >> - struct ep_send_events_data esed;
> >> -
> >> - esed.maxevents = maxevents;
> >> - esed.events = events;
> >> + struct ep_send_events_data esed = {
> >> + .maxevents = maxevents,
> >> + .events = events,
> >> + };
> >>
> >> ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
> >> + *have_more = esed.have_more;
> >> +
> >> return esed.res;
> >> }
> >>
> >> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct
> >> epoll_event __user *events,
> >> {
> >> int res = 0, eavail, timed_out = 0;
> >> u64 slack = 0;
> >> - bool waiter = false;
> >> + bool waiter = false, have_more;
> >> wait_queue_entry_t wait;
> >> ktime_t expires, *to = NULL;
> >>
> >> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct
> >> epoll_event __user *events,
> >> * more luck.
> >> */
> >> if (!res && eavail &&
> >> - !(res = ep_send_events(ep, events, maxevents)) &&
> >> !timed_out)
> >> + !(res = ep_send_events(ep, events, maxevents, &have_more))
> >> &&
> >> + !timed_out)
> >> goto fetch_events;
> >>
> >> if (waiter) {
> >> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct
> >> epoll_event __user *events,
> >> __remove_wait_queue(&ep->wq, &wait);
> >> spin_unlock_irq(&ep->wq.lock);
> >> }
> >> + /*
> >> + * We were not able to process all the events, so immediately
> >> + * wakeup other waiter.
> >> + */
> >> + if (res > 0 && have_more && waitqueue_active(&ep->wq))
> >> + wake_up(&ep->wq);
> >>
> >> return res;
> >> }
> >>
> >>
> >
> > Ok, yeah I like making it explicit. Looks like you are missing the
> > changes to ep_scan_ready_list(), but I think the general approach makes
> > sense.
>
> Yeah, missed the hunk:
>
> @@ -1719,8 +1704,10 @@ static __poll_t ep_send_events_proc(struct
> eventpoll *ep, struct list_head *head
> lockdep_assert_held(&ep->mtx);
>
> list_for_each_entry_safe(epi, tmp, head, rdllink) {
> - if (esed->res >= esed->maxevents)
> + if (esed->res >= esed->maxevents) {
> + esed->have_more = true;
> break;
> + }
>
> > Although I really didn't have a test case that motivated this -
> > its just was sort of noting this change in behavior while reviewing the
> > current change.
> >
> >> PS. So what we decide with the original patch? Remove the whole
> >> branch?
> >>
> >
> > For fwiw, I'm ok removing the whole branch as you proposed.
>
> Then probably Heiher could resend once more. Heiher, can you, please?

Sorry for delay.
Thank you for your help! That's ok, I will re-send the patch and add
unit-tests to
kselftests in the later patches.

>
> > And I think the above change can go in separately (if we decide we want
> > it). I don't
> > think they need to be tied together. I also want to make sure this
> > change gets a full linux-next cycle, so I think it should target 5.5 at
> > this point.
>
> Sure, this explicit ->wq wakeup is a separate patch, which should be
> covered with some benchmarks. I can try to cook something out in order
> to get numbers.
>
> --
> Roman
>
--
Best regards!
Hev
https://hev.cc