Re: [PATCH 1/2] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed

From: Vladimir Davydov
Date: Mon Dec 22 2014 - 11:26:17 EST


On Mon, Dec 22, 2014 at 03:24:35PM +0100, Michal Hocko wrote:
> On Sat 20-12-14 17:18:24, Vladimir Davydov wrote:
> > On Sat, Dec 20, 2014 at 11:47:46AM +0100, Michal Hocko wrote:
> > > On Fri 19-12-14 21:28:15, Vladimir Davydov wrote:
> > > > So AFAIU the problem does exist. However, I think it could be fixed by
> > > > simply waking up all processes waiting on pfmemalloc_wait before putting
> > > > kswapd to sleep:
> > >
> > > I think that a simple cond_resched() in kswapd_try_to_sleep should be
> > > sufficient and less risky fix, so basically what Vlastimil was proposing
> > > in the beginning.
> >
> > With such a solution we implicitly rely upon the scheduler
> > implementation, which AFAIU is wrong.
>
> But this is a scheduling problem, isn't it? !PREEMPT kernel with a
> kernel thread looping without a scheduling point which prevents the
> woken task to run due to cpu affinity...

I wouldn't say so. To me it looks more like an abuse of the workqueue
API. AFAIU an abstract scheduler algorithm isn't supposed to guarantee
that an arbitrary process will ever get scheduled unless the CPU is
idle. Of course, my example below looks contrived. Nobody will raise
kswapd priority for it to preempt other processes. However, IMO we
shouldn't rely on that in the mm subsys, which is rather orthogonal to
the sched.

>
> > E.g. suppose processes are
> > governed by FIFO and kswapd happens to have a higher prio than the
> > process killed by OOM. Then after cond_resched kswapd will be picked for
> > execution again, and the killing process won't have a chance to remove
> > itself from the wait queue.
>
> Except that kswapd runs as SCHED_NORMAL with 0 priority.
>
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 744e2b491527..2a123634c220 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2984,6 +2984,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> > > > if (remaining)
> > > > return false;
> > > >
> > > > + if (!pgdat_balanced(pgdat, order, classzone_idx))
> > > > + return false;
> > > > +
> > >
> > > What would be consequences of not waking up pfmemalloc waiters while the
> > > node is not balanced?
> >
> > They will get woken up a bit later in balanced_pgdat. This might result
> > in latency spikes though. In order not to change the original behaviour
> > we could always wake all pfmemalloc waiters no matter if we are going to
> > sleep or not:
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 744e2b491527..a21e0bd563c3 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2993,10 +2993,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
> > * so wake them now if necessary. If necessary, processes will wake
> > * kswapd and get throttled again
> > */
> > - if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
> > - wake_up(&pgdat->pfmemalloc_wait);
> > - return false;
> > - }
> > + wake_up_all(&pgdat->pfmemalloc_wait);
> >
> > return pgdat_balanced(pgdat, order, classzone_idx);
>
> So you are relying on scheduling points somewhere down the
> balance_pgdat. That should be sufficient. I am still quite surprised
> that we have an OOM victim still on the queue and balanced pgdat here
> because OOM victim didn't have chance to free memory. So somebody else
> must have released a lot of memory after OOM.
>
> This patch seems better than the one from Vlastimil. Care to post it
> with the full changelog, please?

Attached below (merged with 2/2). I haven't checked that it does fix the
issue, because I don't have the reproducer, so it should be committed
only if Vlastimil approves it.

From: Vlastimil Babka <vbabka@xxxxxxx>
Subject: [PATCH] mm, vmscan: prevent kswapd livelock due to
pfmemalloc-throttled process being killed

Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck
in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing
to sleep. Their analysis found the cause to be a combination of several
factors:

1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait

2. The process has been killed (by OOM in this case), but has not yet been
scheduled to remove itself from the waitqueue and die.

3. kswapd checks for throttled processes in prepare_kswapd_sleep() and
do not put itself to sleep if there are any:

if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
wake_up(&pgdat->pfmemalloc_wait);
return false;
}

However, for a process that was already killed, wake_up() does not remove
the process from the waitqueue, since try_to_wake_up() checks its state
first and returns false when the process is no longer waiting.

4. kswapd is running on the same CPU as the only CPU that the process is
allowed to run on (through cpus_allowed, or possibly single-cpu system).

5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
encounters no voluntary preemption points and repeatedly fails
prepare_kswapd_sleep(), blocking the process from running and removing
itself from the waitqueue, which would let kswapd sleep.

So, the source of the problem is that we prevent kswapd from going to
sleep until there are processes waiting on the pfmemalloc_wait queue,
and a process waiting on a queue is guaranteed to be removed from the
queue only when it gets scheduled. This was done to avoid the race
between kswapd checking pfmemalloc_wait and a process getting throttled
as the comment in prepare_kswapd_sleep() explains.

However, it isn't necessary to postpone kswapd sleep until the
pfmemalloc_wait queue empties. To eliminate the race, it's actually
enough to guarantee that all processes waiting on pfmemalloc_wait queue
have been woken up by the time we put kswapd to sleep.

This patch therefore fixes this issue by substituting 'wake_up' with
'wake_up_all' and removing 'return false' in the code snippet from
prepare_kswapd_sleep() above.

Also, it replaces wake_up with wake_up_all in balance_pgdat(), because:
- using wake_up there might leave processes waiting for longer than
necessary, until the check is reached in the next loop iteration;
- processes might also be left waiting even if zone was fully balanced
in single iteration;
- the comment says "wake them" so the author of the commit that
introduced pfmemalloc_wait seemed to mean wake_up_all;
- this corresponds to how we wake processes waiting on pfmemalloc_wait
in prepare_kswapd_sleep.

Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
are low and swap is backed by network storage")
Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
Signed-off-by: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx> # v3.6+
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
---
Changes in v2:
- instead of introducing yet another schedule() point in
kswapd_try_to_sleep(), allow kswapd to sleep even if the
pfmemalloc_wait queue is active, waking *all* throttled
processes before going to sleep

mm/vmscan.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8772b2b9ef..65287944b2cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2961,10 +2961,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
* so wake them now if necessary. If necessary, processes will wake
* kswapd and get throttled again
*/
- if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
- wake_up(&pgdat->pfmemalloc_wait);
- return false;
- }
+ if (waitqueue_active(&pgdat->pfmemalloc_wait))
+ wake_up_all(&pgdat->pfmemalloc_wait);

return pgdat_balanced(pgdat, order, classzone_idx);
}
@@ -3205,7 +3203,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
*/
if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
pfmemalloc_watermark_ok(pgdat))
- wake_up(&pgdat->pfmemalloc_wait);
+ wake_up_all(&pgdat->pfmemalloc_wait);

/*
* Fragmentation may mean that the system cannot be rebalanced
--
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/