[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

From: Michal Hocko
Date: Thu Jul 26 2018 - 01:40:03 EST


Tetsuo Handa has reported that it is possible to bypass the short sleep
for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
only when there are too many dirty and writeback pages") and lock up the
system if OOM.

This is because we are implicitly counting on falling back to
schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
schedule_timeout_uninterruptible() in should_reclaim_retry() was not
called due to __zone_watermark_ok() == false.

However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is
not called if all allocating threads but a PF_WQ_WORKER thread got stuck at
__GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER
thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER
thread is doing __GFP_FS allocation. Tetsuo is observing that GFP_NOIO
allocation request from disk_events_workfn() is preventing other pending
works from starting.

Since should_reclaim_retry() should be a natural reschedule point,
let's do the short sleep for PF_WQ_WORKER threads unconditionally
in order to guarantee that other pending works are started.

Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
Cc: Roman Gushchin <guro@xxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
---
mm/page_alloc.c | 34 ++++++++++++++++++----------------
1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a790ef4..0c2c0a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3922,6 +3922,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
{
struct zone *zone;
struct zoneref *z;
+ bool ret = false;

/*
* Costly allocations might have made a progress but this doesn't mean
@@ -3985,25 +3986,26 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
}
}

- /*
- * Memory allocation/reclaim might be called from a WQ
- * context and the current implementation of the WQ
- * concurrency control doesn't recognize that
- * a particular WQ is congested if the worker thread is
- * looping without ever sleeping. Therefore we have to
- * do a short sleep here rather than calling
- * cond_resched().
- */
- if (current->flags & PF_WQ_WORKER)
- schedule_timeout_uninterruptible(1);
- else
- cond_resched();
-
- return true;
+ ret = true;
+ goto out;
}
}

- return false;
+out:
+ /*
+ * Memory allocation/reclaim might be called from a WQ
+ * context and the current implementation of the WQ
+ * concurrency control doesn't recognize that
+ * a particular WQ is congested if the worker thread is
+ * looping without ever sleeping. Therefore we have to
+ * do a short sleep here rather than calling
+ * cond_resched().
+ */
+ if (current->flags & PF_WQ_WORKER)
+ schedule_timeout_uninterruptible(1);
+ else
+ cond_resched();
+ return ret;
}

static inline bool
--
1.8.3.1