[PATCH] mm, vmscan: do not loop on too_many_isolated for ever
From: Michal Hocko
Date:  Thu Jan 19 2017 - 06:11:56 EST
Tetsuo Handa has reported [1] that direct reclaimers might get stuck in
too_many_isolated loop basically for ever because the last few pages on
the LRU lists are isolated by the kswapd which is stuck on fs locks when
doing the pageout. This in turn means that there is nobody to actually
trigger the oom killer and the system is basically unusable.
too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
from pre-mature oom killer invocations because back then no reclaim
progress could indeed trigger the OOM killer too early. But since the
oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
the allocation/reclaim retry loop considers all the reclaimable pages
and throttles the allocation at that layer so we can loosen the direct
reclaim throttling.
Make shrink_inactive_list loop over too_many_isolated bounded and returns
immediately when the situation hasn't resolved after the first sleep.
Replace congestion_wait by a simple schedule_timeout_interruptible because
we are not really waiting on the IO congestion in this path.
Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for the reclaim
which makes progress only very slowly. This would be obvious from the oom
report as the number of isolated pages are printed there. If we ever hit
this should_reclaim_retry should consider those numbers in the evaluation
in one way or another.
[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@xxxxxxxxxxxxxxxxxxx
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---
 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a60066d4521b..d07380ba1f9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1718,9 +1718,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0
-- 
Michal Hocko
SUSE Labs