[PATCH] oom, suspend: fix oom_reaper vs. oom_killer_disable race

From: Michal Hocko
Date: Mon Jun 20 2016 - 11:28:18 EST


Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:

(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kenrel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 not yet finished
(8) P1 perform IO/interfere with the freezer.

This patch effectivelly reverts 449d777d7ad6 ("mm, oom_reaper: clear
TIF_MEMDIE for all tasks queued for oom_reaper"). We have a more
efficient way to achieve the same. We can rely on MMF_REAPED because
"mm, oom: hide mm which is shared with kthread or global init" made sure
that we skip tasks with MMF_REAPED even when they have TIF_MEMDIE.

The only missing part is that we have to handle __oom_reap_task failure
gracefully. Currently we mark the mm by MMF_OOM_NOT_REAPABLE to later
set MMF_REAPED should we fail __oom_reap_task again. This relies on
clearing TIF_MEMDIE to select that task again. We can achieve a similar
thing by requeuing the task to the oom_reaper_list. Introduce
oom_reaper_add_tail to add the task to the tail of the oom_reaper_list.
While we are at it reuse it for wake_oom_reaper as well because the
queueing should be FIFO.

In the result only the oom victim will be responsible to clear its
TIF_MEMDIE flag so the above race is not possible anymore. oom_reaper
will be allowed to play only with MMF_* flags so the responsibility
model will be more clear.

Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---
mm/oom_kill.c | 86 ++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 55 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4c21f744daa6..dbc62a69fee8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -558,6 +558,51 @@ static bool __oom_reap_task(struct task_struct *tsk)
return ret;
}

+static void oom_reaper_add_tail(struct task_struct *tsk)
+{
+ struct task_struct *q;
+
+ spin_lock(&oom_reaper_lock);
+ if (!oom_reaper_list) {
+ oom_reaper_list = tsk;
+ goto unlock;
+ }
+
+ for (q = oom_reaper_list; q->oom_reaper_list; q = q->oom_reaper_list)
+ ;
+ q->oom_reaper_list = tsk;
+
+unlock:
+ spin_unlock(&oom_reaper_lock);
+}
+
+static inline bool requeue_oom_victim(struct task_struct *tsk)
+{
+ struct task_struct *p;
+ bool ret;
+
+ /*
+ * If we've already tried to reap this task in the past and
+ * failed it probably doesn't make much sense to try yet again
+ * so hide the mm from the oom killer so that it can move on
+ * to another task with a different mm struct.
+ */
+ p = find_lock_task_mm(tsk);
+ if (!p)
+ return false;
+
+ if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags))
+ set_bit(MMF_OOM_REAPED, &p->mm->flags);
+ else
+ ret = true;
+
+ task_unlock(p);
+
+ if (ret)
+ oom_reaper_add_tail(tsk);
+ return ret;
+}
+
#define MAX_OOM_REAP_RETRIES 10
static void oom_reap_task(struct task_struct *tsk)
{
@@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk)
while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
schedule_timeout_idle(HZ/10);

- if (attempts > MAX_OOM_REAP_RETRIES) {
- struct task_struct *p;
+ tsk->oom_reaper_list = NULL;

+ if (attempts > MAX_OOM_REAP_RETRIES) {
pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
task_pid_nr(tsk), tsk->comm);

- /*
- * If we've already tried to reap this task in the past and
- * failed it probably doesn't make much sense to try yet again
- * so hide the mm from the oom killer so that it can move on
- * to another task with a different mm struct.
- */
- p = find_lock_task_mm(tsk);
- if (p) {
- if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
- pr_info("oom_reaper: giving up pid:%d (%s)\n",
- task_pid_nr(tsk), tsk->comm);
- set_bit(MMF_OOM_REAPED, &p->mm->flags);
- }
- task_unlock(p);
+ debug_show_all_locks();
+
+ if (requeue_oom_victim(tsk)) {
+ cond_resched();
+ return;
}

- debug_show_all_locks();
+ pr_info("oom_reaper: giving up pid:%d (%s)\n",
+ task_pid_nr(tsk), tsk->comm);
}

- /*
- * Clear TIF_MEMDIE because the task shouldn't be sitting on a
- * reasonably reclaimable memory anymore or it is not a good candidate
- * for the oom victim right now because it cannot release its memory
- * itself nor by the oom reaper.
- */
- tsk->oom_reaper_list = NULL;
- exit_oom_victim(tsk);
-
/* Drop a reference taken by wake_oom_reaper */
put_task_struct(tsk);
}
@@ -637,11 +665,7 @@ void wake_oom_reaper(struct task_struct *tsk)
return;

get_task_struct(tsk);
-
- spin_lock(&oom_reaper_lock);
- tsk->oom_reaper_list = oom_reaper_list;
- oom_reaper_list = tsk;
- spin_unlock(&oom_reaper_lock);
+ oom_reaper_add_tail(tsk);
wake_up(&oom_reaper_wait);
}

--
2.8.1

--
Michal Hocko
SUSE Labs