Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

From: David Rientjes
Date: Mon Sep 21 2015 - 19:27:38 EST


On Fri, 18 Sep 2015, Christoph Lameter wrote:

> Subject: Allow multiple kills from the OOM killer
>
> The OOM killer currently aborts if it finds a process that already is having
> access to the reserve memory pool for exit processing. This is done so that
> the reserves are not overcommitted but on the other hand this also allows
> only one process being oom killed at the time. That process may be stuck
> in D state.
>
> Signed-off-by: Christoph Lameter <cl@xxxxxxxxx>
>
> Index: linux/mm/oom_kill.c
> ===================================================================
> --- linux.orig/mm/oom_kill.c 2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c 2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
> * This task already has access to memory reserves and is being killed.
> * Don't allow any other task to have access to the reserves.
> */
> - if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> - if (oc->order != -1)
> - return OOM_SCAN_ABORT;
> - }
> + if (test_tsk_thread_flag(task, TIF_MEMDIE))
> + return OOM_SCAN_CONTINUE;
> +
> if (!task->mm)
> return OOM_SCAN_CONTINUE;
>

If this guaranteed that the newly chosen process would exit, it would be
fine. Unfortunately, no such guarantee is possible. If a thread is holding
a contended mutex that the victim(s) require, this serial oom killer could
kill one victim after another and eventually panic the system if that
thread is OOM_DISABLE.

The solution that we have merged internally is described at
http://marc.info/?l=linux-kernel&m=144010444913702 -- we provide access to
memory reserves to processes that enter the oom killer and find that a
victim's exit has stalled, so that they may allocate. It comes along with
a test module that takes a contended mutex and ensures that forward
progress is made as long as memory reserves are not depleted. We can't
actually guarantee that memory reserves won't be depleted, but we (1) hope
that nobody is actually allocating a lot of memory before dropping a mutex
and (2) want to avoid the alternative, which is a system livelock.
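
As a rough illustration of that decision, and not the internal patch
itself, here is a minimal userspace sketch. Every name in it
(oom_model_state, oom_model_action, oom_model_decide, stall_timeout) is
hypothetical, and it assumes that a stalled exit is detected by comparing
how long a victim has been pending against a timeout; the real mechanism
is the one described at the URL above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes for a task that enters the oom killer. */
enum oom_model_action {
	OOM_MODEL_SELECT_VICTIM,	/* no victim pending: choose one */
	OOM_MODEL_WAIT_FOR_VICTIM,	/* victim pending, not yet considered stalled */
	OOM_MODEL_GRANT_RESERVES,	/* victim stalled: caller may use reserves */
};

/* Hypothetical state; none of these fields exist in the kernel. */
struct oom_model_state {
	bool victim_pending;		/* a victim has been chosen and has not exited */
	unsigned long victim_since;	/* time at which the victim was chosen */
	unsigned long stall_timeout;	/* assumed threshold for "stalled" */
};

/* Decide what a task entering the oom killer at time 'now' should do. */
enum oom_model_action oom_model_decide(const struct oom_model_state *s,
				       unsigned long now)
{
	assert(s != NULL);

	if (!s->victim_pending)
		return OOM_MODEL_SELECT_VICTIM;
	if (now - s->victim_since < s->stall_timeout)
		return OOM_MODEL_WAIT_FOR_VICTIM;
	/* The victim has not exited in time: let the caller allocate. */
	return OOM_MODEL_GRANT_RESERVES;
}
```

The sketch only grants reserves to the caller; it never selects a second
victim, which is the difference from the patch quoted above.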

This will address situations such as

	allocator			oom victim
	---------			----------
	mutex_lock(lock)
	alloc_pages(GFP_KERNEL)
					mutex_lock(lock)
					mutex_unlock(lock)
					handle SIGKILL

Without a solution such as mine, this results in a livelock: the
GFP_KERNEL allocation stalls forever waiting for the oom victim to exit,
while the victim cannot exit until it acquires the mutex that the
allocator holds. This also works if the allocator is OOM_DISABLE.
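
The timeline above can be modeled in a few lines of userspace C. This is
only a toy model under stated assumptions: the struct, its fields, and the
function name are hypothetical, memory is counted in abstract pages, and
the allocator is assumed to drop the mutex as soon as its allocation
succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the allocator / oom victim timeline. */
struct livelock_model {
	unsigned int free_pages;	/* pages available without reserves */
	unsigned int reserve_pages;	/* pages in the memory reserves */
	unsigned int pages_needed;	/* allocator's need before it unlocks */
	unsigned int victim_pages;	/* pages freed when the victim exits */
	bool allocator_has_reserves;	/* granted after a stalled exit */
};

/* Returns true if the oom victim gets to exit, false on livelock. */
bool livelock_model_victim_exits(struct livelock_model *m)
{
	assert(m != NULL);

	/* allocator: mutex_lock(lock); alloc_pages(GFP_KERNEL) */
	while (m->pages_needed > 0) {
		if (m->free_pages > 0) {
			m->free_pages--;
		} else if (m->allocator_has_reserves && m->reserve_pages > 0) {
			m->reserve_pages--;
		} else {
			/*
			 * Nothing to allocate from, and the victim is blocked
			 * on the mutex, so no memory will ever be freed.
			 */
			return false;
		}
		m->pages_needed--;
	}

	/* allocator drops the mutex (assumed to follow the allocation) */
	/* victim: mutex_lock(lock); mutex_unlock(lock); handle SIGKILL */
	m->free_pages += m->victim_pages;
	return true;
}
```

With allocator_has_reserves false and no free pages, the model never
reaches the victim's exit. With it true, the victim exits as long as
reserve_pages covers pages_needed, which mirrors the caveat that forward
progress depends on the reserves not being depleted.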

This won't handle other situations where the victim gets wedged in D state
and is not allocating memory, but the mutex case above is by far the more
common occurrence that we have dealt with.