[PATCH] mm,page_alloc: Serialize out_of_memory() and allocation stall messages.

From: Tetsuo Handa
Date: Thu Oct 19 2017 - 06:51:16 EST


The printk() flooding problem caused by concurrent warn_alloc() calls was
already pointed out by me, and there are reports of soft lockups caused by
warn_alloc(). But this problem is left unhandled because Michal does not
like serialization from allocation path because he is worrying about
unexpected side effects and is asking to identify the root cause of soft
lockups and fix it. But at least consuming CPU resource by not serializing
concurrent printk() plays some role in the soft lockups, for currently
printk() can consume CPU resource forever as long as somebody is appending
to printk() buffer, and writing to consoles also needs CPU resource. That
is, needlessly consuming CPU resource when calling printk() has unexpected
side effects.

Although a proposal for offloading writing to consoles to a dedicated
kernel thread is in progress, it is not yet accepted. And, even after
the proposal is accepted, writing to printk() buffer faster than the
kernel thread can write to consoles will result in loss of messages.
We should refrain from "appending to printk() buffer" and "consuming CPU
resource" at the same time if possible. We should try to (and we can)
avoid appending to printk() buffer when printk() is concurrently called
for reporting the OOM killer and allocation stalls, in order to reduce
possibility of hitting soft lockups and getting unreadably-jumbled
messages.

Although avoid mixing both memory allocation stall/failure messages and
the OOM killer messages would be nice, oom_lock mutex should not be used
for this purpose, for waiting for oom_lock mutex at warn_alloc() can
prevent anybody from calling out_of_memory() from __alloc_pages_may_oom()
because currently __alloc_pages_may_oom() does not wait for oom_lock
(i.e. causes OOM lockups after all). Therefore, this patch adds a mutex
named "oom_printk_lock". Although using mutex_lock() in order to allow
printk() to use CPU resource for writing to consoles is better from the
point of view of flushing printk(), this patch uses mutex_trylock() for
allocation stall messages because Michal does not like serialization.

Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Reported-by: Cong Wang <xiyou.wangcong@xxxxxxxxx>
Reported-by: yuwang.yuwang <yuwang.yuwang@xxxxxxxxxxxxxxx>
Reported-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx>
Cc: Petr Mladek <pmladek@xxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
---
include/linux/oom.h | 1 +
mm/oom_kill.c | 5 +++++
mm/page_alloc.c | 4 +++-
3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4c..1425767 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -44,6 +44,7 @@ struct oom_control {
};

extern struct mutex oom_lock;
+extern struct mutex oom_printk_lock;

static inline void set_current_oom_origin(void)
{
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a..5aef9a6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -54,6 +54,7 @@
int sysctl_oom_dump_tasks = 1;

DEFINE_MUTEX(oom_lock);
+DEFINE_MUTEX(oom_printk_lock);

#ifdef CONFIG_NUMA
/**
@@ -1074,7 +1075,9 @@ bool out_of_memory(struct oom_control *oc)
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
get_task_struct(current);
oc->chosen = current;
+ mutex_lock(&oom_printk_lock);
oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
+ mutex_unlock(&oom_printk_lock);
return true;
}

@@ -1085,8 +1088,10 @@ bool out_of_memory(struct oom_control *oc)
panic("Out of memory and no killable processes...\n");
}
if (oc->chosen && oc->chosen != (void *)-1UL) {
+ mutex_lock(&oom_printk_lock);
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
"Memory cgroup out of memory");
+ mutex_unlock(&oom_printk_lock);
/*
* Give the killed process a good chance to exit before trying
* to allocate memory again.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97687b3..3766687 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3990,11 +3990,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
goto nopage;

/* Make sure we know about allocations which stall for too long */
- if (time_after(jiffies, alloc_start + stall_timeout)) {
+ if (time_after(jiffies, alloc_start + stall_timeout) &&
+ mutex_trylock(&oom_printk_lock)) {
warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
"page allocation stalls for %ums, order:%u",
jiffies_to_msecs(jiffies-alloc_start), order);
stall_timeout += 10 * HZ;
+ mutex_unlock(&oom_printk_lock);
}

/* Avoid recursion of direct reclaim */
--
1.8.3.1