Re: [PATCH 4/5] oom: don't kill random process

From: KOSAKI Motohiro
Date: Mon May 23 2011 - 21:35:20 EST


(2011/05/24 7:32), David Rientjes wrote:
On Fri, 20 May 2011, KOSAKI Motohiro wrote:

CAI Qian reported oom-killer killed all system daemons in his
system at first if he ran fork bomb as root. The problem is,
current logic give them bonus of 3% of system ram. Example,
he has 16GB machine, then root processes have ~500MB oom
immune. It bring us crazy bad result. _all_ processes have
oom-score=1 and then, oom killer ignore process memory usage
and kill random process. This regression is caused by commit
a63d83f427 (oom: badness heuristic rewrite).

This patch changes select_bad_process() slightly. If oom points == 1,
it's a sign that the system have only root privileged processes or
similar. Thus, select_bad_process() calculate oom badness without
root bonus and select eligible process.


You said earlier that you thought it was a good idea to do a proportional
based bonus for root processes. Do you have a specific objection to
giving root processes a 1% bonus for every 10% of used memory instead?

Because it's completely another topic. You have to maek another patch.



Also, this patch move finding sacrifice child logic into
select_bad_process(). It's necessary to implement adequate
no root bonus recalculation. and it makes good side effect,
current logic doesn't behave as the doc.


This is unnecessary and just makes the oom killer egregiously long. We
are already diagnosing problems here at Google where the oom killer holds
tasklist_lock on the readside for far too long, causing other cpus waiting
for a write_lock_irq(&tasklist_lock) to encounter issues when irqs are
disabled and it is spinning. A second tasklist scan is simply a
non-starter.

[ This is also one of the reasons why we needed to introduce
mm->oom_disable_count to prevent a second, expensive tasklist scan. ]

You misunderstand the code. Both select_bad_process() and oom_kill_process()
are under tasklist_lock(). IOW, no change lock holding time.


Documentation/sysctl/vm.txt says

oom_kill_allocating_task

If this is set to non-zero, the OOM killer simply kills the task that
triggered the out-of-memory condition. This avoids the expensive
tasklist scan.

IOW, oom_kill_allocating_task shouldn't search sacrifice child.
This patch also fixes this issue.


oom_kill_allocating_task was introduced for SGI to prevent the expensive
tasklist scan, the task that is actually allocating the memory isn't
actually interesting and is usually random. This should be turned into a
documentation fix rather than changing the implementation.

No benefit. I don't take it.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/