Re: [v9 3/5] mm, oom: cgroup-aware OOM killer

From: Michal Hocko
Date: Tue Oct 03 2017 - 07:48:56 EST


On Wed 27-09-17 14:09:34, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
>
> This behavior doesn't suit well the system with many running
> containers:
>
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
>
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
>
> To address these issues, the cgroup-aware OOM killer is introduced.
>
> Under OOM conditions, it looks for the biggest memory consumer:
> a leaf memory cgroup or a memory cgroup with the memory.oom_group
> option set. Then it kills either a task with the biggest memory
> footprint, either all belonging tasks, if memory.oom_group is set.
> If a cgroup has memory.oom_group set, all descendant cgroups
> implicitly inherit the memory.oom_group setting.

I think it would be better to separate oom_group into its own patch.
So this patch would just add the cgroup awareness and oom_group will
build on top of that.

Wrt. to the implicit inheritance you brought up in a separate email
thread [1]. Let me quote
: after some additional thinking I don't think anymore that implicit
: propagation of oom_group is a good idea. Let me explain: assume we
: have memcg A with memory.max and memory.oom_group set, and nested
: memcg A/B with memory.max set. Let's imagine we have an OOM event if
: A/B. What is an expected system behavior?
: We have OOM scoped to A/B, and any action should be also scoped to A/B.
: We really shouldn't touch processes which are not belonging to A/B.
: That means we should either kill the biggest process in A/B, either all
: processes in A/B. It's natural to make A/B/memory.oom_group responsible
: for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
: It really makes no sense, and makes oom_group knob really hard to describe.
:
: Also, after some off-list discussion, we've realized that memory.oom_knob
: should be delegatable. The workload should have control over it to express
: dependency between processes.

OK, I have asked about this already but I am not sure the answer was
very explicit. So let me ask again. When exactly a subtree would
disagree with the parent on oom_group? In other words when do we want a
different cleanup based on the OOM root? I am not saying this is wrong
I am just curious about a practical example.

> Tasks with oom_score_adj set to -1000 are considered as unkillable.
>
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with other leaf and oom_group memory cgroups.
> The oom_group option is not supported for the root cgroup.
> Due to memcg statistics implementation a special algorithm
> is used for estimating root cgroup oom_score: we define it
> as maximum oom_score of the belonging tasks.

[1] http://lkml.kernel.org/r/20171002124712.GA17638@xxxxxxxxxxxxxxxxxxxxxxxxxxx

[...]
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask,
> + unsigned long totalpages)
> +{
> + long points = 0;
> + int nid;
> + pg_data_t *pgdat;
> +
> + /*
> + * We don't have necessary stats for the root memcg,
> + * so we define it's oom_score as the maximum oom_score
> + * of the belonging tasks.
> + */

Why not a sum of all tasks which would more resemble what we do for
other memcgs? Sure this would require ignoring oom_score_adj so
oom_badness would have to be tweaked a bit (basically split it into
__oom_badness which calculates the value without the bias and
oom_badness on top adding the bias on top of the scaled value).

> + if (memcg == root_mem_cgroup) {
> + struct css_task_iter it;
> + struct task_struct *task;
> + long score, max_score = 0;
> +
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + score = oom_badness(task, memcg, nodemask,
> + totalpages);
> + if (score > max_score)
> + max_score = score;
> + }
> + css_task_iter_end(&it);
> +
> + return max_score;
> + }
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> + pgdat = NODE_DATA(nid);
> + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> + NR_SLAB_UNRECLAIMABLE);
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the folowing:
> + * -1: there are inflight OOM victim tasks, belonging to the memcg
> + * 0: memcg is not eligible, e.g. all belonging tasks are protected
> + * by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + * >0: memcg is eligible, and the returned value is an estimation
> + * of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask,
> + unsigned long totalpages)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int eligible = 0;
> +
> + /*
> + * Memcg is OOM eligible if there are OOM killable tasks inside.
> + *
> + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> + * as unkillable.
> + *
> + * If there are inflight OOM victim tasks inside the memcg,
> + * we return -1.
> + */
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (!eligible &&
> + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> + eligible = 1;
> +
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> + eligible = -1;
> + break;
> + }
> + }
> + css_task_iter_end(&it);
> +
> + if (eligible <= 0)
> + return eligible;
> +
> + return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> + struct mem_cgroup *iter, *parent;
> +
> + /*
> + * If OOM is memcg-wide, and the memcg or it's ancestor has
> + * the oom_group flag, simple select the memcg as a victim.
> + */
> + if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> + oc->chosen_memcg = oc->memcg;
> + css_get(&oc->chosen_memcg->css);
> + oc->chosen_points = oc->memcg->oom_score;
> + return;
> + }
> +
> + oc->chosen_memcg = NULL;
> +
> + /*
> + * The oom_score is calculated for leaf memcgs and propagated upwards
> + * by the tree.
> + *
> + * for_each_mem_cgroup_tree() walks the tree in pre-order,
> + * so we simple reset oom_score for non-lead cgroups before
> + * starting accumulating an actual value from underlying sub-tree.
> + *
> + * Root memcg is treated as a leaf memcg.
> + */
> + rcu_read_lock();
> + for_each_mem_cgroup_tree(iter, root) {
> + if (memcg_has_children(iter) && iter != root_mem_cgroup) {
> + iter->oom_score = 0;
> + continue;
> + }
> +
> + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
> + oc->totalpages);
> +
> + /*
> + * Ignore empty and non-eligible memory cgroups.
> + */
> + if (iter->oom_score == 0)
> + continue;
> +
> + /*
> + * If there are inflight OOM victims, we don't need to look
> + * further for new victims.
> + */
> + if (iter->oom_score == -1) {
> + oc->chosen_memcg = INFLIGHT_VICTIM;
> + mem_cgroup_iter_break(root, iter);
> + break;
> + }
> +
> + if (iter->oom_score > oc->chosen_points) {
> + oc->chosen_memcg = iter;
> + oc->chosen_points = iter->oom_score;
> + }
> +
> + for (parent = parent_mem_cgroup(iter); parent && parent != root;
> + parent = parent_mem_cgroup(parent)) {
> + parent->oom_score += iter->oom_score;
> +
> + if (mem_cgroup_oom_group(parent) &&
> + parent->oom_score > oc->chosen_points) {
> + oc->chosen_memcg = parent;
> + oc->chosen_points = parent->oom_score;
> + }
> + }
> + }
> +
> + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> + css_get(&oc->chosen_memcg->css);
> +
> + rcu_read_unlock();
> +}


As I've written in a private email, things will get much easier if you
get rid of memcg->oom_score and simply do the recursive oom_score
evaluation of eligible inter nodes. You would basically do
for_each_mem_cgroup_tree(root, iter) {
if (!memcg_oom_eligible(iter))
continue;

oom_score = oom_evaluate_memcg(iter, mask);
if (oom_score == -1) {
oc->chosen_memcg = INFLIGHT_VICTIM;
mem_cgroup_iter_break(root, iter);
break;
}
if (oom_score > oc->chosen_points) {
mark_new_oom_memcg(iter);
}

/* potential optimization to skip the whole subtree if
* iter is not leaf */
}

where
bool memcg_oom_eligible(struct mem_cgroup *memcg)
{
if (cgroup_has_tasks(memcg->css.cgroup))
return true;
if (mem_cgroup_oom_group(memcg))
return true;
return false;
}

unsigned long __oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
{
/* check eligible tasks - oom victims OOM_SCORE_ADJ_MIN */
/* calculate badness */
}

unsigned long oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
{
unsigned long score = 0;

if (memcg == root_mem_cgroup) {
for_each_task()
score += __oom_badness(task, mask);
return score
}

for_each_mem_cgroup_tree(memcg, iter) {
unsigned long memcg_score = __oom_evaluate_memcg(iter, mask);
if (memcg_score == -1) {
mem_cgroup_iter_break(memcg, iter);
return -1;
}
}

return score;
}

This should be also simple to split for oom_group in a separate patch
while keeping the overall code structure.
Does this make any sense to you?

[...]
> @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> __oom_kill_process(victim);
> }
>
> +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> +{
> + if (!tsk_is_oom_victim(task)) {

How can this happen?

> + get_task_struct(task);
> + __oom_kill_process(task);
> + }
> + return 0;
> +}
> +
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> + DEFAULT_RATELIMIT_BURST);
> +
> + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> + return oc->chosen_memcg;
> +
> + /* Always begin with the task with the biggest memory footprint */
> + oc->chosen_points = 0;
> + oc->chosen_task = NULL;
> + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> + goto out;
> +
> + if (__ratelimit(&oom_rs))
> + dump_header(oc, oc->chosen_task);

Hmm, does the full dump_header really apply for the new heuristic? E.g.
does it make sense to dump_tasks()? Would it make sense to print stats
of all eligible memcgs instead?

> +
> + __oom_kill_process(oc->chosen_task);
> +
> + /* If oom_group flag is set, kill all belonging tasks */
> + if (mem_cgroup_oom_group(oc->chosen_memcg))
> + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> + NULL);
> +
> + schedule_timeout_killable(1);

I would prefer if we had this timeout at a single place in
out_of_memory()

Other than that the semantic (sans oom_group which needs more
clarification) makes sense to me.
--
Michal Hocko
SUSE Labs