Re: [v10 4/6] mm, oom: introduce memory.oom_group

From: Michal Hocko
Date: Thu Oct 05 2017 - 08:06:57 EST


On Wed 04-10-17 16:46:36, Roman Gushchin wrote:
> The cgroup-aware OOM killer treats leaf memory cgroups as memory
> consumption entities and performs the victim selection by comparing
> them based on their memory footprint. Then it kills the biggest task
> inside the selected memory cgroup.
>
> But there are workloads, which are not tolerant to a such behavior.
> Killing a random task may leave the workload in a broken state.
>
> To solve this problem, memory.oom_group knob is introduced.
> It will define, whether a memory group should be treated as an
> indivisible memory consumer, compared by total memory consumption
> with other memory consumers (leaf memory cgroups and other memory
> cgroups with memory.oom_group set), and whether all belonging tasks
> should be killed if the cgroup is selected.
>
> If set on memcg A, it means that in case of system-wide OOM or
> memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
> belonging to the sub-tree of A will be killed. If OOM event is
> scoped to a descendant cgroup (A/B, for example), only tasks in
> that cgroup can be affected. OOM killer will never touch any tasks
> outside of the scope of the OOM event.
>
> Also, tasks with oom_score_adj set to -1000 will not be killed.

I would extend the last sentence with an explanation. What about the
following:
"
Also, tasks with oom_score_adj set to -1000 will not be killed because
this has been a long established way to protect a particular process
from seeing an unexpected SIGKILL from the oom killer. Ignoring this
user defined configuration might lead to data corruptions or other
misbehavior.
"
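To make the rule concrete, here is a minimal userspace sketch of a
"kill the whole group" pass that leaves protected tasks alone. Only the
value -1000 (the kernel's OOM_SCORE_ADJ_MIN) comes from the text above;
struct mock_task and mock_kill_group() are hypothetical names for
illustration and are not the code from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Same value as the kernel's OOM_SCORE_ADJ_MIN; everything else is a mock. */
#define OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical stand-in for a task, reduced to what the sketch needs. */
struct mock_task {
	int oom_score_adj;
	int killed;
};

/*
 * Mark every member of a group as killed, except the ones the user has
 * explicitly protected with oom_score_adj == OOM_SCORE_ADJ_MIN.
 * Returns the number of tasks that were marked killed.
 */
static size_t mock_kill_group(struct mock_task *tasks, size_t nr)
{
	size_t i, killed = 0;

	assert(tasks != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (tasks[i].oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;	/* protected by the user, leave it alone */
		tasks[i].killed = 1;
		killed++;
	}
	return killed;
}
```

The point of the sketch is only that protection is decided per task,
so a group kill can still leave survivors behind.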

A few comments below, mostly nitpicks, but this looks good other than
that. Once the fix mentioned in patch 3 is folded in I will ack this.

[...]

> static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> {
> - struct mem_cgroup *iter;
> + struct mem_cgroup *iter, *group = NULL;
> + long group_score = 0;
>
> oc->chosen_memcg = NULL;
> oc->chosen_points = 0;
>
> /*
> + * If OOM is memcg-wide, and the memcg has the oom_group flag set,
> + * all tasks belonging to the memcg should be killed.
> + * So, we mark the memcg as a victim.
> + */
> + if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {

We have the is_memcg_oom() helper, which is easier to read and
understand than the explicit oc->memcg check.
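A minimal userspace sketch of how the condition reads with such a
helper. The mock_* types and functions are hypothetical stand-ins, and
the helper body (a plain NULL test on oc->memcg) is my assumption about
what is_memcg_oom() does, not a copy of the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct mem_cgroup and struct oom_control. */
struct mock_memcg {
	bool oom_group;
};

struct mock_oom_control {
	struct mock_memcg *memcg;	/* NULL means a system-wide OOM */
};

/* Assumed shape of the helper: the OOM is memcg-scoped iff memcg is set. */
static bool mock_is_memcg_oom(const struct mock_oom_control *oc)
{
	assert(oc != NULL);
	return oc->memcg != NULL;
}

/* The early-exit condition, spelled with the helper instead of oc->memcg. */
static bool mock_kill_whole_scope(const struct mock_oom_control *oc)
{
	return mock_is_memcg_oom(oc) && oc->memcg->oom_group;
}
```

The behavior is identical to the open-coded check; the helper only
names the intent.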

> + oc->chosen_memcg = oc->memcg;
> + css_get(&oc->chosen_memcg->css);
> + return;
> + }
> +
> + /*
> * The oom_score is calculated for leaf memory cgroups (including
> * the root memcg).
> + * Non-leaf oom_group cgroups accumulating score of descendant
> + * leaf memory cgroups.
> */
> rcu_read_lock();
> for_each_mem_cgroup_tree(iter, root) {
> long score;
>
> + /*
> + * We don't consider non-leaf non-oom_group memory cgroups
> + * as OOM victims.
> + */
> + if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> + continue;
> +
> + /*
> + * If group is not set or we've ran out of the group's sub-tree,
> + * we should set group and reset group_score.
> + */
> + if (!group || group == root_mem_cgroup ||
> + !mem_cgroup_is_descendant(iter, group)) {
> + group = iter;
> + group_score = 0;
> + }
> +

Hmm, I thought you would go with a recursive oom_evaluate_memcg
implementation, which would result in more readable code IMHO. It is
true that we would traverse an oom_group subtree more than once. But I
do not expect very deep memcg hierarchies in the majority of workloads,
and even if we had them, this is a cold path which should focus on
readability more than on performance. Also, implementing
mem_cgroup_iter_skip_subtree shouldn't be all that hard if this ever
turns out to be a real problem.

Anyway, this is nothing really fundamental, so I will leave the
decision up to you.
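For illustration only, a userspace sketch of what a recursive
evaluation could look like on a mock tree. All mock_* names, the fixed
child array and the "usage" field are assumptions made up for the
sketch; it is not the proposed oom_evaluate_memcg. Stopping the descent
at an oom_group node plays the role of the skip-subtree idea mentioned
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_MAX_CHILDREN 4

/* Hypothetical stand-in for a memory cgroup in a small static tree. */
struct mock_memcg {
	const char *name;
	long usage;		/* footprint charged directly to this cgroup */
	bool oom_group;
	size_t nr_children;
	struct mock_memcg *children[MOCK_MAX_CHILDREN];
};

/* Best candidate found so far. */
struct mock_choice {
	struct mock_memcg *memcg;
	long points;
};

/* Total footprint of a cgroup and everything below it. */
static long mock_subtree_score(const struct mock_memcg *memcg)
{
	long score = memcg->usage;
	size_t i;

	for (i = 0; i < memcg->nr_children; i++)
		score += mock_subtree_score(memcg->children[i]);
	return score;
}

/*
 * Candidates are leaf cgroups and oom_group cgroups. An oom_group
 * cgroup is scored by its whole subtree and is treated as indivisible,
 * so the walk does not look for victims inside it.
 */
static void mock_select_victim(struct mock_memcg *memcg,
			       struct mock_choice *best)
{
	size_t i;

	assert(memcg != NULL && best != NULL);

	if (memcg->nr_children == 0 || memcg->oom_group) {
		long score = mock_subtree_score(memcg);

		if (score > best->points) {
			best->memcg = memcg;
			best->points = score;
		}
	}

	if (memcg->oom_group)
		return;

	for (i = 0; i < memcg->nr_children; i++)
		mock_select_victim(memcg->children[i], best);
}
```

With two leaves of 30 and 40 under an oom_group parent and a sibling
leaf of 50, the parent wins with 70; clear the flag and the 50 leaf
wins instead.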

> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> return oc->chosen_memcg;
>
> - /* Kill a task in the chosen memcg with the biggest memory footprint */
> - oc->chosen_points = 0;
> - oc->chosen_task = NULL;
> - mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> -
> - if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> - goto out;
> -
> - __oom_kill_process(oc->chosen_task);
> + /*
> + * If memory.oom_group is set, kill all tasks belonging to the sub-tree
> + * of the chosen memory cgroup, otherwise kill the task with the biggest
> + * memory footprint.
> + */
> + if (mem_cgroup_oom_group(oc->chosen_memcg)) {
> + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> + NULL);
> + /* We have one or more terminating processes at this point. */
> + oc->chosen_task = INFLIGHT_VICTIM;

It took me a while to realize we need this because of the
"return !!oc->chosen_task" in out_of_memory. Subtle... Also a reason to
hate the oc->chosen_* thingy. As I've said in another reply, don't worry
about this, I will probably turn my hate into a patch ;)
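A small userspace sketch of why the assignment matters. The mock_*
names and the mark_inflight switch are hypothetical and exist only to
show both outcomes; the one thing taken from the discussion is that the
caller reports success as !!oc->chosen_task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a task. */
struct mock_task {
	int pid;
};

/* Non-NULL sentinel meaning "victims exist, but no single task is chosen". */
#define MOCK_INFLIGHT_VICTIM ((struct mock_task *)(uintptr_t)-1)

struct mock_oom_control {
	struct mock_task *chosen_task;
};

/* Group kill: no individual task is ever stored in chosen_task. */
static void mock_kill_group(struct mock_oom_control *oc, bool mark_inflight)
{
	assert(oc != NULL);

	oc->chosen_task = NULL;
	/* ... every member of the group would be killed here ... */
	if (mark_inflight)
		oc->chosen_task = MOCK_INFLIGHT_VICTIM;
}

/* The caller only looks at whether chosen_task ended up non-NULL. */
static bool mock_out_of_memory(struct mock_oom_control *oc, bool mark_inflight)
{
	mock_kill_group(oc, mark_inflight);
	return !!oc->chosen_task;
}
```

Without the sentinel the caller would report that nothing was killed
even though the whole group is already on its way out.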

> + } else {
> + oc->chosen_points = 0;
> + oc->chosen_task = NULL;
> + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> + if (oc->chosen_task == NULL ||
> + oc->chosen_task == INFLIGHT_VICTIM)
> + goto out;

How can this happen? There shouldn't be any INFLIGHT_VICTIM in our memcg
because we have checked for that already. I can see how we might not
find any task, because tasks can terminate by the time we get here, but
no new oom victim should appear because we are under the oom_lock.

> +
> + __oom_kill_process(oc->chosen_task);
> + }
>
> out:
> mem_cgroup_put(oc->chosen_memcg);
> --
> 2.13.6

--
Michal Hocko
SUSE Labs