Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting

From: Waiman Long
Date: Fri Aug 11 2017 - 16:19:15 EST


On 08/11/2017 12:37 PM, Tejun Heo wrote:
> In cgroup1, while cpuacct isn't actually controlling any resources, it
> is a separate controller due to combinaton of two factors -
> 1. enabling cpu controller has significant side effects, and 2. we
> have to pick one of the hierarchies to account CPU usages on. cpuacct
> controller is effectively used to designate a hierarchy to track CPU
> usages on.
>
> cgroup2's unified hierarchy removes the second reason and we can
> account basic CPU usages by default. While we can use cpuacct for
> this purpose, both its interface and implementation leave a lot to be
> desired - it collects and exposes two sources of truth which don't
> agree with each other and some of the exposed statistics don't make
> much sense. Also, it propagates all the way up the hierarchy on each
> accounting event which is unnecessary.
>
> This patch adds basic resource accounting mechanism to cgroup2's
> unified hierarchy and accounts CPU usages using it.
>
> * All accountings are done per-cpu and don't propagate immediately.
> It just bumps the per-cgroup per-cpu counters and links to the
> parent's updated list if not already on it.
>
> * On a read, the per-cpu counters are collected into the global ones
> and then propagated upwards. Only the per-cpu counters which have
> changed since the last read are propagated.
>
> * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
> prefix. Total usage is collected from scheduling events. User/sys
> breakdown is sourced from tick sampling and adjusted to the usage
> using cputime_adjuts().

Typo: cputime_adjust()

> This keeps the accounting side hot path O(1) and per-cpu and the read
> side O(nr_updated_since_last_read).
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Li Zefan <lizefan@xxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
> Documentation/cgroup-v2.txt | 9 ++
> include/linux/cgroup-defs.h | 47 ++++++
> include/linux/cgroup.h | 22 +++
> kernel/cgroup/Makefile | 2 +-
> kernel/cgroup/cgroup-internal.h | 8 +
> kernel/cgroup/cgroup.c | 24 ++-
> kernel/cgroup/stat.c | 333 ++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 442 insertions(+), 3 deletions(-)
> create mode 100644 kernel/cgroup/stat.c
>
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index dc44785..3f82169 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -886,6 +886,15 @@ All cgroup core files are prefixed with "cgroup."
> A dying cgroup can consume system resources not exceeding
> limits, which were active at the moment of cgroup deletion.
>
> + cpu.usage_usec
> + CPU time consumed in the subtree.
> +
> + cpu.user_usec
> + User CPU time consumed in the subtree.
> +
> + cpu.system_usec
> + System CPU time consumed in the subtree.
> +
>
> Controllers
> ===========
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 59e4ad9..17da9c8 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -16,6 +16,7 @@
> #include <linux/refcount.h>
> #include <linux/percpu-refcount.h>
> #include <linux/percpu-rwsem.h>
> +#include <linux/u64_stats_sync.h>
> #include <linux/workqueue.h>
> #include <linux/bpf-cgroup.h>
>
> @@ -249,6 +250,47 @@ struct css_set {
> struct rcu_head rcu_head;
> };
>
> +/*
> + * cgroup basic resource usage statistics. Accounting is done per-cpu in
> + * cgroup_cpu_stat which is then lazily propagated up the hierarchy on
> + * reads. The propagation is selective - only the cgroup_cpu_stats which
> + * have been updated since the last propagation are propagated.
> + */
> +struct cgroup_cpu_stat {
> + /*
> + * ->sync protects all the current counters. These are the only
> + * fields which get updated in the hot path.
> + */
> + struct u64_stats_sync sync;
> + struct task_cputime cputime;
> +
> + /*
> + * Snapshots at the last reading. These are used to calculate the
> + * deltas to propagate to the global counters.
> + */
> + struct task_cputime last_cputime;
> +
> + /*
> + * Child cgroups with stat updates on this cpu since the last read
> + * are linked on the parent's ->updated_children through
> + * ->updated_next.
> + *
> + * In addition to being more compact, singly-linked list pointing
> + * to the cgroup makes it unnecessary for each per-cpu struct to
> + * point back to the associated cgroup.
> + *
> + * Protected by per-cpu cgroup_cpu_stat_lock.
> + */
> + struct cgroup *updated_children; /* terminated by self cgroup */
> + struct cgroup *updated_next; /* NULL iff not on the list */
> +};
> +
> +struct cgroup_stat {
> + /* per-cpu statistics are collected into the folowing global counters */
> + struct task_cputime cputime;
> + struct prev_cputime prev_cputime;
> +};
> +
> struct cgroup {
> /* self css with NULL ->ss, points back to this cgroup */
> struct cgroup_subsys_state self;
> @@ -348,6 +390,11 @@ struct cgroup {
> */
> struct cgroup *dom_cgrp;
>
> + /* cgroup basic resource statistics */
> + struct cgroup_cpu_stat __percpu *cpu_stat;
> + struct cgroup_stat pending_stat; /* pending from children */
> + struct cgroup_stat stat;
> +
> /*
> * list of pidlists, up to two for each namespace (one for procs, one
> * for tasks); created on demand.
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index f395e02..4c82212 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -689,17 +689,39 @@ static inline void cpuacct_account_field(struct task_struct *tsk, int index,
> u64 val) {}
> #endif
>
> +void cgroup_stat_show_cputime(struct seq_file *seq, const char *prefix);
> +
> +void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec);
> +void __cgroup_account_cputime_field(struct cgroup *cgrp,
> + enum cpu_usage_stat index, u64 delta_exec);
> +
> static inline void cgroup_account_cputime(struct task_struct *task,
> u64 delta_exec)
> {
> + struct cgroup *cgrp;
> +
> cpuacct_charge(task, delta_exec);
> +
> + rcu_read_lock();
> + cgrp = task_dfl_cgroup(task);
> + if (cgroup_parent(cgrp))
> + __cgroup_account_cputime(cgrp, delta_exec);
> + rcu_read_unlock();
> }
>
> static inline void cgroup_account_cputime_field(struct task_struct *task,
> enum cpu_usage_stat index,
> u64 delta_exec)
> {
> + struct cgroup *cgrp;
> +
> cpuacct_account_field(task, index, delta_exec);
> +
> + rcu_read_lock();
> + cgrp = task_dfl_cgroup(task);
> + if (cgroup_parent(cgrp))
> + __cgroup_account_cputime_field(cgrp, index, delta_exec);
> + rcu_read_unlock();
> }
>
> #else /* CONFIG_CGROUPS */
> diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
> index ce693cc..0acee61 100644
> --- a/kernel/cgroup/Makefile
> +++ b/kernel/cgroup/Makefile
> @@ -1,4 +1,4 @@
> -obj-y := cgroup.o namespace.o cgroup-v1.o
> +obj-y := cgroup.o stat.o namespace.o cgroup-v1.o
>
> obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
> obj-$(CONFIG_CGROUP_PIDS) += pids.o
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index c167a40..f17da09 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -197,6 +197,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
> int cgroup_task_count(const struct cgroup *cgrp);
>
> /*
> + * stat.c
> + */
> +void cgroup_stat_flush(struct cgroup *cgrp);
> +int cgroup_stat_init(struct cgroup *cgrp);
> +void cgroup_stat_exit(struct cgroup *cgrp);
> +void cgroup_stat_boot(void);
> +
> +/*
> * namespace.c
> */
> extern const struct proc_ns_operations cgroupns_operations;
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c038ccf..a399a55 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -142,12 +142,14 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
> };
> #undef SUBSYS
>
> +static DEFINE_PER_CPU(struct cgroup_cpu_stat, cgrp_dfl_root_cpu_stat);
> +
> /*
> * The default hierarchy, reserved for the subsystems that are otherwise
> * unattached - it never has more than a single cgroup, and all tasks are
> * part of that cgroup.
> */
> -struct cgroup_root cgrp_dfl_root;
> +struct cgroup_root cgrp_dfl_root = { .cgrp.cpu_stat = &cgrp_dfl_root_cpu_stat };
> EXPORT_SYMBOL_GPL(cgrp_dfl_root);
>
> /*
> @@ -3296,6 +3298,8 @@ static int cgroup_stat_show(struct seq_file *seq, void *v)
> seq_printf(seq, "nr_dying_descendants %d\n",
> cgroup->nr_dying_descendants);
>
> + cgroup_stat_show_cputime(seq, "cpu.");
> +
> return 0;
> }
>
> @@ -4466,6 +4470,8 @@ static void css_free_work_fn(struct work_struct *work)
> */
> cgroup_put(cgroup_parent(cgrp));
> kernfs_put(cgrp->kn);
> + if (cgroup_on_dfl(cgrp))
> + cgroup_stat_exit(cgrp);
> kfree(cgrp);
> } else {
> /*
> @@ -4510,6 +4516,9 @@ static void css_release_work_fn(struct work_struct *work)
> /* cgroup release path */
> trace_cgroup_release(cgrp);
>
> + if (cgroup_on_dfl(cgrp))
> + cgroup_stat_flush(cgrp);
> +
> for (tcgrp = cgroup_parent(cgrp); tcgrp;
> tcgrp = cgroup_parent(tcgrp))
> tcgrp->nr_dying_descendants--;
> @@ -4696,6 +4705,12 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
> if (ret)
> goto out_free_cgrp;
>
> + if (cgroup_on_dfl(parent)) {
> + ret = cgroup_stat_init(cgrp);
> + if (ret)
> + goto out_cancel_ref;
> + }
> +
> /*
> * Temporarily set the pointer to NULL, so idr_find() won't return
> * a half-baked cgroup.
> @@ -4703,7 +4718,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
> cgrp->id = cgroup_idr_alloc(&root->cgroup_idr, NULL, 2, 0, GFP_KERNEL);
> if (cgrp->id < 0) {
> ret = -ENOMEM;
> - goto out_cancel_ref;
> + goto out_stat_exit;
> }
>
> init_cgroup_housekeeping(cgrp);
> @@ -4752,6 +4767,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
>
> return cgrp;
>
> +out_stat_exit:
> + if (cgroup_on_dfl(parent))
> + cgroup_stat_exit(cgrp);
> out_cancel_ref:
> percpu_ref_exit(&cgrp->self.refcnt);
> out_free_cgrp:
> @@ -5146,6 +5164,8 @@ int __init cgroup_init(void)
> BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
> BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
>
> + cgroup_stat_boot();
> +
> /*
> * The latency of the synchronize_sched() is too high for cgroups,
> * avoid it at the cost of forcing all readers into the slow path.
> diff --git a/kernel/cgroup/stat.c b/kernel/cgroup/stat.c
> new file mode 100644
> index 0000000..19a10b2
> --- /dev/null
> +++ b/kernel/cgroup/stat.c
> @@ -0,0 +1,333 @@
> +#include "cgroup-internal.h"
> +
> +#include <linux/sched/cputime.h>
> +
> +static DEFINE_MUTEX(cgroup_stat_mutex);
> +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_cpu_stat_lock);

If the hierarchy is large enough and the stat data hasn't been read
recently, it may take a while to accumulate all the stat data even for
one cpu in cgroup_stat_flush_locked(). So I think it will make more
sense to use regular spinlock_t instead of raw_spinlock_t.

> +
> +static struct cgroup_cpu_stat *cgroup_cpu_stat(struct cgroup *cgrp, int cpu)
> +{
> + return per_cpu_ptr(cgrp->cpu_stat, cpu);
> +}
> +
> +/**
> + * cgroup_cpu_stat_updated - keep track of updated cpu_stat
> + * @cgrp: target cgroup
> + * @cpu: cpu on which cpu_stat was updated
> + *
> + * @cgrp's cpu_stat on @cpu was updated. Put it on the parent's matching
> + * cpu_stat->updated_children list.
> + */
> +static void cgroup_cpu_stat_updated(struct cgroup *cgrp, int cpu)
> +{
> + raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
> + struct cgroup *parent;
> + unsigned long flags;
> +
> + /*
> + * Speculative already-on-list test. This may race leading to
> + * temporary inaccuracies, which is fine.
> + *
> + * Because @parent's updated_children is terminated with @parent
> + * instead of NULL, we can tell whether @cgrp is on the list by
> + * testing the next pointer for NULL.
> + */
> + if (cgroup_cpu_stat(cgrp, cpu)->updated_next)
> + return;
> +
> + raw_spin_lock_irqsave(cpu_lock, flags);
> +
> + /* put @cgrp and all ancestors on the corresponding updated lists */
> + for (parent = cgroup_parent(cgrp); parent;
> + cgrp = parent, parent = cgroup_parent(cgrp)) {
> + struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
> + struct cgroup_cpu_stat *pcstat = cgroup_cpu_stat(parent, cpu);
> +
> + /*
> + * Both additions and removals are bottom-up. If a cgroup
> + * is already in the tree, all ancestors are.
> + */
> + if (cstat->updated_next)
> + break;
> +
> + cstat->updated_next = pcstat->updated_children;
> + pcstat->updated_children = cgrp;
> + }
> +
> + raw_spin_unlock_irqrestore(cpu_lock, flags);
> +}
> +
> +/**
> + * cgroup_cpu_stat_next_updated - iterate cpu_stat updated tree
> + * @pos: current position
> + * @root: root of the tree to traversal
> + * @cpu: target cpu
> + *
> + * Walks the udpated cpu_stat tree on @cpu from @root. %NULL @pos starts
> + * the traversal and %NULL return indicates the end. During traversal,
> + * each returned cgroup is unlinked from the tree. Must be called with the
> + * matching cgroup_cpu_stat_lock held.
> + *
> + * The only ordering guarantee is that, for a parent and a child pair
> + * covered by a given traversal, if a child is visited, its parent is
> + * guaranteed to be visited afterwards.
> + */
> +static struct cgroup *cgroup_cpu_stat_next_updated(struct cgroup *pos,
> + struct cgroup *root, int cpu)

This function is trying to unwind one cgrp from the updated_children and
updated_next linkage. It is somewhat like the opposite of
cgroup_cpu_stat_updated(). I just feel like its name isn't intuitive
enough to convey what it is doing. Maybe use name like
cgroup_cpu_stat_unlink_one() to match cgroup_cpu_stat_flush_one().

Cheers,
Longman