[PATCH updated] sched/fair: core wide cfs task priority comparison

From: Aaron Lu
Date: Mon Apr 20 2020 - 04:08:14 EST


On Fri, Apr 17, 2020 at 05:40:45PM +0800, Aaron Lu wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -291,8 +280,13 @@ static int __sched_core_stopper(void *data)
> }
>
> for_each_online_cpu(cpu) {
> - if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
> - cpu_rq(cpu)->core_enabled = enabled;
> + if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
> + struct rq *rq = cpu_rq(cpu);
> +
> + rq->core_enabled = enabled;
> + if (rq->core == rq)
> + sched_core_adjust_se_vruntime(cpu);

The adjust is only needed when core scheduling is enabled while I
mistakenly called it on both enable and disable. And I come to think
normalize is a better name than adjust.

> + }
> }
>
> return 0;

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d99ea6ee7af2..7eecf590d6c0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> +void sched_core_adjust_se_vruntime(int cpu)
> +{
> + int i;
> +
> + for_each_cpu(i, cpu_smt_mask(cpu)) {
> + struct cfs_rq *cfs_rq, *sibling_cfs_rq;
> + struct sched_entity *se, *next;
> + s64 delta;
> +
> + if (i == cpu)
> + continue;
> +
> + sibling_cfs_rq = &cpu_rq(i)->cfs;
> + if (!sibling_cfs_rq->nr_running)
> + continue;
> +
> + cfs_rq = &cpu_rq(cpu)->cfs;
> + delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> + /*
> + * XXX Malicious user can create a ton of runnable tasks in root
> + * sibling_cfs_rq and cause the below vruntime normalization
> + * potentially taking a long time.
> + */

Testing on a qemu/kvm VM shows that normalizing 32268 sched entities
takes about 6ms time so I think the risk is low, therefore, I'm going to
remove the XXX comment.

(I disabled CONFIG_SCHED_AUTOGROUP and started 32268 cpuhog tasks on one
cpu using taskset, adding trace_printk() before and after the below loop
gives me:
migration/0-11 [000] d..1 674.546882: sched_core_normalize_se_vruntime: cpu5: normalize nr_running=32268
migration/0-11 [000] d..1 674.552364: sched_core_normalize_se_vruntime: cpu5: normalize done
)

> + rbtree_postorder_for_each_entry_safe(se, next,
> + &sibling_cfs_rq->tasks_timeline.rb_root,
> + run_node) {
> + se->vruntime += delta;
> + }
> + }
> }
>
> static __always_inline

I also think the patch is not to make every sched entity's vruntime core
wide but to make it possible to do core wide priority comparison for cfs
tasks so I changed the subject. Here is the updated patch: