Re: [RFC PATCH v3 00/16] Core scheduling v3

From: Julien Desfossez
Date: Thu Jun 06 2019 - 11:30:57 EST


On 31-May-2019 05:08:16 PM, Julien Desfossez wrote:
> > My first reaction is: when the shell wakes up from sleep, it will
> > fork date. If the script is untagged and those workloads are
> > tagged and all available cores are already running workload
> > threads, the forked date can lose to the running workload
> > threads because __prio_less() can't properly compare vruntime
> > for tasks on different CPUs. So those idle siblings can't run
> > date and are idled instead. See my previous post on this:
> >
> > https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> > (Now that I re-read my post, I see that I didn't make it clear
> > that se_bash and se_hog are assigned different tags, e.g. hog is
> > tagged and bash is untagged.)
> >
> > Siblings being forced idle is expected due to the nature of core
> > scheduling, but when two tasks belonging to two siblings are
> > fighting for schedule, we should let the higher priority one win.
> >
> > It probably used to work on v2 because we mistakenly allowed
> > differently tagged tasks to schedule on the same core at the
> > same time, but that is fixed in v3.
>
> I confirm this is indeed what is happening, we reproduced it with a
> simple script that only uses one core (cpus 2 and 38 are siblings on
> this machine):
>
> setup:
> cgcreate -g cpu,cpuset:test
> cgcreate -g cpu,cpuset:test/set1
> cgcreate -g cpu,cpuset:test/set2
> echo 2,38 > /sys/fs/cgroup/cpuset/test/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/test/cpuset.mems
> echo 2,38 > /sys/fs/cgroup/cpuset/test/set1/cpuset.cpus
> echo 2,38 > /sys/fs/cgroup/cpuset/test/set2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/test/set1/cpuset.mems
> echo 0 > /sys/fs/cgroup/cpuset/test/set2/cpuset.mems
> echo 1 > /sys/fs/cgroup/cpu,cpuacct/test/set1/cpu.tag
>
> In one terminal:
> sudo cgexec -g cpu,cpuset:test/set1 sysbench --threads=1 --time=30
> --test=cpu run
>
> In another one:
> sudo cgexec -g cpu,cpuset:test/set2 date
>
> It's very clear that 'date' hangs until sysbench is done.
>
> We started experimenting with marking a task on the forced idle sibling
> if normalized vruntimes are equal. That way, at the next compare, if the
> normalized vruntimes are still equal, it prefers the task on the forced
> idle sibling. It still needs more work, but in our early tests it helps.

As mentioned above, we have come up with a fix for the long starvation
of untagged interactive threads competing for the same core with tagged
threads at the same priority. The idea is to detect the stall and boost
the stalling thread's priority so that it gets a chance to run next
time. Boosting is done with a new per-task counter (core_vruntime_boost)
which we subtract from the vruntime before comparison. The new logic
looks like this:

If we see that the normalized vruntimes are equal, we check the
min_vruntime of each runqueue and give the task in the runqueue with the
smaller min_vruntime a chance, which helps it progress its vruntime.
While doing this, we boost the priority of the task on the sibling so
that we don't starve it until the min_vruntime of this runqueue catches
up.

If the min_vruntimes are also equal, we do as before and consider task
'a' to be of higher priority. Here we boost task 'b' so that it gets to
run next time.

The core_vruntime_boost counter is reset to zero once the task is on the
cpu, so only a waiting task that has been starved while matching against
a task on the other sibling will have a non-zero value.
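
To make the above easier to follow, here is a simplified userspace
sketch of the comparison described above. The struct, field and function
names are illustrative stand-ins rather than kernel code, and only the
cross-cpu path of the attached patch is mirrored:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the task_struct/cfs_rq fields the patch touches. */
struct fake_task {
	uint64_t vruntime;		/* se.vruntime */
	uint64_t rq_min_vruntime;	/* task_cfs_rq()->min_vruntime */
	unsigned int vruntime_boost;	/* core_vruntime_boost */
};

/* Return true if @a should lose to @b, i.e. @a is the lower-priority task. */
static bool fair_prio_less(struct fake_task *a, struct fake_task *b)
{
	uint64_t a_vruntime = a->vruntime - a->vruntime_boost;
	uint64_t b_vruntime = b->vruntime - b->vruntime_boost;
	int64_t min_vruntime_diff = a->rq_min_vruntime - b->rq_min_vruntime;

	/* Normalize b's vruntime against a's runqueue. */
	b_vruntime += min_vruntime_diff;

	if (a_vruntime == b_vruntime) {
		/*
		 * Tie: the task whose runqueue has the smaller min_vruntime
		 * wins, and the loser is boosted so that it wins the next
		 * comparison instead of starving.
		 */
		bool a_loses = min_vruntime_diff > 0;

		if (a_loses)
			a->vruntime_boost++;
		else
			b->vruntime_boost++;

		return a_loses;
	}

	return (int64_t)(a_vruntime - b_vruntime) > 0;
}

int main(void)
{
	/* A tagged cpu hog vs. a freshly woken task on the sibling. */
	struct fake_task hog  = { .vruntime = 1000, .rq_min_vruntime = 1000 };
	struct fake_task date = { .vruntime = 2000, .rq_min_vruntime = 2000 };

	/* First compare ties after normalization: hog wins, date is boosted. */
	printf("hog loses: %d\n", fair_prio_less(&hog, &date));
	/* Second compare: the boost lets date win and make progress. */
	printf("hog loses: %d\n", fair_prio_less(&hog, &date));
	return 0;
}

Compiled and run, the first comparison lets the hog win and boosts the
waiter; the second comparison then prefers the boosted waiter, which is
the behaviour the sched_feat below enables.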

The attached patch adds a sched_feat (CORESCHED_STALL_FIX) to enable the
above logic, so that you can compare the results with and without it.
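
For reference, assuming the kernel is built with CONFIG_SCHED_DEBUG and
debugfs is mounted, the feature can be toggled at runtime like any other
sched_feat():

echo CORESCHED_STALL_FIX > /sys/kernel/debug/sched_features
echo NO_CORESCHED_STALL_FIX > /sys/kernel/debug/sched_features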

What we observe with this patch is that it helps untagged interactive
tasks and fairness in general, but it increases the overhead of core
scheduling when there is contention for the CPU between tasks with
varying cpu usage. The general trend we see is that if there is a
cpu-intensive thread and multiple relatively idle threads in different
tags, the cpu-intensive task continuously yields to be fair to the
relatively idle threads whenever they become runnable. And if the
relatively idle threads make up most of the tasks in the system and are
tagged, the cpu-intensive task sees a considerable drop in performance.

If you have any feedback or creative ideas to help improve this, let us
know!

Thanks


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1a309e8..56cad0e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -642,6 +642,7 @@ struct task_struct {
struct rb_node core_node;
unsigned long core_cookie;
unsigned int core_occupation;
+ unsigned int core_vruntime_boost;
#endif

#ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 73329da..c302853 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -92,6 +92,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)

int pa = __task_prio(a), pb = __task_prio(b);

+ trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+ a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+ b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;

@@ -102,21 +106,36 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
return !dl_time_before(a->dl.deadline, b->dl.deadline);

if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
- u64 vruntime = b->se.vruntime;
-
- trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
- a->comm, a->pid, pa, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
- b->comm, b->pid, pb, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+ u64 a_vruntime = a->se.vruntime - a->core_vruntime_boost;
+ u64 b_vruntime = b->se.vruntime - b->core_vruntime_boost;

/*
* Normalize the vruntime if tasks are in different cpus.
*/
if (task_cpu(a) != task_cpu(b)) {
- vruntime -= task_cfs_rq(b)->min_vruntime;
- vruntime += task_cfs_rq(a)->min_vruntime;
+ s64 min_vruntime_diff = task_cfs_rq(a)->min_vruntime -
+ task_cfs_rq(b)->min_vruntime;
+ b_vruntime += min_vruntime_diff;
+
+ trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
+ a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
+ b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+
+ if (sched_feat(CORESCHED_STALL_FIX) &&
+ a_vruntime == b_vruntime) {
+ bool less_prio = min_vruntime_diff > 0;
+
+ if (less_prio)
+ a->core_vruntime_boost++;
+ else
+ b->core_vruntime_boost++;
+
+ return less_prio;
+
+ }
}

- return !((s64)(a->se.vruntime - vruntime) <= 0);
+ return !((s64)(a_vruntime - b_vruntime) <= 0);
}

return false;
@@ -2456,6 +2475,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_COMPACTION
p->capture_control = NULL;
#endif
+#ifdef CONFIG_SCHED_CORE
+ p->core_vruntime_boost = 0UL;
+#endif
init_numa_balancing(clone_flags, p);
}

@@ -3723,6 +3745,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
next->comm, next->pid,
next->core_cookie);

+ next->core_vruntime_boost = 0UL;
return next;
}

@@ -3835,6 +3858,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);

if (old_max) {
+ if (old_max->core_vruntime_boost)
+ old_max->core_vruntime_boost--;
+
for_each_cpu(j, smt_mask) {
if (j == i)
continue;
@@ -3905,6 +3931,7 @@ next_class:;

done:
set_next_task(rq, next);
+ next->core_vruntime_boost = 0UL;
return next;
}

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..332a092 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Prevent task stall due to vruntime comparison limitation across
+ * cpus.
+ */
+SCHED_FEAT(CORESCHED_STALL_FIX, false)