Re: [PATCH] sched: sync with the cfs_rq when changing sched class

From: Yuyang Du
Date: Fri Aug 14 2015 - 03:10:52 EST


On Thu, Aug 13, 2015 at 05:22:12PM +0200, Peter Zijlstra wrote:
> > when did I say "don't need to consider..."?
> >
> > Doing more does not necessarily mean doing better, nor is it trivial.
> > BTW, a task that is switched_to need not have been switched_from before.
>
> Correct, there's a few corner cases we need to consider.
>
> However, I think we unconditionally call init_entity_runnable_average()
> on all tasks, regardless of their 'initial' sched class, so it should
> have a valid state.
>
> Another thing to consider is the state being very stale: suppose a task
> started life as FAIR, ran for a bit, got switched to !FAIR by means of
> sys_sched_setscheduler()/sys_sched_setattr() or similar, runs for a long
> time, and for some reason gets switched back to FAIR; we need to age
> and/or re-init things.
>
> I _think_ we can use last_update_time for that, but I've not looked too
> hard.
>
> That is, age based on last_update_time, if all 0, reinit, or somesuch.
>
>
> The most common case of switched_from()/switched_to() is Priority
> Inheritance, and that typically results in very short-lived stints as
> !FAIR, so the avg data should still be accurate by the time we return.
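
For concreteness, the age-or-re-init you describe might look roughly
like the sketch below. This is hypothetical and untested: the helper
name reset_or_age_load_avg() is made up, and the real call site in
switched_to_fair() would need more care, but the pieces it uses
(last_update_time, init_entity_runnable_average(), __update_load_avg())
all exist today:

static void reset_or_age_load_avg(struct rq *rq, struct sched_entity *se)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	if (!se->avg.last_update_time) {
		/* No valid FAIR history: start from the initial estimate. */
		init_entity_runnable_average(se);
	} else {
		/*
		 * Pure decay up to now (weight == 0, running == 0): a short
		 * !FAIR stint (PI) barely changes the avg, a long one decays
		 * it toward zero.
		 */
		__update_load_avg(cfs_rq_clock_task(cfs_rq), cpu_of(rq),
				  &se->avg, 0, 0, NULL);
	}
}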

Right, we have the following cases to consider, the cross product of:

  (queued or not) x (PI or sched_setscheduler) x (switched_from_fair or not)

After enumerating them, it simply boils down to this basic question:
given a record that has been lost for a short or a long time, what do
we predict?
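
For scale: PELT decays with a 32ms half-life, i.e. y^32 = 1/2, so the
old record loses its value quickly:

	decay(t)     = (1/2)^(t / 32ms)
	decay(1ms)  ~= 0.98                 <- short PI stint: still accurate
	decay(320ms) = (1/2)^10 ~= 1/1024   <- long stint: essentially zero

so "short" vs "long" really is the crux.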

The answer can depend on perspective, preference, etc., and it also
depends on how frequently these switches happen, which is data we do
not have. Without that, we lack the context to see what difference any
choice makes.

To address this, I think the following patch might be a helpful first step.

--
sched: Provide per-task sched class and priority change statistics

Sched class and priority changes have a substantial impact on a task,
but we do not have a quantitative understanding of how frequent they
are, which makes it difficult to work on many cases, especially corner
cases. Provide these statistics at the task level.

Signed-off-by: Yuyang Du <yuyang.du@xxxxxxxxx>
---
 include/linux/sched.h | 3 +++
 kernel/sched/core.c   | 8 +++++++-
 kernel/sched/debug.c  | 2 ++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5b50082..9111696 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1228,6 +1228,9 @@ struct sched_statistics {
 	u64			nr_wakeups_affine_attempts;
 	u64			nr_wakeups_passive;
 	u64			nr_wakeups_idle;
+
+	u64			class_change;
+	u64			priority_change;
 };
 #endif

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 655557d..b65af01 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1009,12 +1009,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
 				       int oldprio)
 {
 	if (prev_class != p->sched_class) {
+		schedstat_inc(p, se.statistics.class_change);
+		schedstat_inc(p, se.statistics.priority_change);
+
 		if (prev_class->switched_from)
 			prev_class->switched_from(rq, p);
 
 		p->sched_class->switched_to(rq, p);
-	} else if (oldprio != p->prio || dl_task(p))
+	} else if (oldprio != p->prio || dl_task(p)) {
+		schedstat_inc(p, se.statistics.priority_change);
+
 		p->sched_class->prio_changed(rq, p, oldprio);
+	}
 }
 
 void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 6415117..fac8b89 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -597,6 +597,8 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
+	P(se.statistics.class_change);
+	P(se.statistics.priority_change);
 
 	{
 		u64 avg_atom, avg_per_cpu;
--
1.7.9.5
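
With this applied (and CONFIG_SCHEDSTATS enabled), the two counters
appear in /proc/<pid>/sched next to the other se.statistics fields,
e.g. (values illustrative only):

	$ grep _change /proc/$$/sched
	se.statistics.class_change                   :                    2
	se.statistics.priority_change                :                    5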

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/