[RFC PATCH 00/13] sched: Integrating Per-entity-load-tracking with the core scheduler

From: Preeti U Murthy
Date: Thu Oct 25 2012 - 06:25:34 EST

Next message: Preeti U Murthy: "[RFC PATCH 04/13] sched:Decide group_imb using PJT's metric"
Previous message: Ivo Sieben: "[REPOST-v2] sched: Prevent wakeup to enter critical section needlessly"
Next in thread: Preeti U Murthy: "[RFC PATCH 04/13] sched:Decide group_imb using PJT's metric"

This patchset uses the per-entity-load-tracking patchset, which will soon be
available in the kernel. It is based on the tip/master tree (HEAD at
b654f92c06e562c) before the integration of the per-entity-load-tracking
patchset. Only the first 8 patches of the latest sched: per-entity-load-tracking
series have been imported into the tree from Peter's quilt series (when they
were present), to avoid the complexities of task groups and to hold back the
optimizations of this patchset for now. This patchset is based at this level.
Refer to https://lkml.org/lkml/2012/10/12/9. This series is a continuation
of the patchset at that link.

This patchset is an attempt to begin the integration of PJT's
metric with the load balancer in a stepwise fashion, and to progress based
on the consequences. This patchset has been tested with a config that excludes
CONFIG_FAIR_GROUP_SCHED.

The following issues have been considered towards this:
[NOTE: an x% task referred to in the logs and below is calculated over a
duty cycle of 10ms.]

1. Consider a scenario where there are two 10% tasks running on a cpu. The
present code will consider the load on this queue to be 2048, while
using PJT's metric the load is calculated to be <1000, rarely exceeding this
limit. Although the tasks are not contributing much to the cpu load, the
scheduler still decides to move them.

But one could argue that 'not moving one of these tasks could throttle
them. If there was an idle cpu, perhaps we could have moved them'. While the
power save mode would have been fine with not moving the task, the
performance mode would prefer not to throttle the tasks. We could strive
to strike a balance by making this decision tunable with certain parameters.
This patchset includes such tunables. This issue is addressed in Patch[1/2].

*The advantage of this behavior of PJT's metric has been demonstrated via
an experiment*. Please see the reply to this cover letter, to be posted right
away.
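
The two load figures in issue 1 can be reproduced with a rough sketch. It
assumes the usual per-entity-load-tracking scheme: time is split into periods
of about 1ms, and the contribution of each older period is decayed by a factor
y chosen so that y^32 is about 0.5. The names below (weight_load, tracked_load,
NICE_0_WEIGHT, DECAY_Y) are made up for illustration and the arithmetic is
floating point, so this is a simplified model and not the kernel code.

```c
#include <assert.h>

#define NICE_0_WEIGHT 1024.0    /* load weight of a nice-0 task */

/* Assumed decay factor per ~1ms period: DECAY_Y^32 is roughly 0.5. */
static const double DECAY_Y = 0.97857206;

/* Present metric: every runnable task counts with its full weight. */
static double weight_load(int nr_tasks)
{
    return nr_tasks * NICE_0_WEIGHT;
}

/*
 * Tracked load of one task that is runnable for busy_periods out of every
 * cycle_periods periods, sampled after nr_periods periods have elapsed.
 */
static double tracked_load(int busy_periods, int cycle_periods, int nr_periods)
{
    double runnable_sum = 0.0;  /* decayed time spent runnable */
    double period_sum = 0.0;    /* decayed total time */
    int i;

    assert(cycle_periods > 0 && nr_periods > 0);
    assert(busy_periods >= 0 && busy_periods <= cycle_periods);

    for (i = 0; i < nr_periods; i++) {
        int running = (i % cycle_periods) < busy_periods;

        /* Age the history by one period, then add the newest period. */
        runnable_sum = runnable_sum * DECAY_Y + (running ? 1.0 : 0.0);
        period_sum = period_sum * DECAY_Y + 1.0;
    }
    return NICE_0_WEIGHT * runnable_sum / period_sum;
}
```

For two 10% tasks (1 busy period in every 10), weight_load(2) gives 2048,
while 2 * tracked_load(1, 10, 1000) stays well below 1000 in this model.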

2. We need to be able to do this cautiously, as the scheduler code is too
complex. This patchset is an attempt to begin the integration of PJT's
metric with the load balancer in a stepwise fashion, and to progress based on
the consequences.
*What this patchset essentially does is replace the existing metric with
PJT's metric in two primary places of the scheduler where load balancing
decisions are made*:
1. load_balance()
2. select_task_rq_fair()

The description of the patches is below:

Patch[1/13]: This patch aims at detecting short running tasks and
preventing their movement. In update_sg_lb_stats, dismiss a sched group
as a candidate for load balancing if the load calculated by PJT's metric
says that the average load on the sched_group <= 1024+(.15*1024).
This is a tunable, which can be varied after sufficient experiments.

Patch[2/13]: In the current scheduler, greater load is analogous
to a greater number of tasks. Therefore, when the busiest group is picked
from the sched domain in update_sd_lb_stats, only the loads of the
groups are compared. If we were to use PJT's metric, a
higher load does not necessarily mean a greater number of tasks. This
patch addresses this issue.

Patch[3/13] to Patch[13/13]: Replacement of the existing metrics
that decide load balancing and select a runqueue for load
placement with PJT's metric, and subsequent usage of PJT's metric
for scheduling.
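
The check described for Patch[1/13] and the concern behind Patch[2/13] can be
sketched as follows. All names here (struct group_stats, short_task_threshold,
group_below_threshold, busiest_by_load, SHORT_TASK_MARGIN_PCT) are hypothetical
and are not taken from the patches. The last helper only shows a load-only
comparison to illustrate the problem; it is not the logic that Patch[2/13]
introduces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NICE_0_LOAD 1024UL
#define SHORT_TASK_MARGIN_PCT 15UL  /* assumed form of the tunable: 15% */

struct group_stats {
    unsigned long pjt_load;     /* average load from PJT's metric */
    unsigned int nr_running;    /* runnable tasks in the group */
};

/* 1024 + (.15 * 1024), computed in integer arithmetic (truncates). */
static unsigned long short_task_threshold(void)
{
    return NICE_0_LOAD + (NICE_0_LOAD * SHORT_TASK_MARGIN_PCT) / 100;
}

/* True if the group would be dismissed as a load balancing candidate. */
static bool group_below_threshold(const struct group_stats *sg)
{
    assert(sg != NULL);
    return sg->pjt_load <= short_task_threshold();
}

/* Load-only comparison: index of the group with the highest load. */
static size_t busiest_by_load(const struct group_stats *groups, size_t n)
{
    size_t i, busiest = 0;

    assert(groups != NULL && n > 0);
    for (i = 1; i < n; i++) {
        if (groups[i].pjt_load > groups[busiest].pjt_load)
            busiest = i;
    }
    return busiest;
}
```

With integer truncation the threshold is 1177. Given one group with a load of
1500 from a single task and another with a load of 1300 from four tasks,
busiest_by_load() picks the first group even though the second has more tasks,
which is the situation Patch[2/13] is concerned with.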

3. The primary advantage that I see in integrating PJT's metric with the core
scheduler is listed below:

1. Excluding short running tasks from being candidates for load balancing.
This would avoid unnecessary migrations when the CPU is not sufficiently
loaded. This advantage is shown in the results of the
experiment.

Run the attached workload. There are 8 threads spawned, each being a 10%
task.
The number of migrations was measured from /proc/schedstat.
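
The attached workload is not reproduced here. As a rough sketch of what an
x% task over a 10ms duty cycle could look like, the helper below busy-loops
for x% of each cycle and sleeps for the rest. The names (duty_cycle_split,
elapsed_ns, run_duty_cycle) are hypothetical and this is not the attached
test.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define NSEC_PER_MSEC 1000000L
#define NSEC_PER_SEC  1000000000L

/* Split one duty cycle into busy and idle time, both in nanoseconds. */
static void duty_cycle_split(int percent, long cycle_ms,
                             long *busy_ns, long *idle_ns)
{
    long cycle_ns = cycle_ms * NSEC_PER_MSEC;

    assert(percent >= 0 && percent <= 100 && cycle_ms > 0);
    *busy_ns = cycle_ns / 100 * percent;
    *idle_ns = cycle_ns - *busy_ns;
}

/* Nanoseconds elapsed on the monotonic clock since *start. */
static long elapsed_ns(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start->tv_sec) * NSEC_PER_SEC +
           (now.tv_nsec - start->tv_nsec);
}

/* Run nr_cycles duty cycles; returns the number of cycles completed. */
static int run_duty_cycle(int percent, long cycle_ms, int nr_cycles)
{
    long busy_ns, idle_ns;
    int done;

    duty_cycle_split(percent, cycle_ms, &busy_ns, &idle_ns);
    for (done = 0; done < nr_cycles; done++) {
        struct timespec start, idle;

        clock_gettime(CLOCK_MONOTONIC, &start);
        while (elapsed_ns(&start) < busy_ns)
            ;   /* burn CPU for the busy part of the cycle */

        idle.tv_sec = idle_ns / NSEC_PER_SEC;
        idle.tv_nsec = idle_ns % NSEC_PER_SEC;
        nanosleep(&idle, NULL);     /* sleep for the rest of the cycle */
    }
    return done;
}
```

Calling run_duty_cycle(10, 10, n) from each of 8 threads created with
pthread_create() would approximate the 8-thread, 10% workload described above.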

Machine: 1 socket 4 core pre-nehalem.

Experimental Setup:
cat /proc/schedstat > stat_initial
gcc -Wall -Wshadow -lpthread -o test test.c
cat /proc/schedstat > stat_final
The difference in the number of pull requests between these two files has
been calculated and is shown below:

Observations:
                                With_Patchset      Without_patchset
---------------------------------------------------------------------
Average_number_of_migrations    0                  46
Average_number_of_records/s     9,71,114           9,45,158

With more memory intensive workloads, a higher difference in the number of
migrations is seen without any performance compromise.

---

Preeti U Murthy (13):
sched:Prevent movement of short running tasks during load balancing
sched:Pick the apt busy sched group during load balancing
sched:Decide whether there be transfer of loads based on the PJT's metric
sched:Decide group_imb using PJT's metric
sched:Calculate imbalance using PJT's metric
sched:Changing find_busiest_queue to use PJT's metric
sched:Change move_tasks to use PJT's metric
sched:Some miscallaneous changes in load_balance
sched:Modify check_asym_packing to use PJT's metric
sched:Modify fix_small_imbalance to use PJT's metric
sched:Modify find_idlest_group to use PJT's metric
sched:Modify find_idlest_cpu to use PJT's metric
sched:Modifying wake_affine to use PJT's metric

kernel/sched/fair.c | 262 ++++++++++++++++++++++++++++++++++++---------------
1 file changed, 186 insertions(+), 76 deletions(-)

--
Preeti U Murthy

