Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

From: Michael wang
Date: Mon Jun 30 2014 - 03:36:56 EST


On 06/18/2014 12:50 PM, Michael wang wrote:
> By testing we found that after putting a benchmark (dbench) into a deep
> cpu-group, its tasks (the dbench processes) start to gather on one CPU,
> so the benchmark can only get around 100% CPU no matter how big its
> task-group's share is. Here is a link describing how to reproduce the issue:

Hi, Peter

We thought that involving too many factors would make things too
complicated, so we are trying to start over and drop the concepts of
'deep-group' and 'GENTLE_FAIR_SLEEPERS' from the idea, hoping this
makes things easier...

Let's set aside the previous discussions; for now we just want to propose
a cpu-group feature which could help dbench gain enough CPU while stress
is running, in a gentle way that the current scheduler does not yet
provide.

I'll post a new patch for that later; we're looking forward to your
comments on it :)

Regards,
Michael Wang

>
> https://lkml.org/lkml/2014/5/16/4
>
> Please note that our comparison was based on the same workload; the only
> difference is that we put the workload one level deeper. dbench then got
> only 1/3 of the CPU% it used to have, and its throughput dropped by half.
>
> dbench got less CPU because all its instances started gathering on the
> same CPU more often than before, and in that case, no matter how big
> their share is, they can only occupy one CPU.
>
> This happens because, when dbench is in a deep group, the balance between
> its gathering speed (which depends on wake-affine) and its spreading speed
> (which depends on load-balance) is broken: there are more chances to
> gather and fewer chances to spread.
>
> After putting dbench into a deep group, its representative load in the
> root group becomes smaller, which makes it harder to break the system's
> load balance. Here is a comparison between dbench's root-load and the
> load of the other system tasks (besides dbench), for example:
>
>            sg0                       sg1
>   cpu0         cpu1         cpu2         cpu3
>
>   kworker/0:0  kworker/1:0  kworker/2:0  kworker/3:0
>   kworker/0:1  kworker/1:1  kworker/2:1  kworker/3:1
>   dbench
>   dbench
>   dbench
>   dbench
>   dbench
>   dbench
>
> Here, without dbench, the load between the sched-groups is already
> balanced, that is:
>
> 4096 : 4096
>
> When dbench is in one of the three cpu-cgroups on level 1, its root-load
> is 1024/6 per instance, so we have:
>
> sg0: 4096 + 6 * (1024 / 6) = 5120
> sg1: 4096
>
> sg0 : sg1 == 5120 : 4096 == 125%
>
> This is bigger than imbalance_pct (117%, for example), so dbench spreads
> to sg1.
>
>
> When dbench is in one of the three cpu-cgroups on level 2, its root-load
> is 1024/18 per instance, and now we have:
>
> sg0: 4096 + 6 * (1024 / 18) ~= 4437
> sg1: 4096
>
> sg0 : sg1 ~= 4437 : 4096 ~= 108%
>
> This is smaller than imbalance_pct (the same 117%), so dbench keeps
> gathering in sg0.
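>
> For illustration, this is roughly the comparison the balancer ends up
> making (a standalone sketch of the arithmetic above; needs_spreading()
> and the constants are only for this example, not the kernel's actual
> find_busiest_group() code):
>
>   #include <stdio.h>
>
>   /* spread only when the busiest group exceeds the local one by imbalance_pct */
>   static int needs_spreading(unsigned long busiest, unsigned long local,
>                              unsigned int imbalance_pct)
>   {
>           return busiest * 100 > local * imbalance_pct;
>   }
>
>   int main(void)
>   {
>           unsigned long sg1 = 4096;       /* sg1 load, no dbench there */
>           unsigned int pct = 117;         /* example imbalance_pct */
>
>           /* level 1: each of the 6 instances contributes 1024/6 at root */
>           unsigned long sg0_l1 = 4096 + 6 * (1024 / 6);   /* ~5120, ~125% */
>           /* level 2: each instance contributes only 1024/18 at root */
>           unsigned long sg0_l2 = 4096 + 6 * (1024 / 18);  /* ~4437, ~108% */
>
>           printf("level 1 spreads: %d\n", needs_spreading(sg0_l1, sg1, pct)); /* 1 */
>           printf("level 2 spreads: %d\n", needs_spreading(sg0_l2, sg1, pct)); /* 0 */
>           return 0;
>   }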
>
> Thus the load-balance routine becomes less active in spreading dbench to
> other CPUs, and its processes keep gathering on one CPU for longer than
> before.
>
> This patch tries to select an 'idle' cfs_rq inside the task's cpu-group
> when no idle CPU is located by select_idle_sibling(), instead of returning
> the 'target' arbitrarily. This recheck helps us preserve the effect of
> load-balance for longer, and helps make the system more balanced.
>
> As in the example above, with the fix things now work as follows:
> 1. dbench instances will be 'balanced' inside the tg; ideally each CPU
> will have one instance.
> 2. if 1 does make the load imbalanced, the load-balance routine will do
> its job and move instances to the proper CPUs.
> 3. after 2 is done, the target CPU will always be preferred as long as
> it has only one instance.
>
> Although for tasks like dbench 2 rarely happens, combined with 3 we will
> eventually locate a good CPU for each instance, which makes things
> balanced both internally and externally.
>
> After applying this patch, the behaviour of dbench in a deep cpu-group
> becomes normal again, and the dbench throughput is back.
>
> Tested with benchmarks like ebizzy, kbench and dbench on an x86 12-CPU
> server; the patch works well and no regressions showed up.
>
> Highlight:
> Without a fix, any workload similar to dbench will face the same
> issue: the cpu-cgroup share loses its effect.
>
> This may not just be a cgroup issue; whenever we have small-load tasks
> that quickly flip wakeups between each other, they may gather.
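>
> (This 'quick flip' pattern is what the existing wakee_flips counter
> tracks; roughly, paraphrased from record_wakee() in kernel/sched/fair.c
> of this era -- details may differ between kernel versions:)
>
>   static void record_wakee(struct task_struct *p)
>   {
>           /* decay the flip count roughly once per second */
>           if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
>                   current->wakee_flips >>= 1;
>                   current->wakee_flip_decay_ts = jiffies;
>           }
>
>           /* waking a different task than last time counts as a flip */
>           if (current->last_wakee != p) {
>                   current->last_wakee = p;
>                   current->wakee_flips++;
>           }
>   }
>
> So a task whose wakee_flips exceeds sd_llc_size has recently been
> switching wakees more times than there are CPUs sharing the LLC, which
> is the pattern the patch below checks for.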
>
> Please let me know if you have any questions about either the issue or
> the fix; comments are welcome ;-)
>
> CC: Ingo Molnar <mingo@xxxxxxxxxx>
> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Michael Wang <wangyun@xxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 81 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea7d33..e1381cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4409,6 +4409,62 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> return idlest;
> }
>
> +static inline int tg_idle_cpu(struct task_group *tg, int cpu)
> +{
> +        return !tg->cfs_rq[cpu]->nr_running;
> +}
> +
> +/*
> + * Try to locate an idle CPU in the sched_domain from the tg's view.
> + *
> + * Although gathering on one CPU and spreading across CPUs may make
> + * no difference from the top-level group's view, gathering will starve
> + * the tasks: even if they have enough share to fight for CPU, they
> + * only get one battlefield, which means that no matter how big their
> + * weight is, they get at most one CPU in total.
> + *
> + * Thus when the system is busy, we filter out those tasks which can't
> + * gain help from the balance routine, and try to balance them
> + * internally with this function, so they stand a chance to show their
> + * power.
> + */
> +static int tg_idle_sibling(struct task_struct *p, int target)
> +{
> +        struct sched_domain *sd;
> +        struct sched_group *sg;
> +        int i = task_cpu(p);
> +        struct task_group *tg = task_group(p);
> +
> +        if (tg_idle_cpu(tg, target))
> +                goto done;
> +
> +        sd = rcu_dereference(per_cpu(sd_llc, target));
> +        for_each_lower_domain(sd) {
> +                sg = sd->groups;
> +                do {
> +                        if (!cpumask_intersects(sched_group_cpus(sg),
> +                                        tsk_cpus_allowed(p)))
> +                                goto next;
> +
> +                        for_each_cpu(i, sched_group_cpus(sg)) {
> +                                if (i == target || !tg_idle_cpu(tg, i))
> +                                        goto next;
> +                        }
> +
> +                        target = cpumask_first_and(sched_group_cpus(sg),
> +                                        tsk_cpus_allowed(p));
> +
> +                        goto done;
> +next:
> +                        sg = sg->next;
> +                } while (sg != sd->groups);
> +        }
> +
> +done:
> +
> +        return target;
> +}
> +
> /*
> * Try and locate an idle CPU in the sched_domain.
> */
> @@ -4417,6 +4473,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
> struct sched_domain *sd;
> struct sched_group *sg;
> int i = task_cpu(p);
> + struct sched_entity *se = task_group(p)->se[i];
>
> if (idle_cpu(target))
> return target;
> @@ -4451,6 +4508,30 @@ next:
> } while (sg != sd->groups);
> }
> done:
> +
> +        if (!idle_cpu(target)) {
> +                /*
> +                 * No idle cpu located implies the system is somewhat
> +                 * busy; usually we count on the load balance routine's
> +                 * help and just pick the target however busy it is.
> +                 *
> +                 * However, when a task belongs to a deep group (harder
> +                 * to make the root imbalanced) and flips wakees
> +                 * frequently (harder to be caught during balance), the
> +                 * load balance routine can not help, and such tasks
> +                 * will eventually gather on the same cpu when they
> +                 * wake each other up: the chance of gathering is far
> +                 * higher than the chance of spreading.
> +                 *
> +                 * Thus we need to handle such tasks carefully during
> +                 * wakeup, since wakeup is the only real chance for
> +                 * them to spread.
> +                 */
> +                if (se && se->depth &&
> +                    p->wakee_flips > this_cpu_read(sd_llc_size))
> +                        return tg_idle_sibling(p, target);
> +        }
> +
> return target;
> }
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/