Re: [PATCH 0/6] [RFC] Large weight differential leads to inefficient load balancing

From: Paul Turner
Date: Fri Jul 30 2010 - 15:04:35 EST


On Fri, Jul 30, 2010 at 6:32 AM, Mike Galbraith <efault@xxxxxx> wrote:
> On Thu, 2010-07-29 at 22:19 -0700, Nikhil Rao wrote:
>> Hi all,
>>
>> We have observed that a large weight differential between tasks on a runqueue
>> leads to sub-optimal machine utilization and poor load balancing. For example,
>> if you have lots of SCHED_IDLE tasks (sufficient number to keep the machine 100%
>> busy) and a few SCHED_NORMAL soaker tasks, we see that the machine has
>> significant idle time.
>>
>> The data below highlights this problem. The test machine is a 4 socket quad-core
>> box (16 cpus). These experiments were done with v2.6.35-rc6. We spawn 16
>> SCHED_IDLE soaker threads (one per-cpu) to completely fill up the machine. CPU
>> utilization numbers gathered from mpstat for 10s are:
>>
>> 03:30:24 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 03:30:25 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16234.65
>> 03:30:26 PM  all   99.88    0.06    0.06    0.00    0.00    0.00    0.00    0.00  16374.00
>> 03:30:27 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16392.00
>> 03:30:28 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16612.12
>> 03:30:29 PM  all   99.88    0.00    0.12    0.00    0.00    0.00    0.00    0.00  16375.00
>> 03:30:30 PM  all   99.94    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16440.00
>> 03:30:31 PM  all   99.81    0.00    0.19    0.00    0.00    0.00    0.00    0.00  16237.62
>> 03:30:32 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16360.00
>> 03:30:33 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16405.00
>> 03:30:34 PM  all   99.38    0.06    0.50    0.00    0.00    0.00    0.00    0.06  18881.82
>> Average:     all   99.86    0.02    0.12    0.00    0.00    0.00    0.00    0.01  16628.20
>>
>> We then spawn one SCHED_NORMAL while-1 task (the absolute number does not matter
>> so long as we introduce some large weight differential).
>>
>> 03:40:57 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 03:40:58 PM  all   83.06    0.00    0.06    0.00    0.00    0.00    0.00   16.88  14555.00
>> 03:40:59 PM  all   78.25    0.00    0.06    0.00    0.00    0.00    0.00   21.69  14527.00
>> 03:41:00 PM  all   82.71    0.06    0.06    0.00    0.00    0.00    0.00   17.17  14879.00
>> 03:41:01 PM  all   87.34    0.00    0.06    0.00    0.00    0.00    0.00   12.59  15466.00
>> 03:41:02 PM  all   80.80    0.06    0.19    0.00    0.00    0.00    0.00   18.95  14584.00
>> 03:41:03 PM  all   82.90    0.00    0.06    0.00    0.00    0.00    0.00   17.04  14570.00
>> 03:41:04 PM  all   79.45    0.00    0.06    0.00    0.00    0.00    0.00   20.49  14536.00
>> 03:41:05 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.46  14577.00
>> 03:41:06 PM  all   76.73    0.06    0.06    0.00    0.00    0.06    0.00   23.10  14594.00
>> 03:41:07 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.45  14703.03
>> Average:     all   82.31    0.02    0.08    0.00    0.00    0.01    0.00   17.59  14699.10
>
> What happens with s/SCHED_IDLE/nice 19?
>
>        -Mike
>
>

So this is also a concern of mine that I wanted to raise.

The problem observed above is a function of balancing entities which
can't meaningfully contribute to improving the observed imbalance.

While this does occur with {nice 19, SCHED_IDLE} threads, it is a more
general problem and actually much more prevalent in the group
scheduling case where the group entity weight is more arbitrary. This
is compounded by the fact that we iterate over the task_groups in load
balance and don't have a good way of differentiating in-group high/low
weight entities.

Something I have been investigating is how we can fix this problem
more generally: i.e. how we can effectively load-balance a low-weight
entity in the presence of high-weight competition. This could just as
easily be a {1000, 2} share split as a {64000, 2000} share split --
the latter won't benefit from the SCHED_IDLE optimizations but is a
realistic test case if someone is trying to provide minimal latencies
for the high-weight task.

One avenue I've been considering is: instead of load-balancing the
low-weight entities when their h_load relative to the imbalance is
'not meaningful' -- by some fitness function -- tag them for a
separate load-balancing pass. This secondary pass would only need to
be periodic in nature, and would consider only the 'tagged'
task_groups skipped by the generic load_balancing operations. At that
point it can be re-evaluated whether the load is still 'not
meaningful', and distribution towards the most idle cpu (by
utilization) can be considered.

The one caveat here is that while something like the above is fairly
general for the low-weight group-entity case, it does not extend
particularly cleanly to the non-group-scheduling case. I don't yet
have a particularly good answer for this, since without a group entity
representing the low-weight threads, tracking them for a second pass
becomes more cumbersome.