Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

From: Tim Chen
Date: Thu May 09 2019 - 20:10:38 EST


On 5/9/19 10:50 AM, Subhra Mazumdar wrote:
>
>>> select_task_rq_* seems to be unchanged. So the search logic to find a cpu
>>> to enqueue on when a task becomes runnable is the same as before and
>>> doesn't do any kind of cookie matching.
>> Okay, that's true in the task wakeup path, and load_balance also seems to
>> pull tasks without checking cookies. But my system was not overloaded when
>> I tested this patch, so there was at most one task in each rq and on the
>> rq's rb tree, and this patch does not make a difference.
> I had the same hypothesis for my tests.
>>
>> The question is: should we do cookie checking when a task selects a CPU,
>> and when load balancing pulls a task to a CPU?
> The basic issue is keeping the CPUs busy. On an overloaded system, the
> trivial new idle balancer should be able to find a matching task in case
> of forced idle. More problematic is the lower-load scenario, when there
> aren't any matching tasks to be found but there are runnable tasks of
> other groups. Also, the wakeup code path tries to spread threads across
> cores (select_idle_core) first, which is the opposite of what core
> scheduling wants. I will re-run my tests with select_idle_core disabled,
> but the issue is that on x86 Intel systems (my test rig) the CPU ids are
> interleaved across cores, so even select_idle_cpu will balance across
> cores first. Maybe others have some better ideas?
>>
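
For reference, the kind of cookie check being discussed could be a small
helper consulted from select_task_rq_*() and can_migrate_task(). This is
only an illustrative sketch: the sched_core_cookie_match() helper and the
rq->core->core_cookie field are assumptions on top of the per-task
core_cookie this patch introduces, not code from the series.

/*
 * Illustrative only: allow a task on a CPU if either the task or the
 * core currently carries no cookie, or the cookies match.
 */
static inline bool sched_core_cookie_match(struct rq *rq,
					   struct task_struct *p)
{
	/* An untagged task or an untagged core imposes no constraint. */
	if (!p->core_cookie || !rq->core->core_cookie)
		return true;

	return p->core_cookie == rq->core->core_cookie;
}

Wakeup and load balance would then skip CPUs for which such a helper
returns false, at the cost of leaving a CPU idle when no matching task
exists, which is exactly the "keeping the CPUs busy" tension above.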

We ran an experiment on a 6-core Coffee Lake desktop to see how load
balancing behaves with core scheduling.

In a nutshell, for a steady workload like sysbench with few sleeps and
wakeups, the load balancer does a good job, right on the money. However,
when the cpus are heavily overcommitted and the load is bursty, with I/O
and lots of forks as in a kernel build, it is much harder to get tasks
placed optimally.

We set up two VMs, each in its own cgroup. In one VM, we run the
benchmark. In the other VM, we run a cpu hog task for each vcpu to
provide a constant background load.
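
The cpu hog is essentially a busy loop, one instance per vcpu inside the
VM; a minimal stand-in (illustrative, not our exact tool) would be:

/* Minimal cpu hog: burn one CPU until killed. */
int main(void)
{
	for (;;)
		;
	return 0;
}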

The HT-on case with no core scheduling is used as the performance baseline.

The Coffee Lake test system has 6 cores (12 CPUs with HT). With two VMs,
the 3, 6 and 12 vcpu-per-VM cases give 6, 12 and 24 vcpus in total, i.e.
a half occupied, fully occupied and 2x occupied system when HT is used.

Sysbench (Great for core sched)

                Core Sched              HT off
                ----------              ------
                avg perf (std dev)      avg perf (std dev)
 3vcpu/VM       +0.37% (0.18%)           -1.52% (0.17%)
 6vcpu/VM       -3.36% (2.04%)          -31.72% (0.13%)
12vcpu/VM       +1.02% (1.17%)          -31.03% (0.07%)

Kernel build (Difficult for core sched)

                Core Sched              HT off
                ----------              ------
                avg perf (std dev)      avg perf (std dev)
 3vcpu/VM       +0.05% (1.21%)           -3.66% (0.81%)
 6vcpu/VM      -30.41% (3.03%)          -40.73% (1.53%)
12vcpu/VM      -34.03% (2.77%)          -24.87% (1.22%)

Tim