Re: RT sched: cpupri_vec lock contention with def_root_domain and no load balance

From: Max Krasnyansky
Date: Mon Nov 24 2008 - 16:46:26 EST


Li Zefan wrote:
> Max Krasnyansky wrote:
>> Dimitri Sivanich wrote:
>>> kernel: CPU3 root domain e0000069ecb20000
>>> kernel: CPU3 attaching sched-domain:
>>> kernel: domain 0: span 3 level NODE
>>> kernel: groups: 3
>>> kernel: CPU2 root domain e000006884a00000
>>> kernel: CPU2 attaching sched-domain:
>>> kernel: domain 0: span 2 level NODE
>>> kernel: groups: 2
>>> kernel: CPU1 root domain e000006884a20000
>>> kernel: CPU1 attaching sched-domain:
>>> kernel: domain 0: span 1 level NODE
>>> kernel: groups: 1
>>> kernel: CPU0 root domain e000006884a40000
>>> kernel: CPU0 attaching sched-domain:
>>> kernel: domain 0: span 0 level NODE
>>> kernel: groups: 0
>>>
>>> Which is the way sched_load_balance is supposed to work. You need to set
>>> sched_load_balance=0 for all cpusets containing any cpu you want to disable
>>> balancing on, otherwise some balancing will happen.
>> It won't be much of a balancing in this case, because there is just one cpu
>> per domain.
>> In other words, no, that's not how it is supposed to work. There is code in
>> cpu_attach_domain() that is supposed to remove redundant levels
>> (sd_degenerate() stuff). There is an explicit check in there for numcpus == 1.
>> btw The reason you got a different result than I did is that you have a
>> NUMA box whereas mine is UMA. I was able to reproduce the problem, though, by
>> enabling the multi-core scheduler, in which case I also get one redundant
>> domain level (CPU) with a single CPU in it.
>> So we definitely need to fix this. I'll try to poke around tomorrow and figure
>> out why the redundant level is not dropped.


> You were not using the latest kernel, were you?
>
> There was a bug in the sd degenerate code, and it has already been fixed:
> http://lkml.org/lkml/2008/11/8/10
Ah, makes sense.
The funny part is that I did see the patch before but completely forgot about it :).
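
For anyone else following the thread, the degenerate-level check we keep referring to is what cpu_attach_domain() uses to prune redundant levels. Roughly, paraphrased rather than quoted from any particular release, it looks like this:

	/*
	 * Paraphrased sketch of sd_degenerate() in kernel/sched.c (not the
	 * exact source of any particular release): a domain level is
	 * redundant when it cannot contribute any balancing or wake-up
	 * placement decisions, most obviously when it spans a single CPU.
	 */
	static int sd_degenerate(struct sched_domain *sd)
	{
		/* A level spanning one CPU has nothing to balance across. */
		if (cpus_weight(sd->span) == 1)
			return 1;

		/* Balancing flags are only meaningful with >= 2 groups. */
		if ((sd->flags & SD_LOAD_BALANCE) &&
		    sd->groups != sd->groups->next)
			return 0;

		/* Wake-up placement flags are useful even with one group. */
		if (sd->flags & (SD_WAKE_AFFINE | SD_WAKE_BALANCE))
			return 0;

		return 1;
	}

cpu_attach_domain() is supposed to drop every level for which this returns true before attaching the hierarchy, which is why a single-CPU NODE level like the ones in the log above should never survive.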

>>> So when we do that for just par3, we get the following:
>>> echo 0 > par3/cpuset.sched_load_balance
>>> kernel: cpusets: rebuild ndoms 3
>>> kernel: cpuset: domain 0 cpumask 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>>> kernel: cpuset: domain 1 cpumask 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>>> kernel: cpuset: domain 2 cpumask 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0
>>> kernel: CPU3 root domain default
>>> kernel: CPU3 attaching NULL sched-domain.
>>>
>>> So the def_root_domain is now attached for CPU 3. And we do have a NULL
>>> sched-domain, which we expect for a cpu with load balancing turned off. If
>>> we turn sched_load_balance off ('0') on each of the other cpusets (par0-2),
>>> each of those cpus would also have a NULL sched-domain attached.
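
Just to spell out why that matters for the contention in the subject line: when a cpu's domains are torn down it gets attached to the one global def_root_domain, so every cpu with balancing disabled ends up sharing a single root_domain, and with it a single cpupri and its per-vec locks. The detach path in kernel/sched.c does roughly the following (paraphrased sketch, not the exact source):

	/*
	 * Paraphrased sketch of the detach path in kernel/sched.c (not the
	 * exact source): every CPU whose domains are destroyed is attached
	 * to a NULL sched-domain and to the single global def_root_domain.
	 * All such CPUs then share one root_domain, hence one cpupri, whose
	 * cpupri_vec locks the RT push/pull logic hammers on; that is the
	 * lock contention being reported here.
	 */
	static void detach_destroy_domains(const cpumask_t *cpu_map)
	{
		int i;

		for_each_cpu_mask(i, *cpu_map)
			cpu_attach_domain(NULL, &def_root_domain, i);
		synchronize_sched();
	}
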
>> Ok. This one is a bug in cpuset.c:generate_sched_domains(). The sched domain
>> generator in cpusets should not drop domains with a single cpu in them when
>> sched_load_balance==0. I'll look at that tomorrow too.


> Do you mean the correct behavior should be as follows?
> kernel: cpusets: rebuild ndoms 4
Yes.

> But why do you think this is a bug? In generate_sched_domains(), cpusets with
> sched_load_balance==0 will be skipped:
>
> 	list_add(&top_cpuset.stack_list, &q);
> 	while (!list_empty(&q)) {
> 		...
> 		if (is_sched_load_balance(cp)) {
> 			csa[csn++] = cp;
> 			continue;
> 		}
> 		...
> 	}
>
> Correct me if I misunderstood your point.
The problem is that all the cpus in cpusets with sched_load_balance==0 end up in the default root_domain, which causes lock contention.

We can fix it either in sched.c:partition_sched_domains() or in cpuset.c:generate_sched_domains(). I'd rather fix cpusets, because the sched.c fix would be sub-optimal. See my answer to Greg on the same thread. Basically, the scheduler code would have to allocate a root_domain for each CPU even in transitional states. So I'd rather fix cpusets to generate a domain for each non-overlapping cpuset regardless of the sched_load_balance flag.
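
To make that direction a bit more concrete, the loop quoted above could keep collecting cpusets even when balancing is off, something along these lines (a hypothetical sketch only, not a patch; is_nonoverlapping_leaf() is a made-up placeholder for whatever test we end up using):

	list_add(&top_cpuset.stack_list, &q);
	while (!list_empty(&q)) {
		...
		/*
		 * Hypothetical change: also collect cpusets that have
		 * sched_load_balance disabled, as long as they do not
		 * overlap anything else we collected, so each of them
		 * still gets its own entry in the domain array and, with
		 * it, its own root_domain instead of the shared
		 * def_root_domain.  is_nonoverlapping_leaf() is a made-up
		 * placeholder, not an existing helper.
		 */
		if (is_sched_load_balance(cp) || is_nonoverlapping_leaf(cp)) {
			csa[csn++] = cp;
			continue;
		}
		...
	}

Exactly how partition_sched_domains() treats such single-CPU entries can be sorted out separately; the point of the sketch is only that those CPUs stop piling onto def_root_domain.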

Max