Re: [PATCH] sched: Fix numabalancing to work with isolated cpus

From: Srikar Dronamraju
Date: Wed Apr 05 2017 - 11:22:48 EST


* Michal Hocko <mhocko@xxxxxxxxxx> [2017-04-05 14:57:43]:

> On Tue 04-04-17 22:57:28, Srikar Dronamraju wrote:
> [...]
> > For example:
> > perf bench numa mem --no-data_rand_walk -p 4 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 1000
> > would call sched_setaffinity that resets the cpus_allowed mask.
> >
> > Cpus_allowed_list: 0-55,57-63,65-71,73-79,81-87,89-175
> > Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> >
> > The isolated cpus are part of the cpus allowed list. In the above case,
> > numabalancing ends up scheduling some of these tasks on isolated cpus.
>
> Why is this bad? If the task is allowed to run on isolated CPUs then why

1. kernel-parameters.txt describes isolcpus as "Isolate CPUs from the
general scheduler." So the expectation that numabalancing can schedule
tasks on them is wrong.

2. If numabalancing were disabled, the task would never run on the
isolated CPUs.

3. With the faulty behaviour, it was observed that tasks scheduled on
the isolated cpus might end up taking more time, because they never get
a chance to move back to a node which has local memory.

4. The isolated cpus may be idle at that point, but actual work may be
scheduled on them later (after numabalancing has already placed tasks
there). Since the scheduler doesn't do any load balancing on isolcpus,
they stay overloaded even if the rest of the system is completely free.
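
To make the overlap in the quoted Cpus_allowed_list output concrete, here
is a rough sketch that expands the two masks and intersects them. It
assumes the machine has CPUs 0-175 and that the CPUs missing from the
first mask are the isolated ones; both are inferred from the output
above, not stated in it. parse_cpu_list is a made-up helper name.

```python
def parse_cpu_list(text):
    """Expand a kernel-style CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        # A bare number like "8" has no "-", so last is empty: use first.
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


# Masks copied from the Cpus_allowed_list lines quoted above.
before = parse_cpu_list("0-55,57-63,65-71,73-79,81-87,89-175")
after = parse_cpu_list(
    "0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168"
)

# Assumption: CPUs 0-175 exist, and whatever the first mask leaves out
# is the isolcpus set.
isolated = set(range(176)) - before

# CPUs that are presumed isolated but allowed again by the second mask.
leaked = sorted(isolated & after)
print(leaked)
```

Under those assumptions, every CPU in leaked is one that numabalancing
may pick even though the general scheduler never balances it.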

> shouldn't its numa balancing be allowed the same? The changelog
> describes what but doesn't explain _why_ this change is needed/useful.