Re: Database regression due to scheduler changes ?

From: Brian Twichell
Date: Wed Nov 09 2005 - 00:03:45 EST


Nick Piggin wrote:


> I think you are right that the NUMA domain is probably being too
> constrictive of task balancing, and that is where the regression
> is coming from.
>
> For some workloads it is definitely important to have the NUMA
> domain, because it helps spread load over memory controllers as
> well as CPUs - so I guess eliminating that domain is not a good
> long term solution.
>
> I would look at changing parameters of SD_NODE_INIT in include/
> asm-powerpc/topology.h so they are closer to SD_CPU_INIT parameters
> (ie. more aggressive).

I ran with the following:

--- topology.h.orig 2005-11-08 13:11:57.000000000 -0600
+++ topology.h 2005-11-08 13:17:15.000000000 -0600
@@ -43,11 +43,11 @@ static inline int node_to_first_cpu(int
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
- .min_interval = 8, \
- .max_interval = 32, \
- .busy_factor = 32, \
+ .min_interval = 1, \
+ .max_interval = 4, \
+ .busy_factor = 64, \
.imbalance_pct = 125, \
- .cache_hot_time = (10*1000000), \
+ .cache_hot_time = (5*1000000/2), \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_LOAD_BALANCE \
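
(If I recall correctly, cache_hot_time is in nanoseconds, since it is
compared against sched_clock() deltas, so this takes the NUMA domain's
cache-hot cutoff from (10*1000000) ns = 10 ms down to (5*1000000/2) ns
= 2.5 ms, in line with the more aggressive balancing intervals above.)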

There was no improvement in performance. The schedstats from this run follow:

2516 sys_sched_yield()
    0( 0.00%) found (only) active queue empty on current cpu
    0( 0.00%) found (only) expired queue empty on current cpu
    46( 1.83%) found both queues empty on current cpu
    2470( 98.17%) found neither queue empty on current cpu


22969106 schedule()
    694922 goes idle
    3( 0.00%) switched active and expired queues
    0( 0.00%) used existing active queue

0 active_load_balance()
0 sched_balance_exec()

0.19/1.28 avg runtime/latency over all cpus (ms)

[scheduler domain #0]
1153606 load_balance()
    82580( 7.16%) called while idle
        488( 0.59%) tried but failed to move any tasks
        63876( 77.35%) found no busier group
        18216( 22.06%) succeeded in moving at least one task
            (average imbalance: 1.526)
    317610( 27.53%) called while busy
        15( 0.00%) tried but failed to move any tasks
        220139( 69.31%) found no busier group
        97456( 30.68%) succeeded in moving at least one task
            (average imbalance: 1.752)
    753416( 65.31%) called when newly idle
        487( 0.06%) tried but failed to move any tasks
        624132( 82.84%) found no busier group
        128797( 17.10%) succeeded in moving at least one task
            (average imbalance: 1.531)

0 sched_balance_exec() tried to push a task

[scheduler domain #1]
715638 load_balance()
    68533( 9.58%) called while idle
        3140( 4.58%) tried but failed to move any tasks
        60357( 88.07%) found no busier group
        5036( 7.35%) succeeded in moving at least one task
            (average imbalance: 1.251)
    22486( 3.14%) called while busy
        64( 0.28%) tried but failed to move any tasks
        21352( 94.96%) found no busier group
        1070( 4.76%) succeeded in moving at least one task
            (average imbalance: 1.922)
    624619( 87.28%) called when newly idle
        5218( 0.84%) tried but failed to move any tasks
        591970( 94.77%) found no busier group
        27431( 4.39%) succeeded in moving at least one task
            (average imbalance: 1.382)

0 sched_balance_exec() tried to push a task

[scheduler domain #2]
685164 load_balance()
    63247( 9.23%) called while idle
        7280( 11.51%) tried but failed to move any tasks
        52200( 82.53%) found no busier group
        3767( 5.96%) succeeded in moving at least one task
            (average imbalance: 1.361)
    24729( 3.61%) called while busy
        418( 1.69%) tried but failed to move any tasks
        21025( 85.02%) found no busier group
        3286( 13.29%) succeeded in moving at least one task
            (average imbalance: 3.579)
    597188( 87.16%) called when newly idle
        67577( 11.32%) tried but failed to move any tasks
        371377( 62.19%) found no busier group
        158234( 26.50%) succeeded in moving at least one task
            (average imbalance: 2.146)

0 sched_balance_exec() tried to push a task


> I would also take a look at removing SD_WAKE_IDLE from the flags.
> This flag should make balancing more aggressive, but it can have
> problems when applied to a NUMA domain due to too much task
> movement.

Independently of the run above, I ran with the following:

--- topology.h.orig 2005-11-08 19:32:19.000000000 -0600
+++ topology.h 2005-11-08 19:34:25.000000000 -0600
@@ -53,7 +53,6 @@ static inline int node_to_first_cpu(int
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_EXEC \
| SD_BALANCE_NEWIDLE \
- | SD_WAKE_IDLE \
| SD_WAKE_BALANCE, \
.last_balance = jiffies, \
.balance_interval = 1, \

There was no improvement in performance.

I didn't expect any change in performance this time, because I
don't think the SD_WAKE_IDLE flag is effective in the NUMA
domain, due to the following code in wake_idle:

for_each_domain(cpu, sd) {
        if (sd->flags & SD_WAKE_IDLE) {
                cpus_and(tmp, sd->span, p->cpus_allowed);
                for_each_cpu_mask(i, tmp) {
                        if (idle_cpu(i))
                                return i;
                }
        }
        else
                break;
}

If I read that loop correctly, it stops at the first domain that
doesn't have SD_WAKE_IDLE set, which is the CPU domain (see
SD_CPU_INIT), so it never reaches the NUMA domain.
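
For reference, for_each_domain walks outward from the CPU's base domain
through the ->parent pointers, so the per-CPU levels come before the
NUMA level; from memory the macro in kernel/sched.c is roughly the
following (I haven't re-checked the exact form in the tree):

/* paraphrased from memory, not checked against the current source */
#define for_each_domain(cpu, domain) \
        for (domain = cpu_rq(cpu)->sd; domain; domain = domain->parent)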

Thanks for the suggestions, Nick. Andrew raises some
good questions that I will address tomorrow.

Cheers,
Brian
