Database regression due to scheduler changes?

From: Brian Twichell
Date: Mon Nov 07 2005 - 17:17:52 EST


Hi,

We observed a 1.5% regression in an OLTP database workload going
from 2.6.13-rc4 to 2.6.13-rc5. The regression persists at least
through 2.6.14-rc5.

Through experimentation, and by examining the changes that went
into 2.6.13-rc5, we found that one straightforward change eliminates
the regression: removing the NUMA level from the CPU scheduler
domain structures.

After observing this, we collected schedstats (provided below)
to determine how the scheduler behaves differently when the NUMA
level is eliminated. It appears to us that the scheduler has more
success balancing load in that case. We tried to duplicate this
effect by changing parameters in the NUMA-level and SMP-level
domain definitions to make balancing more aggressive, but none of
those changes recovered the regression.
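
To illustrate the kind of knob we were turning (this is a sketch,
not an exact patch we tested), a more aggressive NUMA-level
definition might look roughly like the following. Field names
follow the 2.6.13-era struct sched_domain; the stock ppc64
SD_NODE_INIT lives in include/asm-ppc64/topology.h and its default
values differ from the illustrative ones shown here:

#define SD_NODE_INIT (struct sched_domain) {			\
	.span			= CPU_MASK_NONE,		\
	.parent			= NULL,				\
	.groups			= NULL,				\
	.min_interval		= 1,  /* balance more often */	\
	.max_interval		= 4,				\
	.busy_factor		= 8,  /* back off less when busy */ \
	.imbalance_pct		= 110, /* act on smaller imbalances */ \
	.cache_hot_time		= (10*1000000),			\
	.cache_nice_tries	= 1,				\
	.per_cpu_gain		= 100,				\
	.flags			= SD_LOAD_BALANCE		\
				| SD_BALANCE_NEWIDLE		\
				| SD_BALANCE_EXEC		\
				| SD_WAKE_BALANCE,		\
	.last_balance		= jiffies,			\
	.balance_interval	= 1,				\
	.nr_balance_failed	= 0,				\
}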

We suspect the regression was introduced in the scheduler changes
that went into 2.6.13-rc1. However, the regression was hidden
from us by a bug in include/asm-ppc64/topology.h that made ppc64
look non-NUMA from 2.6.13-rc1 through 2.6.13-rc4. That bug was
fixed in 2.6.13-rc5. Unfortunately the workload does not run to
completion on 2.6.12 or 2.6.13-rc1. We have measurements on
2.6.12-rc6-git7 that do not show the regression.

One alternative for fixing this in 2.6.13 would have been to #define
ARCH_HAS_SCHED_DOMAINS and introduce a ppc64-specific version of
build_sched_domains() that omits the NUMA-level domain on small
(e.g. 4-way) ppc64 systems. However, ARCH_HAS_SCHED_DOMAINS has
been removed in 2.6.14, and in any case that approach doesn't seem
very general to me.
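
For reference, the guard in such a ppc64-specific
build_sched_domains() would have looked something like the fragment
below. This is a rough sketch only, not a tested patch; the
identifiers (cpu_map, node_domains, phys_domains, and so on) follow
the 2.6.13-era kernel/sched.c as we read it and should be treated
as assumptions:

	for_each_cpu_mask(i, *cpu_map) {
		struct sched_domain *sd = NULL, *p = NULL;

#ifdef CONFIG_NUMA
		/* Only build the NUMA level on a multi-node box; on a
		 * small single-node system it is skipped entirely. */
		if (num_online_nodes() > 1) {
			sd = &per_cpu(node_domains, i);
			*sd = SD_NODE_INIT;
			sd->span = *cpu_map;
		}
#endif
		p = sd;
		sd = &per_cpu(phys_domains, i);
		*sd = SD_CPU_INIT;
		sd->span = node_to_cpumask(cpu_to_node(i));
		sd->parent = p;	/* NULL when the NUMA level is absent */

		/* ... SMT-sibling level and sched_group setup as in the
		 * generic build_sched_domains() ... */
	}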

So, at this point I am soliciting assistance from scheduler experts
in determining how this regression can be eliminated. We are keen
to keep it out of the next round of distro kernels. Simply shipping
a distro kernel with CONFIG_NUMA off isn't a viable option, because
we need NUMA support for our larger configurations.

Our system configuration is a 4-way 1.9 GHz POWER5-based server.
With SMT enabled, it presents eight online (logical) CPUs.

Below are the schedstats. The first set is with the NUMA-level
domain, the second set without it. (On this box, domain #0 should
be the SMT-sibling level, domain #1 the physical-CPU level, and
domain #2 the NUMA level.)

Cheers,
Brian Twichell

Schedstats (NUMA-level domain included)
----------------------------------------------------------------------
00:09:05--------------------------------------------------------------
2845 sys_sched_yield()
    0( 0.00%) found (only) active queue empty on current cpu
    0( 0.00%) found (only) expired queue empty on current cpu
    157( 5.52%) found both queues empty on current cpu
    2688( 94.48%) found neither queue empty on current cpu


23287180 schedule()
    1( 0.00%) switched active and expired queues
    0( 0.00%) used existing active queue

0 active_load_balance()
0 sched_balance_exec()

0.19/1.17 avg runtime/latency over all cpus (ms)

[scheduler domain #0]
1418943 load_balance()
    112240( 7.91%) called while idle
        499( 0.44%) tried but failed to move any tasks
        80433( 71.66%) found no busier group
        31308( 27.89%) succeeded in moving at least one task
            (average imbalance: 1.549)
    316022( 22.27%) called while busy
        21( 0.01%) tried but failed to move any tasks
        220440( 69.75%) found no busier group
        95561( 30.24%) succeeded in moving at least one task
            (average imbalance: 1.727)
    990681( 69.82%) called when newly idle
        533( 0.05%) tried but failed to move any tasks
        808816( 81.64%) found no busier group
        181332( 18.30%) succeeded in moving at least one task
            (average imbalance: 1.500)

0 sched_balance_exec() tried to push a task

[scheduler domain #1]
922193 load_balance()
    85822( 9.31%) called while idle
        4032( 4.70%) tried but failed to move any tasks
        70982( 82.71%) found no busier group
        10808( 12.59%) succeeded in moving at least one task
            (average imbalance: 1.348)
    27022( 2.93%) called while busy
        106( 0.39%) tried but failed to move any tasks
        25478( 94.29%) found no busier group
        1438( 5.32%) succeeded in moving at least one task
            (average imbalance: 1.712)
    809349( 87.76%) called when newly idle
        6967( 0.86%) tried but failed to move any tasks
        757097( 93.54%) found no busier group
        45285( 5.60%) succeeded in moving at least one task
            (average imbalance: 1.338)

0 sched_balance_exec() tried to push a task

[scheduler domain #2]
825662 load_balance()
    52074( 6.31%) called while idle
        17791( 34.16%) tried but failed to move any tasks
        32839( 63.06%) found no busier group
        1444( 2.77%) succeeded in moving at least one task
            (average imbalance: 1.981)
    9524( 1.15%) called while busy
        1072( 11.26%) tried but failed to move any tasks
        7654( 80.37%) found no busier group
        798( 8.38%) succeeded in moving at least one task
            (average imbalance: 2.976)
    764064( 92.54%) called when newly idle
        262831( 34.40%) tried but failed to move any tasks
        409353( 53.58%) found no busier group
        91880( 12.03%) succeeded in moving at least one task
            (average imbalance: 2.518)

0 sched_balance_exec() tried to push a task


Schedstats (NUMA-level domain eliminated)
----------------------------------------------------------------------
00:09:03--------------------------------------------------------------
2576 sys_sched_yield()
    0( 0.00%) found (only) active queue empty on current cpu
    0( 0.00%) found (only) expired queue empty on current cpu
    118( 4.58%) found both queues empty on current cpu
    2458( 95.42%) found neither queue empty on current cpu


23617887 schedule()
    1106774 goes idle
    0( 0.00%) switched active and expired queues
    0( 0.00%) used existing active queue

0 active_load_balance()
0 sched_balance_exec()

0.19/1.10 avg runtime/latency over all cpus (ms)

[scheduler domain #0]
1810988 load_balance()
    153509( 8.48%) called while idle
        680( 0.44%) tried but failed to move any tasks
        104906( 68.34%) found no busier group
        47923( 31.22%) succeeded in moving at least one task
            (average imbalance: 1.658)
    317016( 17.51%) called while busy
        30( 0.01%) tried but failed to move any tasks
        217438( 68.59%) found no busier group
        99548( 31.40%) succeeded in moving at least one task
            (average imbalance: 1.831)
    1340463( 74.02%) called when newly idle
        762( 0.06%) tried but failed to move any tasks
        1092960( 81.54%) found no busier group
        246741( 18.41%) succeeded in moving at least one task
            (average imbalance: 1.564)

0 sched_balance_exec() tried to push a task

[scheduler domain #1]
1244187 load_balance()
    111326( 8.95%) called while idle
        8396( 7.54%) tried but failed to move any tasks
        71276( 64.02%) found no busier group
        31654( 28.43%) succeeded in moving at least one task
            (average imbalance: 1.412)
    39138( 3.15%) called while busy
        220( 0.56%) tried but failed to move any tasks
        34676( 88.60%) found no busier group
        4242( 10.84%) succeeded in moving at least one task
            (average imbalance: 1.360)
    1093723( 87.91%) called when newly idle
        15971( 1.46%) tried but failed to move any tasks
        932422( 85.25%) found no busier group
        145330( 13.29%) succeeded in moving at least one task
            (average imbalance: 1.189)

0 sched_balance_exec() tried to push a task
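
For what it's worth, simply totalling the load_balance() calls that
moved at least one task gives a feel for the difference between the
two dumps. The trivial helper below hard-codes the counts from the
schedstats above (plain arithmetic, nothing more); it comes out to
roughly 460,000 successful calls with the NUMA-level domain versus
roughly 575,000 without it, over comparable ~9-minute intervals:

#include <stdio.h>

int main(void)
{
	/* "succeeded in moving at least one task", NUMA level included:
	 * domains #0, #1 and #2 from the first dump above. */
	unsigned long with_numa =
		(31308UL + 95561 + 181332) +	/* domain #0 */
		(10808UL + 1438 + 45285) +	/* domain #1 */
		(1444UL + 798 + 91880);		/* domain #2 */

	/* Same counts with the NUMA level eliminated: domains #0 and #1
	 * from the second dump above. */
	unsigned long without_numa =
		(47923UL + 99548 + 246741) +	/* domain #0 */
		(31654UL + 4242 + 145330);	/* domain #1 */

	printf("successful load_balance() calls, NUMA level included:   %lu\n",
	       with_numa);
	printf("successful load_balance() calls, NUMA level eliminated: %lu\n",
	       without_numa);
	return 0;
}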

