[RFC] sched: The removal of idle_balance()

From: Steven Rostedt
Date: Fri Feb 15 2013 - 01:13:45 EST


I've been working on cleaning up the scheduler a little, and I moved the
call to idle_balance() from the scheduler proper into the idle class.
Benchmarks (well, hackbench) improved slightly when I did this. I was
adding some more tweaks and running perf stat on the results when I made
a mistake and noticed a drastic change.
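
To make the move concrete, here's roughly where the call ends up. This is
a hypothetical sketch of the direction of the cleanup, not the actual
series; pick_next_task_idle() and cpu_of() are from that era's kernel,
and the placement of the call is the illustrative part:

static struct task_struct *pick_next_task_idle(struct rq *rq)
{
	/*
	 * Nothing runnable: attempt a pull before committing to idle.
	 * (A real version must re-pick if idle_balance() moved a task
	 * here; that detail is elided in this sketch.)
	 */
	idle_balance(cpu_of(rq), rq);

	schedstat_inc(rq, sched_goidle);
	return rq->idle;
}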

My runs looked something like this on my i7 (4 cores plus 4 hyperthreads,
8 logical CPUs):

[root@bxtest ~]# perf stat -a -r 100 /work/c/hackbench 500
Time: 16.354
Time: 25.299
Time: 20.621
Time: 19.457
Time: 14.484
Time: 7.615
Time: 35.346
Time: 29.366
Time: 18.474
Time: 14.492
Time: 5.660
Time: 25.955
Time: 9.363
Time: 34.834
Time: 18.736
Time: 30.895
Time: 33.827
Time: 11.237
Time: 17.031
Time: 18.615
Time: 29.222
Time: 14.298
Time: 35.798
Time: 7.109
Time: 16.437
Time: 18.782
Time: 4.923
Time: 10.595
Time: 16.685
Time: 9.000
Time: 18.686
Time: 21.355
Time: 10.280
Time: 21.159
Time: 30.955
Time: 15.496
Time: 6.452
Time: 19.625
Time: 20.656
Time: 19.679
Time: 12.484
Time: 31.189
Time: 19.136
Time: 20.763
Time: 11.415
Time: 15.652
Time: 23.935
Time: 28.225
Time: 9.930
Time: 11.658
[...]

My changes improved the average by a second or two. The output from
perf stat looked like this:

Performance counter stats for '/work/c/hackbench 500' (100 runs):

199820.045583 task-clock # 8.016 CPUs utilized ( +- 5.29% ) [100.00%]
3,594,264 context-switches # 0.018 M/sec ( +- 5.94% ) [100.00%]
352,240 cpu-migrations # 0.002 M/sec ( +- 3.31% ) [100.00%]
1,006,732 page-faults # 0.005 M/sec ( +- 0.56% )
293,801,912,874 cycles # 1.470 GHz ( +- 4.20% ) [100.00%]
261,808,125,109 stalled-cycles-frontend # 89.11% frontend cycles idle ( +- 4.38% ) [100.00%]
<not supported> stalled-cycles-backend
135,521,344,089 instructions # 0.46 insns per cycle
# 1.93 stalled cycles per insn ( +- 4.37% ) [100.00%]
26,198,116,586 branches # 131.109 M/sec ( +- 4.59% ) [100.00%]
115,326,812 branch-misses # 0.44% of all branches ( +- 4.12% )

24.929136087 seconds time elapsed ( +- 5.31% )

Again, my patches made slight improvements, down to 22 and 21 seconds at best.

But then when I made a small tweak, it looked like this:

[root@bxtest ~]# perf stat -a -r 100 /work/c/hackbench 500
Time: 5.820
Time: 28.815
Time: 5.032
Time: 17.151
Time: 8.347
Time: 5.142
Time: 5.138
Time: 18.695
Time: 5.099
Time: 4.994
Time: 5.016
Time: 5.076
Time: 5.049
Time: 21.453
Time: 5.241
Time: 10.498
Time: 5.011
Time: 6.142
Time: 4.953
Time: 5.145
Time: 5.004
Time: 14.848
Time: 5.846
Time: 5.076
Time: 5.826
Time: 5.108
Time: 5.122
Time: 5.254
Time: 5.309
Time: 5.018
Time: 7.561
Time: 5.176
Time: 21.142
Time: 5.063
Time: 5.235
Time: 6.535
Time: 4.993
Time: 5.219
Time: 5.070
Time: 5.232
Time: 5.029
Time: 5.091
Time: 6.092
Time: 5.020
[...]

Performance counter stats for '/work/c/hackbench 500' (100 runs):

98258.962617 task-clock # 7.998 CPUs utilized ( +- 12.12% ) [100.00%]
2,572,651 context-switches # 0.026 M/sec ( +- 9.35% ) [100.00%]
224,004 cpu-migrations # 0.002 M/sec ( +- 5.01% ) [100.00%]
913,813 page-faults # 0.009 M/sec ( +- 0.71% )
215,927,081,108 cycles # 2.198 GHz ( +- 5.48% ) [100.00%]
189,246,626,321 stalled-cycles-frontend # 87.64% frontend cycles idle ( +- 6.07% ) [100.00%]
<not supported> stalled-cycles-backend
102,965,954,824 instructions # 0.48 insns per cycle
# 1.84 stalled cycles per insn ( +- 5.40% ) [100.00%]
19,280,914,558 branches # 196.226 M/sec ( +- 5.89% ) [100.00%]
87,284,617 branch-misses # 0.45% of all branches ( +- 5.06% )

12.285025160 seconds time elapsed ( +- 12.14% )

And it consistently looked like that. I thought to myself, geez! That
tweak made one hell of an improvement. But that tweak shouldn't have
mattered, as I had just moved some code around; things were only being
called in different places.

Looking at my change, I discovered my *bug*, which in this case happened
to be a true feature: it prevented idle_balance() from ever being called.

This is a 50% improvement, on a benchmark that stresses the scheduler!
OK, I know that hackbench isn't a real-world benchmark, but this got me
thinking. I started looking into the history of idle_balance() and
discovered that it has existed since the start of git history (2005), and
is probably older (I didn't bother checking other historical archives,
although I did find this: http://lwn.net/Articles/109371/ ). That was a
time when SMP processors were just becoming affordable for the public;
it's when I first bought my own. But they were small boxes, nothing
large. 8 CPUs was still considered huge back then (for us mere mortals).

idle_balance() implements the notion that when a CPU is about to go idle,
it should snoop around the other CPUs and pull over anything runnable
that might be available. But this pull actually hurts the task more than
it helps, as the task loses all of its cache. Just letting the normal
tick-based load balancing handle migrations saves these tasks from
constantly having their cache ripped out from underneath them.
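
For reference, this is roughly what the call does, condensed from that
era's kernel/sched/fair.c (RCU and bookkeeping elided): walk up this
CPU's sched domains and run a CPU_NEWLY_IDLE load_balance() pass at each
level, stopping as soon as a task has been pulled over.

void idle_balance(int this_cpu, struct rq *this_rq)
{
	struct sched_domain *sd;
	int pulled_task = 0;

	/* Too-short idle periods wouldn't pay for a migration anyway. */
	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		return;

	raw_spin_unlock(&this_rq->lock);
	for_each_domain(this_cpu, sd) {
		int balance = 1;

		if (sd->flags & SD_BALANCE_NEWIDLE)
			pulled_task = load_balance(this_cpu, this_rq, sd,
						   CPU_NEWLY_IDLE, &balance);
		/* Stop searching once we've pulled something over. */
		if (pulled_task)
			break;
	}
	raw_spin_lock(&this_rq->lock);
}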

With idle_balance():

perf stat -r 10 -e cache-misses /work/c/hackbench 500

Performance counter stats for '/work/c/hackbench 500' (10 runs):

720,120,346 cache-misses ( +- 9.87% )

34.445262454 seconds time elapsed ( +- 32.55% )

perf stat -r 10 -a -e sched:sched_migrate_task /work/c/hackbench 500

Performance counter stats for '/work/c/hackbench 500' (10 runs):

306,398 sched:sched_migrate_task ( +- 4.62% )

18.376370212 seconds time elapsed ( +- 14.15% )


When we remove idle_balance():

perf stat -r 10 -e cache-misses /work/c/hackbench 500

Performance counter stats for '/work/c/hackbench 500' (10 runs):

550,392,064 cache-misses ( +- 4.89% )

12.836740930 seconds time elapsed ( +- 23.53% )

perf stat -r 10 -a -e sched:sched_migrate_task /work/c/hackbench 500

Performance counter stats for '/work/c/hackbench 500' (10 runs):

219,725 sched:sched_migrate_task ( +- 2.83% )

8.019037539 seconds time elapsed ( +- 6.90% )

(cut down to just 10 runs to save time)

The cache misses dropped by ~23% (720,120,346 down to 550,392,064) and
migrations dropped by ~28% (306,398 down to 219,725). I really believe
that idle_balance() hurts performance, and not just for something like
hackbench: the aggressive migration that idle_balance() causes takes a
large toll on a process's cache.

Thinking about it some more, just because we go idle isn't reason enough
to pull a runnable task over. CPUs go idle all the time, and tasks are
woken up all the time. There's no reason we can't just wait for the sched
tick to decide it's time to do a bit of balancing. Sure, it would be nice
if the idle CPU did the work, but I think that frame of mind was an
incorrect notion from back in the early 2000s; it does not apply to
today's hardware, or perhaps it doesn't apply to the (relatively) new
CFS scheduler. If you want aggressive scheduling, make the task RT, and
it will do aggressive scheduling.
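
The tick path that would be left to do the job looks like this, again
condensed from that era's kernel/sched/fair.c: scheduler_tick() calls
trigger_load_balance() on every tick, and once a CPU's next_balance time
has passed it raises the softirq that runs the regular rebalance. That's
plenty of opportunity to even things out without yanking tasks over on
every transition to idle.

/* Called from scheduler_tick(); nohz kick logic elided. */
void trigger_load_balance(struct rq *rq, int cpu)
{
	if (time_after_eq(jiffies, rq->next_balance))
		raise_softirq(SCHED_SOFTIRQ);	/* runs rebalance_domains() */
}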

But anyway, please try it yourself. It's a really simple patch. This
isn't the final patch either; if this proves to be as big a win as
hackbench shows, the complete removal of idle_balance() would be in order.

Who knows, maybe I'm missing something and this is just a fluke with
hackbench. I'm Cc'ing the gurus of the scheduler. Maybe they can show
me why idle_balance() is correct.

Go forth and test!

-- Steve

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..a9317b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2927,9 +2927,6 @@ need_resched:
 
 	pre_schedule(rq, prev);
 
-	if (unlikely(!rq->nr_running))
-		idle_balance(cpu, rq);
-
 	put_prev_task(rq, prev);
 	next = pick_next_task(rq);
 	clear_tsk_need_resched(prev);
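
With that hunk gone, an empty runqueue simply falls through
put_prev_task() and pick_next_task() to the idle class, and any
rebalancing waits for the next tick (or a wakeup) instead.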

