RE: [ 56/75] sched: Fix nohz load accounting -- again!

From: Doug Smythies
Date: Fri May 04 2012 - 18:49:14 EST


Just this morning, an issue was raised against this patch on the related Ubuntu Launchpad thread.
The complaint is that reported load averages are now too high when CPUs enter and exit idle at "high" frequency while the CPUs are otherwise very lightly loaded.
Most of the testing I did was with medium to heavy CPU load and relatively short idle periods, which is the opposite scenario.
For a quick test, I hacked up my test program and was able to reproduce the issue.
I am still trying to understand the behaviour better and to determine the lower bound of "high" frequency (I think it is around 25 Hertz and scales down from there, but I have no proof yet). I will also go back and test this scenario without the patch. I will let this list know the results, but it might be a few days.
My quick test results are attached.
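
For reference, here is a minimal sketch of the kind of reproducer involved. It is not my actual test program; the child count and the nanosleep() loop below are only meant to mirror the 5 "waiter" tasks and the ~250 Hz sleep rate visible in the output further down. Each child does essentially no work, but wakes up often enough that the CPUs bounce in and out of idle continuously.

/*
 * Minimal "waiter"-style reproducer sketch (illustrative only): fork a
 * few children that do no real work but wake up at a fixed high rate,
 * so each CPU enters and exits idle hundreds of times per second while
 * actual utilization stays very low.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define NUM_CHILDREN	5	/* assumed: matches the 5 waiter tasks in top */
#define SLEEP_HZ	250	/* assumed: matches the reported sleep frequency */

static void child_loop(void)
{
	struct timespec ts = {
		.tv_sec  = 0,
		.tv_nsec = 1000000000L / SLEEP_HZ,	/* ~4 ms */
	};

	for (;;)
		nanosleep(&ts, NULL);	/* wake, do nothing, sleep again */
}

int main(void)
{
	int i;

	for (i = 0; i < NUM_CHILDREN; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (pid == 0)
			child_loop();	/* never returns */
	}

	/* Parent just waits; watch the load average in top or uptime. */
	for (i = 0; i < NUM_CHILDREN; i++)
		wait(NULL);

	return 0;
}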

Doug Smythies

References (even though I have been told not to include links in e-mails to this list):

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/838811 <<< starting from post #52

version:
doug@s15:~/c$ uname -a
Linux s15 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
doug@s15:~/c$ cat /proc/version
Linux version 3.2.0-24-generic (buildd@yellow) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012

-----Original Message-----
From: Greg KH [mailto:gregkh@xxxxxxxxxxxxxxxxxxx]
Sent: May-04-2012 13:43
To: linux-kernel@xxxxxxxxxxxxxxx; stable@xxxxxxxxxxxxxxx
Cc: torvalds@xxxxxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; alan@xxxxxxxxxxxxxxxxxxx; Doug Smythies; Lesław Kopeć; Aman Gupta; Peter Zijlstra; Ingo Molnar; Kerin Millar
Subject: [ 56/75] sched: Fix nohz load accounting -- again!

3.3-stable review patch. If anyone has any objections, please let me know.

------------------

From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

commit c308b56b5398779cd3da0f62ab26b0453494c3d4 upstream.

Various people reported nohz load tracking still being wrecked, but Doug spotted the actual problem. We fold the nohz remainder in too soon, causing us to lose samples and under-account.

So instead of playing catch-up up-front, always do a single load-fold with whatever state we encounter and only then fold the nohz remainder and play catch-up.
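
As a rough illustration of that ordering change (a toy userspace model, not the kernel code: calc_load() is simplified, the repeated folds stand in for calc_load_n(), and the task/period counts are made up), consider one task that runs during a sample period and then the CPU goes nohz-idle for several periods:

/*
 * Toy model of why the fold order matters (not kernel code).  Old
 * order: the pending idle delta is folded in before the regular
 * sample, so the boundary sample sees 0 tasks and the run is lost.
 * New order: the regular sample still sees the task, and only then
 * is the idle stretch aged in.
 */
#include <stdio.h>

#define FSHIFT	11
#define FIXED_1	(1 << FSHIFT)	/* 1.0 in fixed-point */
#define EXP_1	1884		/* 1-minute decay factor */

static unsigned long calc_load(unsigned long load, unsigned long exp,
			       unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

int main(void)
{
	unsigned long avg_old = 0, avg_new = 0;
	int idle_periods = 4, i;

	/* old order: idle already folded, boundary sample sees 0 tasks */
	avg_old = calc_load(avg_old, EXP_1, 0);
	for (i = 0; i < idle_periods; i++)
		avg_old = calc_load(avg_old, EXP_1, 0);

	/* new order: one sample with the task counted, then the ageing */
	avg_new = calc_load(avg_new, EXP_1, 1 * FIXED_1);
	for (i = 0; i < idle_periods; i++)
		avg_new = calc_load(avg_new, EXP_1, 0);

	printf("old order: %lu.%02lu  new order: %lu.%02lu\n",
	       avg_old >> FSHIFT, ((avg_old & (FIXED_1 - 1)) * 100) >> FSHIFT,
	       avg_new >> FSHIFT, ((avg_new & (FIXED_1 - 1)) * 100) >> FSHIFT);
	return 0;
}

The old order leaves the 1-minute average at 0.00 even though a task ran for the whole period; the new order keeps the sample and then decays it, which is the under-accounting the patch fixes.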

Reported-by: Doug Smythies <dsmythies@xxxxxxxxx>
Reported-by: Lesław Kopeć <leslaw.kopec@xxxxxxxxxxxxxx>
Reported-by: Aman Gupta <aman@xxxxxxxx>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Link: http://lkml.kernel.org/n/tip-4v31etnhgg9kwd6ocgx3rxl8@xxxxxxxxxxxxxx
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
Cc: Kerin Millar <kerframil@xxxxxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

---
kernel/sched/core.c | 53 +++++++++++++++++++++++++---------------------------
1 file changed, 26 insertions(+), 27 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2266,13 +2266,10 @@ calc_load_n(unsigned long load, unsigned
* Once we've updated the global active value, we need to apply the exponential
* weights adjusted to the number of cycles missed.
*/
-static void calc_global_nohz(unsigned long ticks)
+static void calc_global_nohz(void)
{
long delta, active, n;

- if (time_before(jiffies, calc_load_update))
- return;
-
/*
* If we crossed a calc_load_update boundary, make sure to fold
* any pending idle changes, the respective CPUs might have
@@ -2284,31 +2281,25 @@ static void calc_global_nohz(unsigned lo
atomic_long_add(delta, &calc_load_tasks);

/*
- * If we were idle for multiple load cycles, apply them.
+ * It could be the one fold was all it took, we done!
*/
- if (ticks >= LOAD_FREQ) {
- n = ticks / LOAD_FREQ;
+ if (time_before(jiffies, calc_load_update + 10))
+ return;

- active = atomic_long_read(&calc_load_tasks);
- active = active > 0 ? active * FIXED_1 : 0;
+ /*
+ * Catch-up, fold however many we are behind still
+ */
+ delta = jiffies - calc_load_update - 10;
+ n = 1 + (delta / LOAD_FREQ);

- avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
- avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
- avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
+ active = atomic_long_read(&calc_load_tasks);
+ active = active > 0 ? active * FIXED_1 : 0;

- calc_load_update += n * LOAD_FREQ;
- }
+ avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
+ avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
+ avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);

- /*
- * Its possible the remainder of the above division also crosses
- * a LOAD_FREQ period, the regular check in calc_global_load()
- * which comes after this will take care of that.
- *
- * Consider us being 11 ticks before a cycle completion, and us
- * sleeping for 4*LOAD_FREQ + 22 ticks, then the above code will
- * age us 4 cycles, and the test in calc_global_load() will
- * pick up the final one.
- */
+ calc_load_update += n * LOAD_FREQ;
}
#else
void calc_load_account_idle(struct rq *this_rq)
@@ -2320,7 +2311,7 @@ static inline long calc_load_fold_idle(v
return 0;
}

-static void calc_global_nohz(unsigned long ticks)
+static void calc_global_nohz(void)
{
}
#endif
@@ -2348,8 +2339,6 @@ void calc_global_load(unsigned long tick
{
long active;

- calc_global_nohz(ticks);
-
if (time_before(jiffies, calc_load_update + 10))
return;

@@ -2361,6 +2350,16 @@ void calc_global_load(unsigned long tick
avenrun[2] = calc_load(avenrun[2], EXP_15, active);

calc_load_update += LOAD_FREQ;
+
+ /*
+ * Account one period with whatever state we found before
+ * folding in the nohz state and ageing the entire idle period.
+ *
+ * This avoids loosing a sample when we go idle between
+ * calc_load_account_active() (10 ticks ago) and now and thus
+ * under-accounting.
+ */
+ calc_global_nohz();
}

/*


Doug Smythies 2012.05.04: CPUs enter and exit idle at high frequency, but with very low actual utilization.
The majority of previous testing was at very high utilization.

Excerpt from "Top"

top - 11:57:04 up 1 day, 15:28, 3 users, load average: 0.91, 0.87, 0.76
Tasks: 144 total, 2 running, 142 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.9%us, 8.8%sy, 0.0%ni, 90.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 1.0%us, 4.8%sy, 0.0%ni, 94.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.7%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.3%us, 2.6%sy, 0.0%ni, 97.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 7956400k total, 1591044k used, 6365356k free, 292288k buffers
Swap: 8294396k total, 0k used, 8294396k free, 882488k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15970 doug 20 0 4156 88 0 S 2 0.0 0:23.23 waiter
15972 doug 20 0 4156 88 0 S 2 0.0 0:23.24 waiter
15968 doug 20 0 4156 88 0 S 2 0.0 0:23.30 waiter
15971 doug 20 0 4156 88 0 S 2 0.0 0:23.36 waiter
15969 doug 20 0 4156 88 0 S 1 0.0 0:23.51 waiter
15986 doug 20 0 17332 1332 960 R 1 0.0 0:01.24 top
1 root 20 0 24724 2616 1344 S 0 0.0 0:01.42 init

Similar excerpt from the test program output. Note: top's reported resource time (TIME+) = user CPU time + sys CPU time.

Child 0 lp 706 of 9999: Elapsed:1411.97 s. Delta: 2.00 s. user cpu: 5.03 s. sys cpu: 18.46 s. last sleep freq.: 250.000 Hz. average sleep freq.: 250.005 Hz.
Child 1 lp 706 of 9999: Elapsed:1411.98 s. Delta: 2.00 s. user cpu: 5.38 s. sys cpu: 18.31 s. last sleep freq.: 250.000 Hz. average sleep freq.: 250.004 Hz.
Child 2 lp 706 of 9999: Elapsed:1411.97 s. Delta: 2.00 s. user cpu: 5.02 s. sys cpu: 18.39 s. last sleep freq.: 250.000 Hz. average sleep freq.: 250.005 Hz.
Child 3 lp 706 of 9999: Elapsed:1411.97 s. Delta: 2.00 s. user cpu: 5.23 s. sys cpu: 18.35 s. last sleep freq.: 250.000 Hz. average sleep freq.: 250.005 Hz.
Child 4 lp 706 of 9999: Elapsed:1411.97 s. Delta: 2.00 s. user cpu: 5.25 s. sys cpu: 18.22 s. last sleep freq.: 250.000 Hz. average sleep freq.: 250.005 Hz.
Child 0 lp 707 of 9999: Elapsed:1413.97 s. Delta: 2.00 s. user cpu: 5.04 s. sys cpu: 18.47 s. last sleep freq.: 250.000 Hz. average sleep freq.: 250.005 Hz.
Child 1 lp 707 of 9999: Elapsed:1413.98 s. Delta: 2.00 s. user cpu: 5.39 s. sys cpu: 18.34 s. last sleep freq.: 250.000 Hz. average sleep freq.: 250.004 Hz.
Child 2 lp 707 of 9999: Elapsed:1413.97 s. Delta: 2.00 s. user cpu: 5.02 s. sys cpu: 18.41 s. last sleep freq.: 250.000 Hz. average sleep freq.: 250.005 Hz.
Child 3 lp 707 of 9999: Elapsed:1413.97 s. Delta: 2.00 s. user cpu: 5.24 s. sys cpu: 18.38 s. last sleep freq.: 250.000 Hz. average sleep freq.: 250.005 Hz.
Child 4 lp 707 of 9999: Elapsed:1413.97 s. Delta: 2.00 s. user cpu: 5.26 s. sys cpu: 18.24 s. last sleep freq.: 250.000 Hz. average sleep freq.: 250.005 Hz.

Resource math: ~23.5 seconds / 1414 seconds * 5 occurrences ~= 0.08 actual load average.
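
To put a number on what the average should converge to, here is a small userspace model (again a sketch of my own: the calc_load() helper is a simplified copy of the kernel routine, and FSHIFT=11, FIXED_1=2048, EXP_1=1884 are the kernel's fixed-point constants) that feeds a constant ~0.08 active load into the 1-minute average once per ~5 second LOAD_FREQ period:

/*
 * Feed a constant "active" value of ~0.08 tasks (fixed-point) into the
 * 1-minute load average once per period and watch where it settles:
 * about 0.08, roughly an order of magnitude below the ~0.9 that top is
 * reporting above.
 */
#include <stdio.h>

#define FSHIFT	11
#define FIXED_1	(1 << FSHIFT)	/* 1.0 in fixed-point */
#define EXP_1	1884		/* 1-minute decay factor */

static unsigned long calc_load(unsigned long load, unsigned long exp,
			       unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

int main(void)
{
	unsigned long avg = 0;
	unsigned long active = 8 * FIXED_1 / 100;	/* ~0.08 tasks */
	int i;

	for (i = 1; i <= 60; i++) {	/* 60 samples, ~5 minutes of wall time */
		avg = calc_load(avg, EXP_1, active);
		if (i % 12 == 0)	/* print roughly once per minute */
			printf("after %2d samples: %lu.%02lu\n", i,
			       avg >> FSHIFT,
			       ((avg & (FIXED_1 - 1)) * 100) >> FSHIFT);
	}
	return 0;
}

So a correct 1-minute average for this workload should sit near 0.08, not the 0.91 shown in the top excerpt above.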

Similar output from vmstat (but it was running at a different time than above):
(While I have a relatively high context switch rate and interrupt rate in this example, the issue exists over a wide range)

doug@s15:~/c$ vmstat -n 10 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 6366016 292320 882488 0 0 1 1 7 12 0 0 100 0
0 0 0 6366008 292320 882488 0 0 0 0 1319 2693 1 2 98 0
0 0 0 6366008 292320 882488 0 0 0 0 1270 2541 0 0 100 0
0 0 0 6366008 292324 882488 0 0 0 1 1268 2542 0 0 100 0
0 0 0 6366008 292324 882488 0 0 0 0 1313 2678 1 1 98 0
0 0 0 6366008 292324 882488 0 0 0 0 1306 2677 0 1 98 0
0 0 0 6366008 292324 882488 0 0 0 0 1339 2773 0 3 97 0
0 0 0 6366008 292324 882488 0 0 0 0 1347 2828 0 3 96 0
0 0 0 6366008 292324 882488 0 0 0 0 1324 2721 0 2 98 0
0 0 0 6366008 292328 882488 0 0 0 1 1341 2811 0 3 97 0