RE: [3.2.16 -> 3.2.17 regression] High reported CPU load when idle

From: Doug Smythies
Date: Fri Jun 08 2012 - 13:02:16 EST


>> On 2012.05.30 07:54, Anders Boström wrote:
> On 2012.06.05 08:35, Lesław Kopeć wrote:

>> Well, I tested in single user mode, with very few processes running,
>> mostly init, getty, bash and top (+ a lot of kernel threads). And
>> 3.2.17 reported a load of >0.5 . Under the same conditions 3.2.16
>> typically reports 0.01 or 0.00 .

>> I don't know if 0.01 is *too* low, but it should be much closer to the
>> truth than >0.5.

I agree. However, the non-idle case also needs to be considered. For
example, for a real load of 5.70, a reported load average of 0 is much
further from the truth than the 5.6 being reported now.

> When the system is completely idle load drops to 0. I've also tried
> 3.2.17 with 556061b00c9f, but it makes no difference and in comparison
> to plain 3.2.17 load is the same even on a busy system.

> I can't explain why we're getting different results on the same kernels.

The different results are due to differences in the processes that are
running on those same kernels, and in particular the frequency at which
those processes do stuff and sleep. Where enough detail has been
available on various problem reports, I have always found much more CPU
activity than on my server system with no GUI. These have typically been
GUI-based "desktop" Linux systems. Where I have been able to figure
it out, the real "idle" load has been between 0.1 and 0.2 and reported
as about 0.8 to 1.2.

All of my analysis of these reported load averages has been based on
the assumption that the background load is close enough to 0 to
ignore. Obviously that assumption needed to be checked, see [1]. Also see
the attached PNG file (also posted at [2]). (Summary: my results are the
same as Lesław's.)

By the way, I found and tested commit
5aaa0b7a2ed5b12692c9ffb5222182bd558d3146.
The results are similar (minimally tested).

I am certainly not an expert, and I find the load average area of the
code extremely difficult to follow and understand. That being said, I
think the root issue here is the 10 tick grace period. I think that
CPU idle enter and exit transitions cannot be ignored during this
period, and somehow need to be accumulated towards the next sample time.
So far, I have been unsuccessful in trying to help with a suggested
solution. I will continue to try.

Disclaimers:

My web pages and notes often refer to reported load averages to two
decimal places. I agree that is ridiculous. One should only expect an
accuracy of +/- 0.1 to 0.15 at best, and that only for the 15 minute
average after it has settled. It is worse for the shorter time constants.

It is hoped that readers understand that the 15 minute reported load
average never goes below 0.05 (after it has gone above that value once).
That is simply a consequence of integer math with a finite number of bits.
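
For anyone who wants to see that floor for themselves, here is a minimal
sketch in Python of the fixed-point decay. It is not the kernel code. The
constants (FSHIFT = 11, EXP_15 = 2037), the round-to-nearest step, and the
FIXED_1/200 display offset are my recollection of the kernel source of
this era and should be treated as assumptions; the function names
format_load and decay_to_floor are hypothetical helpers for illustration.

```python
# Assumed constants, recalled from the kernel's fixed-point load math.
FSHIFT = 11              # bits of fractional precision
FIXED_1 = 1 << FSHIFT    # 1.0 in fixed point (2048)
EXP_15 = 2037            # assumed decay factor for the 15 minute average


def calc_load(load, exp, active):
    """One sample update of the decaying average, in integer math."""
    load = load * exp + active * (FIXED_1 - exp)
    load += 1 << (FSHIFT - 1)   # assumed round-to-nearest step
    return load >> FSHIFT


def format_load(load):
    """Format as /proc/loadavg is assumed to: add 0.005, then truncate."""
    load += FIXED_1 // 200
    whole = load >> FSHIFT
    frac = ((load & (FIXED_1 - 1)) * 100) >> FSHIFT
    return f"{whole}.{frac:02d}"


def decay_to_floor(start, exp=EXP_15):
    """Feed zero active tasks until the stored value stops changing."""
    load = start
    while True:
        nxt = calc_load(load, exp, 0)
        if nxt == load:
            return load
        load = nxt


if __name__ == "__main__":
    floor = decay_to_floor(FIXED_1)   # start from a load of 1.00
    print(floor, format_load(floor))
```

Under these assumptions the stored value stops decaying once the rounding
step gives back as much as the multiplication by EXP_15 takes away, so a
completely idle system keeps showing a small non-zero 15 minute average.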

[1] http://www.smythies.com/~doug/network/load_average/background.html
[2] http://www.smythies.com/~doug/network/load_average/background_histograms.png

See also general related web notes at: http://www.smythies.com/~doug/network/load_average/index.html

Doug Smythies

Attachment: background_histograms.png
Description: PNG image