Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4

From: Mel Gorman
Date: Wed Feb 03 2016 - 09:56:36 EST

In reply to: Mel Gorman: "Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4"
Next in thread: Srikar Dronamraju: "Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4"

On Wed, Feb 03, 2016 at 01:32:46PM +0000, Mel Gorman wrote:
> > Yes, but the question is, are there true cross-CPU cache-misses? I.e. are there
> > any 'global' (or per node) counters that we keep touching and which keep
> > generating cache-misses?
> >
>
> I haven't specifically identified them as I consider the calculations for
> some of them to be expensive in their own right even without accounting for
> cache misses. Moving to per-cpu counters would not eliminate all cache misses
> as a stat updated on one CPU for a task that is woken on a separate CPU is
> still going to trigger a cache miss. Even if such counters were identified
> and moved to separate cache lines, the calculation overhead would remain.
>

I looked closer with perf stat to see if there was a good case for reducing
cache misses using per-cpu counters.

Workload was hackbench with pipes and twice as many processes as there
are CPUs to generate a reasonable amount of scheduler activity.

Kernel 4.5-rc2 vanilla
Performance counter stats for './hackbench -pipe 96 process 1000' (5 runs):

   54355.194747  task-clock (msec)        # 35.825 CPUs utilized         ( +- 0.72% ) (100.00%)
      6,654,707  context-switches         # 0.122 M/sec                  ( +- 1.56% ) (100.00%)
        376,624  cpu-migrations           # 0.007 M/sec                  ( +- 3.43% ) (100.00%)
        128,533  page-faults              # 0.002 M/sec                  ( +- 1.80% ) (100.00%)
111,173,775,559  cycles                   # 2.045 GHz                    ( +- 0.76% ) (52.55%)
<not supported>  stalled-cycles-frontend
<not supported>  stalled-cycles-backend
 87,243,428,243  instructions             # 0.78 insns per cycle         ( +- 0.38% ) (63.74%)
 17,067,078,003  branches                 # 313.992 M/sec                ( +- 0.39% ) (61.79%)
     65,864,607  branch-misses            # 0.39% of all branches        ( +- 2.10% ) (61.51%)
 26,873,984,605  L1-dcache-loads          # 494.414 M/sec                ( +- 0.45% ) (33.08%)
  1,531,628,468  L1-dcache-load-misses    # 5.70% of all L1-dcache hits  ( +- 1.14% ) (31.65%)
    410,990,209  LLC-loads                # 7.561 M/sec                  ( +- 1.08% ) (31.38%)
     38,279,473  LLC-load-misses          # 9.31% of all LL-cache hits   ( +- 6.82% ) (42.35%)

1.517251315 seconds time elapsed ( +- 1.55% )

Note that the actual cache miss ratio is quite low and indicates that
there is potentially little to gain from using per-cpu counters.
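
For anyone who wants to check the ratios, the following is a minimal
sketch that recomputes them from the raw counts reported above. The
helper name miss_ratio and the dictionary layout are only illustrative;
the numbers are copied from the vanilla run.

```python
# Counts copied from the vanilla 4.5-rc2 perf stat output above.
vanilla = {
    "L1-dcache-loads": 26_873_984_605,
    "L1-dcache-load-misses": 1_531_628_468,
    "LLC-loads": 410_990_209,
    "LLC-load-misses": 38_279_473,
}


def miss_ratio(counts, level):
    """Return load misses as a percentage of loads for one cache level.

    `level` is the event-name prefix, e.g. "L1-dcache" or "LLC".
    (Hypothetical helper, named here only for illustration.)
    """
    return 100.0 * counts[f"{level}-load-misses"] / counts[f"{level}-loads"]


l1_ratio = miss_ratio(vanilla, "L1-dcache")
llc_ratio = miss_ratio(vanilla, "LLC")

print(f"L1-dcache miss ratio: {l1_ratio:.2f}%")
print(f"LLC miss ratio:       {llc_ratio:.2f}%")
```

The results should agree with the percentages perf prints next to the
L1-dcache-load-misses and LLC-load-misses lines.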

Kernel 4.5-rc2 plus patch that disables schedstats by default

Performance counter stats for './hackbench -pipe 96 process 1000' (5 runs):

   51904.139186  task-clock (msec)        # 35.322 CPUs utilized         ( +- 2.07% ) (100.00%)
      5,958,009  context-switches         # 0.115 M/sec                  ( +- 5.90% ) (100.00%)
        327,235  cpu-migrations           # 0.006 M/sec                  ( +- 8.24% ) (100.00%)
        130,063  page-faults              # 0.003 M/sec                  ( +- 1.10% ) (100.00%)
104,926,877,727  cycles                   # 2.022 GHz                    ( +- 2.12% ) (52.08%)
<not supported>  stalled-cycles-frontend
<not supported>  stalled-cycles-backend
 83,768,167,895  instructions             # 0.80 insns per cycle         ( +- 1.25% ) (63.49%)
 16,379,438,730  branches                 # 315.571 M/sec                ( +- 1.47% ) (61.99%)
     59,841,332  branch-misses            # 0.37% of all branches        ( +- 4.60% ) (61.68%)
 25,749,569,276  L1-dcache-loads          # 496.099 M/sec                ( +- 1.37% ) (34.08%)
  1,385,090,233  L1-dcache-load-misses    # 5.38% of all L1-dcache hits  ( +- 3.40% ) (31.88%)
    358,531,172  LLC-loads                # 6.908 M/sec                  ( +- 4.65% ) (31.04%)
     33,476,691  LLC-load-misses          # 9.34% of all LL-cache hits   ( +- 4.95% ) (41.71%)

1.469447783 seconds time elapsed ( +- 2.23% )

Note that there is a reduction in the number of cache misses, but it is
not a major percentage. The L1 miss ratio drops only slightly (5.70% to
5.38%) and the LLC miss ratio is essentially unchanged (9.31% vs 9.34%)
in comparison to having stats enabled.
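
To put rough numbers on the difference, here is a small sketch that
computes the relative change between the two runs. It uses only the
counts from the two perf stat outputs above; the helper pct_change and
the dictionary names are illustrative. Keep the reported +- spread of
each counter in mind when reading the deltas.

```python
# Counts copied from the two perf stat outputs above.
vanilla = {
    "cycles": 111_173_775_559,
    "L1-dcache-load-misses": 1_531_628_468,
    "LLC-loads": 410_990_209,
    "LLC-load-misses": 38_279_473,
    "seconds elapsed": 1.517251315,
}
patched = {
    "cycles": 104_926_877_727,
    "L1-dcache-load-misses": 1_385_090_233,
    "LLC-loads": 358_531_172,
    "LLC-load-misses": 33_476_691,
    "seconds elapsed": 1.469447783,
}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent.

    Negative values mean the patched kernel recorded less.
    (Hypothetical helper, named here only for illustration.)
    """
    return 100.0 * (after - before) / before


changes = {name: pct_change(vanilla[name], patched[name]) for name in vanilla}

for name, delta in changes.items():
    print(f"{name:24s} {delta:+7.2f}%")
```

With these inputs the L1 and LLC load misses come out roughly 10% and
12.5% lower respectively, while cycles drop by a bit under 6%.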

A perf report shows a drop in cache references in functions like
ttwu_stat and [en|de]queue_entity, but it's a small percentage overall.
The same is true for the cycle count. The overall percentage is small,
but the patch eliminates that overhead.

Based on the low level of cache misses, I see no value in using per-cpu
counters as an alternative.

--
Mel Gorman
SUSE Labs
